* [email protected] <[email protected]> wrote:
>
> The patch titled
> page-owner tracking
> has been added to the -mm tree. Its filename is
> page-owner-tracking.patch
>
> Before you just go and hit "reply", please:
> a) Consider who else should be cc'ed
> b) Prefer to cc a suitable mailing list as well
> c) Ideally: find the original patch on the mailing list and do a
> reply-to-all to that, adding suitable additional cc's
>
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
>
> See http://userweb.kernel.org/~akpm/stuff/added-to-mm.txt to find
> out what to do about this
>
> The current -mm tree may be found at http://userweb.kernel.org/~akpm/mmotm/
>
> ------------------------------------------------------
> Subject: page-owner tracking
> From: Alexander Nyberg <[email protected]>
>
> > On Fri, 20 Mar 2009 15:27:00 +0000 Mel Gorman <[email protected]> wrote:
>
> Here is a rebased reversion. Appears to work as advertised based on a
> quick test with qemu and builds without CONFIG_PROC_PAGEOWNER
>
>
>
> Introduces CONFIG_PROC_PAGEOWNER that keeps track of the call chain
> under which a page was allocated. Includes a user-space helper in
> Documentation/page_owner.c to sort the enormous amount of output that this
> may give (thanks tridge).
>
> Information available through /proc/page_owner
>
> x86_64 introduces some stack noise in certain call chains, so for exact
> output the use of x86 && CONFIG_FRAME_POINTER is suggested. Tested on x86,
> x86 && CONFIG_FRAME_POINTER, and x86_64.
>
> Output looks like:
>
> 4819 times:
> Page allocated via order 0, mask 0x50
> [0xc012b7b9] find_lock_page+25
> [0xc012b8c8] find_or_create_page+152
> [0xc0147d74] grow_dev_page+36
> [0xc0148164] __find_get_block+84
> [0xc0147ebc] __getblk_slow+124
> [0xc0148164] __find_get_block+84
> [0xc01481e7] __getblk+55
> [0xc0185d14] do_readahead+100
>
> We use a custom stack unwinder because using __builtin_return_address([0-7])
> causes gcc to generate code that might try to unwind the stack looking for
> function return addresses and "fall off", causing early panics if the call
> chain is not deep enough. In that case we could only have had a depth of
> around 3 functions in all traces (I experimented a bit with this).
>
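> For illustration, the rejected approach would have looked roughly like the
> following (a sketch only, not code from this patch):
>
> 	page->trace[0] = (unsigned long)__builtin_return_address(0);
> 	page->trace[1] = (unsigned long)__builtin_return_address(1);
> 	/* ... and so on up to __builtin_return_address(7); every non-zero
> 	 * level may walk off the stack when the call chain is shallower. */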
>
> make page_owner handle non-contiguous page ranges
>
>
> I've cleaned up the __alloc_pages() part to a simple set_page_owner() call.
>
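> What set_page_owner() boils down to is roughly the following (an
> illustrative sketch, not the exact hunk):
>
> 	static void set_page_owner(struct page *page, unsigned int order,
> 				   gfp_t gfp_mask)
> 	{
> 		unsigned long address, bp = 0;
>
> 		/* bp is filled from the frame pointer where available;
> 		 * &address points into the current stack for the fallback scan */
> 		page->order = (int)order;
> 		page->gfp_mask = gfp_mask;
> 		__stack_trace(page, &address, bp);
> 	}
>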
> Signed-off-by: Alexander Nyberg <[email protected]>
> Signed-off-by: Dave Hansen <[email protected]>
> Signed-off-by: Kamezawa Hiroyuki <[email protected]>
> DESC
> Update page->order at an appropriate time when tracking PAGE_OWNER
> EDESC
>
> PAGE_OWNER tracks free pages by setting page->order to -1. However, it is
> set during __free_pages(), which is not the only free path, as
> __pagevec_free() and free_compound_page() do not go through __free_pages().
> This leads to a situation where free pages are visible in /proc/page_owner,
> which is confusing and might be interpreted as a memory leak.
>
> This patch sets page->order when PageBuddy is set. It also prints a
> warning to the kernel log if a free page is found that does not appear free
> to PAGE_OWNER. This should be considered a fix to
> page-owner-tracking-leak-detector.patch.
>
> This only applies to -mm as PAGE_OWNER is not in mainline.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: Andy Whitcroft <[email protected]>
> DESC
> Print out PAGE_OWNER statistics in relation to fragmentation avoidance
> EDESC
>
> When PAGE_OWNER is set, more information of relevance to fragmentation
> avoidance is available. A second line is added to /proc/page_owner showing
> the PFN, the pageblock number, the mobility type of the page based on its
> allocation flags, whether the allocation is improperly placed, and the flags.
> A sample entry looks like
>
> Page allocated via order 0, mask 0x1280d2
> PFN 7355 Block 7 type 3 Fallback Flags LA
> [0xc01528c6] __handle_mm_fault+598
> [0xc0320427] do_page_fault+279
> [0xc031ed9a] error_code+114
>
> This information can be used to identify pages that are improperly placed.
> As the format of PAGE_OWNER data is now different, the comment at the top
> of Documentation/page_owner.c is updated with new instructions.
>
> As PAGE_OWNER tracks the GFP flags used to allocate the pages,
> /proc/pagetypeinfo is enhanced to show how many mixed blocks exist.
> The additional output looks like
>
> Number of mixed blocks   Unmovable  Reclaimable  Movable  Reserve
> Node 0, zone      DMA            0            1        2        1
> Node 0, zone   Normal            2           11       33        0
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: Andy Whitcroft <[email protected]>
> Acked-by: Christoph Lameter <[email protected]>
> DESC
> Allow PAGE_OWNER to be set on any architecture
> EDESC
>
> Currently PAGE_OWNER depends on CONFIG_X86. This appears to be due to
> pfn_to_page() being called in a manner inappropriate for many memory models
> and to the presence of memory holes. This patch ensures that pfn_valid()
> and pfn_valid_within() are called at the appropriate places and the offsets
> correctly updated so that PAGE_OWNER is safe on any architecture.
>
> In situations where CONFIG_HOLES_IN_ZONE is set (IA64 with VIRTUAL_MEM_MAP),
> there may be cases where pages allocated within a MAX_ORDER_NR_PAGES block
> of pages are not displayed in /proc/page_owner if the hole is at the
> start of the block. Addressing this would be quite complex, would perform
> slowly and would be of no clear benefit.
>
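> The guarded walk being described looks roughly like this (an illustrative
> sketch, not the patch itself):
>
> 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> 		/* skip holes between memory sections */
> 		if (!pfn_valid(pfn))
> 			continue;
> 		/* skip holes inside a zone when CONFIG_HOLES_IN_ZONE is set */
> 		if (!pfn_valid_within(pfn))
> 			continue;
> 		page = pfn_to_page(pfn);
> 		/* examine page->order, page->gfp_mask and page->trace here */
> 	}
>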
> Once PAGE_OWNER is allowed on all architectures, the statistics for grouping
> pages by mobility that declare how many pageblocks contain mixed page types
> become optionally available on all arches.
>
> This patch was tested successfully on x86, x86_64, ppc64 and IA64 machines.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: Andy Whitcroft <[email protected]>
> DESC
> allow-page_owner-to-be-set-on-any-architecture-fix
> EDESC
>
> Cc: Andy Whitcroft <[email protected]>
> Cc: Mel Gorman <[email protected]>
> DESC
> allow-page_owner-to-be-set-on-any-architecture-fix fix
> EDESC
>
> Page-owner-tracking stores a backtrace of an allocation in the
> struct page. How the stack trace is generated depends on whether
> CONFIG_FRAME_POINTER is set or not. If CONFIG_FRAME_POINTER is set,
> the frame pointer must be read using some inline assembler which is not
> available for all architectures.
>
> This patch uses the frame pointer where it is available but has a fallback
> where it is not.
>
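> For illustration, capturing the frame pointer on x86 amounts to something
> like this (a sketch only; architectures without the inline assembler leave
> bp at 0 and rely on the raw stack scan in __stack_trace()):
>
> 	unsigned long bp = 0;
> #if defined(CONFIG_FRAME_POINTER) && defined(CONFIG_X86_64)
> 	asm volatile("movq %%rbp, %0" : "=r" (bp));
> #elif defined(CONFIG_FRAME_POINTER) && defined(CONFIG_X86_32)
> 	asm volatile("movl %%ebp, %0" : "=r" (bp));
> #endif
>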
> Signed-off-by: Mel Gorman <[email protected]>
> Cc: Andy Whitcroft <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
>
>
> Rebase on top of procfs changes
>
> Signed-off-by: Mel Gorman <[email protected]>
>
> new file mode 100644
> index 0000000..9081bd6
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> Documentation/page_owner.c | 144 +++++++++++++++++++++++++++++++++++
> fs/proc/Makefile | 1
> include/linux/mm_types.h | 6 +
> lib/Kconfig.debug | 10 ++
> mm/page_alloc.c | 66 ++++++++++++++++
> mm/vmstat.c | 93 ++++++++++++++++++++++
> 6 files changed, 320 insertions(+)
>
> diff -puN /dev/null Documentation/page_owner.c
> --- /dev/null
> +++ a/Documentation/page_owner.c
> @@ -0,0 +1,144 @@
> +/*
> + * User-space helper to sort the output of /proc/page_owner
> + *
> + * Example use:
> + * cat /proc/page_owner > page_owner_full.txt
> + * grep -v ^PFN page_owner_full.txt > page_owner.txt
> + * ./sort page_owner.txt sorted_page_owner.txt
> +*/
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <unistd.h>
> +#include <string.h>
> +
> +struct block_list {
> + char *txt;
> + int len;
> + int num;
> +};
> +
> +
> +static struct block_list *list;
> +static int list_size;
> +static int max_size;
> +
> +struct block_list *block_head;
> +
> +int read_block(char *buf, FILE *fin)
> +{
> + int ret = 0;
> + int hit = 0;
> + char *curr = buf;
> +
> + for (;;) {
> + *curr = getc(fin);
> + if (*curr == EOF) return -1;
> +
> + ret++;
> + if (*curr == '\n' && hit == 1)
> + return ret - 1;
> + else if (*curr == '\n')
> + hit = 1;
> + else
> + hit = 0;
> + curr++;
> + }
> +}
> +
> +static int compare_txt(const void *d1, const void *d2)
> +{
> + struct block_list *l1 = (struct block_list *)d1;
> + struct block_list *l2 = (struct block_list *)d2;
> + return strcmp(l1->txt, l2->txt);
> +}
> +
> +static int compare_num(const void *d1, const void *d2)
> +{
> + struct block_list *l1 = (struct block_list *)d1;
> + struct block_list *l2 = (struct block_list *)d2;
> + return l2->num - l1->num;
> +}
> +
> +static void add_list(char *buf, int len)
> +{
> + if (list_size != 0 &&
> + len == list[list_size-1].len &&
> + memcmp(buf, list[list_size-1].txt, len) == 0) {
> + list[list_size-1].num++;
> + return;
> + }
> + if (list_size == max_size) {
> + printf("max_size too small??\n");
> + exit(1);
> + }
> + list[list_size].txt = malloc(len+1);
> + list[list_size].len = len;
> + list[list_size].num = 1;
> + memcpy(list[list_size].txt, buf, len);
> + list[list_size].txt[len] = 0;
> + list_size++;
> + if (list_size % 1000 == 0) {
> + printf("loaded %d\r", list_size);
> + fflush(stdout);
> + }
> +}
> +
> +int main(int argc, char **argv)
> +{
> + FILE *fin, *fout;
> + char buf[1024];
> + int ret, i, count;
> + struct block_list *list2;
> + struct stat st;
> +
> + fin = fopen(argv[1], "r");
> + fout = fopen(argv[2], "w");
> + if (!fin || !fout) {
> + printf("Usage: ./program <input> <output>\n");
> + perror("open: ");
> + exit(2);
> + }
> +
> + fstat(fileno(fin), &st);
> + max_size = st.st_size / 100; /* hack ... */
> +
> + list = malloc(max_size * sizeof(*list));
> +
> + for(;;) {
> + ret = read_block(buf, fin);
> + if (ret < 0)
> + break;
> +
> + buf[ret] = '\0';
> + add_list(buf, ret);
> + }
> +
> + printf("loaded %d\n", list_size);
> +
> + printf("sorting ....\n");
> +
> + qsort(list, list_size, sizeof(list[0]), compare_txt);
> +
> + list2 = malloc(sizeof(*list) * list_size);
> +
> + printf("culling\n");
> +
> + for (i=count=0;i<list_size;i++) {
> + if (count == 0 ||
> + strcmp(list2[count-1].txt, list[i].txt) != 0)
> + list2[count++] = list[i];
> + else
> + list2[count-1].num += list[i].num;
> + }
> +
> + qsort(list2, count, sizeof(list[0]), compare_num);
> +
> + for (i=0;i<count;i++) {
> + fprintf(fout, "%d times:\n%s\n", list2[i].num, list2[i].txt);
> + }
> + return 0;
> +}
> diff -puN fs/proc/Makefile~page-owner-tracking fs/proc/Makefile
> --- a/fs/proc/Makefile~page-owner-tracking
> +++ a/fs/proc/Makefile
> @@ -24,6 +24,7 @@ proc-$(CONFIG_PROC_SYSCTL) += proc_sysct
> proc-$(CONFIG_NET) += proc_net.o
> proc-$(CONFIG_PROC_KCORE) += kcore.o
> proc-$(CONFIG_PROC_VMCORE) += vmcore.o
> +proc-$(CONFIG_PROC_PAGEOWNER) += pageowner.o
> proc-$(CONFIG_PROC_DEVICETREE) += proc_devtree.o
> proc-$(CONFIG_PRINTK) += kmsg.o
> proc-$(CONFIG_PROC_PAGE_MONITOR) += page.o
> diff -puN include/linux/mm_types.h~page-owner-tracking include/linux/mm_types.h
> --- a/include/linux/mm_types.h~page-owner-tracking
> +++ a/include/linux/mm_types.h
> @@ -95,6 +95,12 @@ struct page {
> void *virtual; /* Kernel virtual address (NULL if
> not kmapped, ie. highmem) */
> #endif /* WANT_PAGE_VIRTUAL */
> +
> +#ifdef CONFIG_PROC_PAGEOWNER
> + int order;
> + unsigned int gfp_mask;
> + unsigned long trace[8];
> +#endif
> };
>
> /*
> diff -puN lib/Kconfig.debug~page-owner-tracking lib/Kconfig.debug
> --- a/lib/Kconfig.debug~page-owner-tracking
> +++ a/lib/Kconfig.debug
> @@ -66,6 +66,16 @@ config UNUSED_SYMBOLS
> you really need it, and what the merge plan to the mainline kernel for
> your module is.
>
> +config PROC_PAGEOWNER
> + bool "Track page owner"
> + depends on DEBUG_KERNEL
> + help
> + This keeps track of what call chain is the owner of a page, may
> + help to find bare alloc_page(s) leaks. Eats a fair amount of memory.
> + See Documentation/page_owner.c for user-space helper.
> +
> + If unsure, say N.
> +
> config DEBUG_FS
> bool "Debug Filesystem"
> depends on SYSFS
> diff -puN mm/page_alloc.c~page-owner-tracking mm/page_alloc.c
> --- a/mm/page_alloc.c~page-owner-tracking
> +++ a/mm/page_alloc.c
> @@ -359,6 +359,9 @@ static inline void set_page_order(struct
> {
> set_page_private(page, order);
> __SetPageBuddy(page);
> +#ifdef CONFIG_PROC_PAGEOWNER
> + page->order = -1;
> +#endif
> }
>
> static inline void rmv_page_order(struct page *page)
> @@ -1458,6 +1461,62 @@ try_next_zone:
> return page;
> }
>
> +#ifdef CONFIG_PROC_PAGEOWNER
> +static inline int valid_stack_ptr(struct thread_info *tinfo, void *p)
> +{
> + return p > (void *)tinfo &&
> + p < (void *)tinfo + THREAD_SIZE - 3;
> +}
> +
> +static inline void __stack_trace(struct page *page, unsigned long *stack,
> + unsigned long bp)
> +{
> + int i = 0;
> + unsigned long addr;
> + struct thread_info *tinfo = (struct thread_info *)
> + ((unsigned long)stack & (~(THREAD_SIZE - 1)));
> +
> + memset(page->trace, 0, sizeof(long) * 8);
> +
> +#ifdef CONFIG_FRAME_POINTER
> + if (bp) {
> + while (valid_stack_ptr(tinfo, (void *)bp)) {
> + addr = *(unsigned long *)(bp + sizeof(long));
> + page->trace[i] = addr;
> + if (++i >= 8)
> + break;
> + bp = *(unsigned long *)bp;
> + }
> + return;
> + }
> +#endif /* CONFIG_FRAME_POINTER */
> + while (valid_stack_ptr(tinfo, stack)) {
> + addr = *stack++;
> + if (__kernel_text_address(addr)) {
> + page->trace[i] = addr;
> + if (++i >= 8)
> + break;
> + }
> + }
> +}
Uhm, this is not acceptable and broken, we have generic stacktrace
saving facilities for this precise purpose. It has other problems
too.
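For reference, the generic facility is save_stack_trace() from
<linux/stacktrace.h>; a minimal sketch of filling the same page->trace[8]
array with it (illustrative only):
	struct stack_trace trace = {
		.max_entries	= 8,
		.entries	= page->trace,
	};
	save_stack_trace(&trace);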
The purpose of the patch seems genuinely useful and i support the
concept, but the current design is limiting (we could do much
better) and the implementation is awful. Please.
Has this patch been reviewed by or Cc:-ed to anyone versed in kernel
instrumentation details, before it was applied to -mm? Those folks
hang around the tracing tree usually so they are easy to find. :)
Ingo
On Wed, Apr 01, 2009 at 01:15:40PM +0200, Ingo Molnar wrote:
> > <VAST AMOUNTS OF SNIPPAGE>
> >
> > +static inline void __stack_trace(struct page *page, unsigned long *stack,
> > + unsigned long bp)
> > +{
> > + int i = 0;
> > + unsigned long addr;
> > + struct thread_info *tinfo = (struct thread_info *)
> > + ((unsigned long)stack & (~(THREAD_SIZE - 1)));
> > +
> > + memset(page->trace, 0, sizeof(long) * 8);
> > +
> > +#ifdef CONFIG_FRAME_POINTER
> > + if (bp) {
> > + while (valid_stack_ptr(tinfo, (void *)bp)) {
> > + addr = *(unsigned long *)(bp + sizeof(long));
> > + page->trace[i] = addr;
> > + if (++i >= 8)
> > + break;
> > + bp = *(unsigned long *)bp;
> > + }
> > + return;
> > + }
> > +#endif /* CONFIG_FRAME_POINTER */
> > + while (valid_stack_ptr(tinfo, stack)) {
> > + addr = *stack++;
> > + if (__kernel_text_address(addr)) {
> > + page->trace[i] = addr;
> > + if (++i >= 8)
> > + break;
> > + }
> > + }
> > +}
>
> Uhm, this is not acceptable and broken, we have generic stacktrace
> saving facilities for this precise purpose. It has other problems
> too.
>
> The purpose of the patch seems genuinely useful and i support the
> concept, but the current design is limiting (we could do much
> better) and the implementation is awful. Please.
>
> Has this patch been reviewed by or Cc:-ed to anyone versed in kernel
> instrumentation details, before it was applied to -mm? Those folks
> hang around the tracing tree usually so they are easy to find. :)
>
This patch is ancient, predating most of the instrumentation stuff by years. It
was dropped from -mm a few months ago because of changes in proc and this is
a rebase because it came up as being potentially useful for pinning down memory
leaks when they occur.
I'm not sure when exactly it was introduced to -mm, but I see references
going back as far as 2.6.12-rc1 so it's no surprise this is now extremely
odd looking. However, there is no plan to merge this to mainline, making the
effort of redoing it from scratch a questionable expenditure of time.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Mel Gorman <[email protected]> wrote:
> On Wed, Apr 01, 2009 at 01:15:40PM +0200, Ingo Molnar wrote:
> > > <VAST AMOUNTS OF SNIPPAGE>
> > >
> > > +static inline void __stack_trace(struct page *page, unsigned long *stack,
> > > + unsigned long bp)
> > > +{
> > > + int i = 0;
> > > + unsigned long addr;
> > > + struct thread_info *tinfo = (struct thread_info *)
> > > + ((unsigned long)stack & (~(THREAD_SIZE - 1)));
> > > +
> > > + memset(page->trace, 0, sizeof(long) * 8);
> > > +
> > > +#ifdef CONFIG_FRAME_POINTER
> > > + if (bp) {
> > > + while (valid_stack_ptr(tinfo, (void *)bp)) {
> > > + addr = *(unsigned long *)(bp + sizeof(long));
> > > + page->trace[i] = addr;
> > > + if (++i >= 8)
> > > + break;
> > > + bp = *(unsigned long *)bp;
> > > + }
> > > + return;
> > > + }
> > > +#endif /* CONFIG_FRAME_POINTER */
> > > + while (valid_stack_ptr(tinfo, stack)) {
> > > + addr = *stack++;
> > > + if (__kernel_text_address(addr)) {
> > > + page->trace[i] = addr;
> > > + if (++i >= 8)
> > > + break;
> > > + }
> > > + }
> > > +}
> >
> > Uhm, this is not acceptable and broken, we have generic stacktrace
> > saving facilities for this precise purpose. It has other problems
> > too.
> >
> > The purpose of the patch seems genuinely useful and i support the
> > concept, but the current design is limiting (we could do much
> > better) and the implementation is awful. Please.
> >
> > Has this patch been reviewed by or Cc:-ed to anyone versed in kernel
> > instrumentation details, before it was applied to -mm? Those folks
> > hang around the tracing tree usually so they are easy to find. :)
> >
>
> This patch is ancient, predating most of the instrumentation stuff
> by years. It was dropped from -mm a few months ago because of
> changes in proc and this is a rebase because it came up as being
> potentially useful for pinning down memory leaks when they occur.
>
> I'm not sure when exactly it was introduced to -mm, but I see
> references going back as far as 2.6.12-rc1 so it's no surprise
> this is now extremely odd looking. However, there is no plan to
> merge this to mainline, making the effort of redoing it from
> scratch a questionable expenditure of time.
Hm, why not merge the concept upstream?
I'm sure the kmemtrace maintainers (Eduardo and Pekka) would be
interested in this too: they are already tracking slab allocation
events in a very finegrained way, extending that scheme to the page
allocator seems genuinely useful to me.
And that way the rather ugly bloating of struct page by this patch
would be solved too: it's not needed and there's no need to actually
save the callchain permanently, it's usually enough to _trace_ it -
user-space can then save it into a permanent map if it's interested.
So this feature could go into Linux distros, available by default
and integrated into the user-space side of kmemtrace.
( You could trace the gfp-type like kmemtrace does already - that
way user-space can do GFP_MOVABLE / GFP_RECLAIMABLE summary counts
(and more). )
So hooking up this info into tracepoints and ftrace seems genuinely
useful to me. You could trace the whole system, or you could use
filter expressions to do things like:
echo "comm == Xorg" > /debug/tracing/events/mm/page_alloc/filter
To only track the memory allocations of Xorg task(s) for example.
Later on a tracepoint based sw event could be made out of it too,
hooked up to sw-perfcounters, to dynamically profile page usage by
owning function(s) and task(s).
Later on we could add more info to the tracepoints - for example the
d_path() of the vma that is attached to the page could be traced (if
it's not an anonymous vma and not a kernel-internal allocation).
etc.
Ingo
On Wed, Apr 01, 2009 at 03:17:13PM +0200, Ingo Molnar wrote:
>
> * Mel Gorman <[email protected]> wrote:
>
> > On Wed, Apr 01, 2009 at 01:15:40PM +0200, Ingo Molnar wrote:
> > > > <VAST AMOUNTS OF SNIPPAGE>
> > > >
> > > > +static inline void __stack_trace(struct page *page, unsigned long *stack,
> > > > + unsigned long bp)
> > > > +{
> > > > + int i = 0;
> > > > + unsigned long addr;
> > > > + struct thread_info *tinfo = (struct thread_info *)
> > > > + ((unsigned long)stack & (~(THREAD_SIZE - 1)));
> > > > +
> > > > + memset(page->trace, 0, sizeof(long) * 8);
> > > > +
> > > > +#ifdef CONFIG_FRAME_POINTER
> > > > + if (bp) {
> > > > + while (valid_stack_ptr(tinfo, (void *)bp)) {
> > > > + addr = *(unsigned long *)(bp + sizeof(long));
> > > > + page->trace[i] = addr;
> > > > + if (++i >= 8)
> > > > + break;
> > > > + bp = *(unsigned long *)bp;
> > > > + }
> > > > + return;
> > > > + }
> > > > +#endif /* CONFIG_FRAME_POINTER */
> > > > + while (valid_stack_ptr(tinfo, stack)) {
> > > > + addr = *stack++;
> > > > + if (__kernel_text_address(addr)) {
> > > > + page->trace[i] = addr;
> > > > + if (++i >= 8)
> > > > + break;
> > > > + }
> > > > + }
> > > > +}
> > >
> > > Uhm, this is not acceptable and broken, we have generic stacktrace
> > > saving facilities for this precise purpose. It has other problems
> > > too.
> > >
> > > The purpose of the patch seems genuinely useful and i support the
> > > concept, but the current design is limiting (we could do much
> > > better) and the implementation is awful. Please.
> > >
> > > Has this patch been reviewed by or Cc:-ed to anyone versed in kernel
> > > instrumentation details, before it was applied to -mm? Those folks
> > > hang around the tracing tree usually so they are easy to find. :)
> > >
> >
> > This patch is ancient, predating most of the instrumentation stuff
> > by years. It was dropped from -mm a few months ago because of
> > changes in proc and this is a rebase because it came up as being
> > potentially useful for pinning down memory leaks when they occur.
> >
> > I'm not sure when exactly it was introduced to -mm, but I see
> > references going back as far as 2.6.12-rc1 so it's no surprise
> > this is now extremely odd looking. However, there is no plan to
> > merge this to mainline, making the effort of redoing it from
> > scratch a questionable expenditure of time.
>
> Hm, why not merge the concept upstream?
>
I suspect at the time the patch was put together, it was not merged upstream
because it severely bloated struct page and would be something that was never
enabled by default in distros. Anyone who wanted it for debugging could easily
apply the patch. It's a decision that simply has never been revisited as
it's not used that often - it's just seriously useful when you do need it.
I'm guessing though, I wasn't active in kernel development at the time
and I haven't dug through the archives to see the history.
> I'm sure the kmemtrace maintainers (Eduardo and Pekka) would be
> interested in this too: they are already tracking slab allocation
> events in a very finegrained way, extending that scheme to the page
> allocator seems genuinely useful to me.
>
In light of kmemtrace's success, it does make sense to revisit if someone
has the cycles to spare. Again bear in mind that this owner patch was in
-mm long before kmemtrace came along so it was a good idea at the time whose
implementation has not quite stood the test of time.
> And that way the rather ugly bloating of struct page by this patch
> would be solved too: it's not needed and there's no need to actually
> save the callchain permanently, it's usually enough to _trace_ it -
> user-space can then save it into a permanent map if it's interested.
>
It's picky, but you lose details from boot-time allocations that way;
that's a relatively small detail. If it's leaks you really care about though,
then tracing is probably sufficient.
> So this feature could go into Linux distros, available by default
> and integrated into the user-space side of kmemtrace.
>
> ( You could trace the gfp-type like kmemtrace does already - that
> way user-space can do GFP_MOVABLE / GFP_RECLAIMABLE summary counts
> (and more). )
>
I concede that there is overlap, but GFP_MOVABLE and GFP_RECLAIMABLE tracking
would be useful, I have to say. In the past, I modified this patch to track
that information so I could figure out what was going on in the system. It'd
be nice to do something similar in the field.
> So hooking up this info into tracepoints and ftrace seems genuinely
> useful to me. You could trace the whole system, or you could use
> filter expressions to do things like:
>
> echo "comm == Xorg" > /debug/tracing/events/mm/page_alloc/filter
>
> To only track the memory allocations of Xorg task(s) for example.
>
> Later on a tracepoint based sw event could be made out of it too,
> hooked up to sw-perfcounters, to dynamically profile page usage by
> owning function(s) and task(s).
>
> Later on we could add more info to the tracepoints - for example the
> d_path() of the vma that is attached to the page could be traced (if
> it's not an anonymous vma and not a kernel-internal allocation).
>
> etc.
>
Ok, I fully concede that this would all be great to have and I've taken
note to investigate it at some time in the future, but it won't be any
time soon I'm afraid. Maybe this thread will prod someone familiar with
tracing with some time to spare to take a look at what that provides
and reimplement this in a sensible manner as you suggest?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Mel Gorman <[email protected]> wrote:
> On Wed, Apr 01, 2009 at 03:17:13PM +0200, Ingo Molnar wrote:
> >
> > * Mel Gorman <[email protected]> wrote:
> >
> > > On Wed, Apr 01, 2009 at 01:15:40PM +0200, Ingo Molnar wrote:
> > > > > <VAST AMOUNTS OF SNIPPAGE>
> > > > >
> > > > > +static inline void __stack_trace(struct page *page, unsigned long *stack,
> > > > > + unsigned long bp)
> > > > > +{
> > > > > + int i = 0;
> > > > > + unsigned long addr;
> > > > > + struct thread_info *tinfo = (struct thread_info *)
> > > > > + ((unsigned long)stack & (~(THREAD_SIZE - 1)));
> > > > > +
> > > > > + memset(page->trace, 0, sizeof(long) * 8);
> > > > > +
> > > > > +#ifdef CONFIG_FRAME_POINTER
> > > > > + if (bp) {
> > > > > + while (valid_stack_ptr(tinfo, (void *)bp)) {
> > > > > + addr = *(unsigned long *)(bp + sizeof(long));
> > > > > + page->trace[i] = addr;
> > > > > + if (++i >= 8)
> > > > > + break;
> > > > > + bp = *(unsigned long *)bp;
> > > > > + }
> > > > > + return;
> > > > > + }
> > > > > +#endif /* CONFIG_FRAME_POINTER */
> > > > > + while (valid_stack_ptr(tinfo, stack)) {
> > > > > + addr = *stack++;
> > > > > + if (__kernel_text_address(addr)) {
> > > > > + page->trace[i] = addr;
> > > > > + if (++i >= 8)
> > > > > + break;
> > > > > + }
> > > > > + }
> > > > > +}
> > > >
> > > > Uhm, this is not acceptable and broken, we have generic stacktrace
> > > > saving facilities for this precise purpose. It has other problems
> > > > too.
> > > >
> > > > The purpose of the patch seems genuinely useful and i support the
> > > > concept, but the current design is limiting (we could do much
> > > > better) and the implementation is awful. Please.
> > > >
> > > > Has this patch been reviewed by or Cc:-ed to anyone versed in kernel
> > > > instrumentation details, before it was applied to -mm? Those folks
> > > > hang around the tracing tree usually so they are easy to find. :)
> > > >
> > >
> > > This patch is ancient, predating most of the instrumentation stuff
> > > by years. It was dropped from -mm a few months ago because of
> > > changes in proc and this is a rebase because it came up as being
> > > potentially useful for pinning down memory leaks when they occur.
> > >
> > > I'm not sure when exactly it was introduced to -mm, but I see
> > > references going back as far as 2.6.12-rc1 so it's no surprise
> > > this is now extremely odd looking. However, there is no plan to
> > > merge this to mainline, making the effort of redoing it from
> > > scratch a questionable expenditure of time.
> >
> > Hm, why not merge the concept upstream?
> >
>
> I suspect at the time the patch was put together, it was not
> merged upstream because it severely bloated struct page and would
> be something that was never enabled by default in distros. Anyone who
> wanted it for debugging could easily apply the patch. It's a
> decision that simply has never been revisited as it's not used
> that often - it's just seriously useful when you do need it.
>
> I'm guessing though, I wasn't active in kernel development at the
> time and I haven't dug through the archives to see the history.
it sounds plausible. The reason i replied is that i saw this patch
pop up freshly with a lot of MM signoffs. There's been a few weird
instrumentation patches from -mm to mainline lately so i'm more
cautious ...
A [not-for-upstream] tag would be useful for such cases.
> > I'm sure the kmemtrace maintainers (Eduardo and Pekka) would be
> > interested in this too: they are already tracking slab
> > allocation events in a very finegrained way, extending that
> > scheme to the page allocator seems genuinely useful to me.
>
> In light of kmemtrace's success, it does make sense to revisit if
> someone has the cycles to spare. Again bear in mind that this
> owner patch was in -mm long before kmemtrace came along so it was
> a good idea at the time whose implementation has not quite stood
> the test of time.
it popped up new. And i commented on it a few months ago.
> > And that way the rather ugly bloating of struct page by this
> > patch would be solved too: it's not needed and there's no need
> > to actually save the callchain permanently, it's usually enough
> > to _trace_ it - user-space can then save it into a permanent map
> > if it's interested.
>
> It's picky, but you lose details from boot-time allocations that
> way; that's a relatively small detail. If it's leaks you really
> care about though, then tracing is probably sufficient.
You can build a full map from scratch as well via:
echo 3 > /proc/sys/vm/drop_caches
Anything that is not flushed out by that is probably not that
interesting from a tracking perspective.
> [...] Maybe this thread will prod someone familiar with tracing
> with some time to spare to take a look at what that provides
> and reimplement this in a sensible manner as you suggest?
Yeah - i've added tracing Cc:s.
There's a proposed set of MM tracepoints by Jason Baron. Not sure
how far it got in terms of getting Ack's and Reviewed-by's from MM
folks though.
And this info could be added to that, and it would sure be nice to
hook it up to kmemtrace primarily, which does a lot of similar
looking work in the slab space. (but Eduard and Pekka will know how
feasible/interesting this is to them.)
Ingo
On Wed, 2009-04-01 at 15:49 +0200, Ingo Molnar wrote:
> And this info could be added to that, and it would sure be nice to
> hook it up to kmemtrace primarily, which does a lot of similar
> looking work in the slab space. (but Eduard and Pekka will know how
> feasible/interesting this is to them.)
Yup, makes sense to me. Something like this is probably a good starting
point for a proper patch.
Pekka
diff --git a/include/trace/kmemtrace.h b/include/trace/kmemtrace.h
index 28ee69f..c71f28b 100644
--- a/include/trace/kmemtrace.h
+++ b/include/trace/kmemtrace.h
@@ -20,6 +20,10 @@ static inline void kmemtrace_init(void)
}
#endif
+DECLARE_TRACE(alloc_pages,
+ TP_PROTO(gfp_t gfp_mask, int order),
+ TP_ARGS(gfp_mask, order));
+
DECLARE_TRACE(kmalloc,
TP_PROTO(unsigned long call_site,
const void *ptr,
diff --git a/kernel/trace/kmemtrace.c b/kernel/trace/kmemtrace.c
index 5011f4d..bd5bba5 100644
--- a/kernel/trace/kmemtrace.c
+++ b/kernel/trace/kmemtrace.c
@@ -34,6 +34,28 @@ static struct tracer_flags kmem_tracer_flags = {
static struct trace_array *kmemtrace_array;
/* Trace allocations */
+static void kmemtrace_alloc_pages(gfp_t gfp_mask, int order)
+{
+ struct trace_array *tr = kmemtrace_array;
+ struct kmemtrace_page_alloc_entry *entry;
+ struct ring_buffer_event *event;
+
+ event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry));
+ if (!event)
+ return;
+ entry = ring_buffer_event_data(event);
+ tracing_generic_entry_update(&entry->ent, 0, 0);
+
+ entry->ent.type = TRACE_KMEM_PAGE_ALLOC;
+ entry->type_id = KMEMTRACE_TYPE_PAGES,
+ entry->gfp_mask = gfp_mask;
+ entry->order = order;
+
+ ring_buffer_unlock_commit(tr->buffer, event);
+
+ trace_wake_up();
+}
+
static inline void kmemtrace_alloc(enum kmemtrace_type_id type_id,
unsigned long call_site,
const void *ptr,
@@ -147,6 +169,10 @@ static int kmemtrace_start_probes(void)
{
int err;
+ err = register_trace_alloc_pages(kmemtrace_alloc_pages);
+ if (err)
+ return err;
+
err = register_trace_kmalloc(kmemtrace_kmalloc);
if (err)
return err;
@@ -214,8 +240,9 @@ static void kmemtrace_headers(struct seq_file *s)
* plus the origin CPU, since reordering occurs in-kernel now.
*/
-#define KMEMTRACE_USER_ALLOC 0
-#define KMEMTRACE_USER_FREE 1
+#define KMEMTRACE_USER_ALLOC 0
+#define KMEMTRACE_USER_FREE 1
+#define KMEMTRACE_USER_PAGE_ALLOC 2
struct kmemtrace_user_event {
u8 event_id;
@@ -227,6 +254,11 @@ struct kmemtrace_user_event {
unsigned long ptr;
};
+struct kmemtrace_user_event_page_alloc {
+ unsigned gfp_mask;
+ int order;
+};
+
struct kmemtrace_user_event_alloc {
size_t bytes_req;
size_t bytes_alloc;
@@ -235,6 +267,36 @@ struct kmemtrace_user_event_alloc {
};
static enum print_line_t
+kmemtrace_print_page_alloc_user(struct trace_iterator *iter,
+ struct kmemtrace_page_alloc_entry *entry)
+{
+ struct kmemtrace_user_event_page_alloc *ev_alloc;
+ struct trace_seq *s = &iter->seq;
+ struct kmemtrace_user_event *ev;
+
+ ev = trace_seq_reserve(s, sizeof(*ev));
+ if (!ev)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ ev->event_id = KMEMTRACE_USER_PAGE_ALLOC;
+ ev->type_id = entry->type_id;
+ ev->event_size = sizeof(*ev) + sizeof(*ev_alloc);
+ ev->cpu = iter->cpu;
+ ev->timestamp = iter->ts;
+ ev->call_site = 0ULL; /* FIXME */
+ ev->ptr = 0ULL; /* FIXME */
+
+ ev_alloc = trace_seq_reserve(s, sizeof(*ev_alloc));
+ if (!ev_alloc)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ ev_alloc->gfp_mask = entry->gfp_mask;
+ ev_alloc->order = entry->order;
+
+ return TRACE_TYPE_HANDLED;
+}
+
+static enum print_line_t
kmemtrace_print_alloc_user(struct trace_iterator *iter,
struct kmemtrace_alloc_entry *entry)
{
@@ -288,7 +350,49 @@ kmemtrace_print_free_user(struct trace_iterator *iter,
return TRACE_TYPE_HANDLED;
}
-/* The two other following provide a more minimalistic output */
+/* The three other following provide a more minimalistic output */
+static enum print_line_t
+kmemtrace_print_page_alloc_compress(struct trace_iterator *iter,
+ struct kmemtrace_page_alloc_entry *entry)
+{
+ struct trace_seq *s = &iter->seq;
+ int ret;
+
+ /* Alloc entry */
+ ret = trace_seq_printf(s, " + ");
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ /* Type */
+ switch (entry->type_id) {
+ case KMEMTRACE_TYPE_PAGES:
+ ret = trace_seq_printf(s, "P ");
+ break;
+ default:
+ ret = trace_seq_printf(s, "? ");
+ }
+
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ /* Flags
+ * TODO: would be better to see the name of the GFP flag names
+ */
+ ret = trace_seq_printf(s, "%08x ", entry->gfp_mask);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ /* Node */
+ ret = trace_seq_printf(s, "%4d ", entry->order);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ if (!trace_seq_printf(s, "\n"))
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ return TRACE_TYPE_HANDLED;
+}
+
static enum print_line_t
kmemtrace_print_alloc_compress(struct trace_iterator *iter,
struct kmemtrace_alloc_entry *entry)
@@ -418,6 +522,16 @@ static enum print_line_t kmemtrace_print_line(struct trace_iterator *iter)
struct trace_entry *entry = iter->ent;
switch (entry->type) {
+ case TRACE_KMEM_PAGE_ALLOC: {
+ struct kmemtrace_page_alloc_entry *field;
+
+ trace_assign_type(field, entry);
+ if (kmem_tracer_flags.val & TRACE_KMEM_OPT_MINIMAL)
+ return kmemtrace_print_page_alloc_compress(iter, field);
+ else
+ return kmemtrace_print_page_alloc_user(iter, field);
+ }
+
case TRACE_KMEM_ALLOC: {
struct kmemtrace_alloc_entry *field;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index cbc168f..3b1bfa5 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -35,6 +35,7 @@ enum trace_type {
TRACE_SYSCALL_EXIT,
TRACE_KMEM_ALLOC,
TRACE_KMEM_FREE,
+ TRACE_KMEM_PAGE_ALLOC,
TRACE_POWER,
TRACE_BLK,
@@ -188,6 +189,13 @@ enum kmemtrace_type_id {
KMEMTRACE_TYPE_PAGES, /* __get_free_pages() and friends. */
};
+struct kmemtrace_page_alloc_entry {
+ struct trace_entry ent;
+ enum kmemtrace_type_id type_id;
+ gfp_t gfp_mask;
+ int order;
+};
+
struct kmemtrace_alloc_entry {
struct trace_entry ent;
enum kmemtrace_type_id type_id;
@@ -328,6 +336,8 @@ extern void __ftrace_bad_type(void);
TRACE_GRAPH_RET); \
IF_ASSIGN(var, ent, struct hw_branch_entry, TRACE_HW_BRANCHES);\
IF_ASSIGN(var, ent, struct trace_power, TRACE_POWER); \
+ IF_ASSIGN(var, ent, struct kmemtrace_page_alloc_entry, \
+ TRACE_KMEM_PAGE_ALLOC); \
IF_ASSIGN(var, ent, struct kmemtrace_alloc_entry, \
TRACE_KMEM_ALLOC); \
IF_ASSIGN(var, ent, struct kmemtrace_free_entry, \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a3803ea..f939800 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1479,6 +1479,8 @@ __alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
unsigned long did_some_progress;
unsigned long pages_reclaimed = 0;
+ trace_alloc_pages(gfp_mask, order);
+
lockdep_trace_alloc(gfp_mask);
might_sleep_if(wait);
@@ -1676,6 +1678,8 @@ got_pg:
return page;
}
EXPORT_SYMBOL(__alloc_pages_internal);
+DEFINE_TRACE(alloc_pages);
+EXPORT_TRACEPOINT_SYMBOL(alloc_pages);
/*
* Common helper functions.
* Pekka Enberg <[email protected]> wrote:
> On Wed, 2009-04-01 at 15:49 +0200, Ingo Molnar wrote:
> > And this info could be added to that, and it would sure be nice to
> > hook it up to kmemtrace primarily, which does a lot of similar
> > looking work in the slab space. (but Eduard and Pekka will know how
> > feasible/interesting this is to them.)
>
> Yup, makes sense to me. Something like this is probably a good
> starting point for a proper patch.
looks like an excellent starting point.
> +kmemtrace_print_page_alloc_user(struct trace_iterator *iter,
> + struct kmemtrace_page_alloc_entry *entry)
> +{
> + struct kmemtrace_user_event_page_alloc *ev_alloc;
> + struct trace_seq *s = &iter->seq;
> + struct kmemtrace_user_event *ev;
> +
> + ev = trace_seq_reserve(s, sizeof(*ev));
> + if (!ev)
> + return TRACE_TYPE_PARTIAL_LINE;
> +
> + ev->event_id = KMEMTRACE_USER_PAGE_ALLOC;
> + ev->type_id = entry->type_id;
> + ev->event_size = sizeof(*ev) + sizeof(*ev_alloc);
> + ev->cpu = iter->cpu;
> + ev->timestamp = iter->ts;
> + ev->call_site = 0ULL; /* FIXME */
> + ev->ptr = 0ULL; /* FIXME */
Here we could call save_stack_trace(), in a way like this, to save
up to 8 entries of the allocation back-trace:
#define NR_ENTRIES 8
struct kmemtrace_user_event {
...
unsigned long entries[NR_ENTRIES];
...
};
struct stack_trace trace;
trace.nr_entries = 0;
trace.max_entries = NR_ENTRIES;
trace.entries = ev->entries;
trace.skip = 2;
save_stack_trace(&trace);
( the '2' for skip will skip the useless tracer-internal backtrace
bits. )
ftrace has built-in stacktrace capabilities as well - but they are
not hooked up to the binary-tracing pathway yet - right Steve?
Ingo
On Wed, 2009-04-01 at 17:22 +0200, Ingo Molnar wrote:
> > +kmemtrace_print_page_alloc_user(struct trace_iterator *iter,
> > + struct kmemtrace_page_alloc_entry *entry)
> > +{
> > + struct kmemtrace_user_event_page_alloc *ev_alloc;
> > + struct trace_seq *s = &iter->seq;
> > + struct kmemtrace_user_event *ev;
> > +
> > + ev = trace_seq_reserve(s, sizeof(*ev));
> > + if (!ev)
> > + return TRACE_TYPE_PARTIAL_LINE;
> > +
> > + ev->event_id = KMEMTRACE_USER_PAGE_ALLOC;
> > + ev->type_id = entry->type_id;
> > + ev->event_size = sizeof(*ev) + sizeof(*ev_alloc);
> > + ev->cpu = iter->cpu;
> > + ev->timestamp = iter->ts;
> > + ev->call_site = 0ULL; /* FIXME */
> > + ev->ptr = 0ULL; /* FIXME */
>
> Here we could call save_stack_trace(), in a way like this, to save
> up to 8 entries of the allocation back-trace:
The example code compiled as-is so here's an updated patch! :-) We
should probably do stack traces for _all_ events like you suggested but
that's a bigger ABI change so I did it only for the new page allocation
event.
Pekka
diff --git a/include/trace/kmemtrace.h b/include/trace/kmemtrace.h
index 28ee69f..c71f28b 100644
--- a/include/trace/kmemtrace.h
+++ b/include/trace/kmemtrace.h
@@ -20,6 +20,10 @@ static inline void kmemtrace_init(void)
}
#endif
+DECLARE_TRACE(alloc_pages,
+ TP_PROTO(gfp_t gfp_mask, int order),
+ TP_ARGS(gfp_mask, order));
+
DECLARE_TRACE(kmalloc,
TP_PROTO(unsigned long call_site,
const void *ptr,
diff --git a/kernel/trace/kmemtrace.c b/kernel/trace/kmemtrace.c
index 5011f4d..527198b 100644
--- a/kernel/trace/kmemtrace.c
+++ b/kernel/trace/kmemtrace.c
@@ -6,6 +6,7 @@
* Copyright (C) 2008 Frederic Weisbecker <[email protected]>
*/
+#include <linux/stacktrace.h>
#include <linux/tracepoint.h>
#include <linux/seq_file.h>
#include <linux/debugfs.h>
@@ -34,6 +35,28 @@ static struct tracer_flags kmem_tracer_flags = {
static struct trace_array *kmemtrace_array;
/* Trace allocations */
+static void kmemtrace_alloc_pages(gfp_t gfp_mask, int order)
+{
+ struct trace_array *tr = kmemtrace_array;
+ struct kmemtrace_page_alloc_entry *entry;
+ struct ring_buffer_event *event;
+
+ event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry));
+ if (!event)
+ return;
+ entry = ring_buffer_event_data(event);
+ tracing_generic_entry_update(&entry->ent, 0, 0);
+
+ entry->ent.type = TRACE_KMEM_PAGE_ALLOC;
+ entry->type_id = KMEMTRACE_TYPE_PAGES,
+ entry->gfp_mask = gfp_mask;
+ entry->order = order;
+
+ ring_buffer_unlock_commit(tr->buffer, event);
+
+ trace_wake_up();
+}
+
static inline void kmemtrace_alloc(enum kmemtrace_type_id type_id,
unsigned long call_site,
const void *ptr,
@@ -147,6 +170,10 @@ static int kmemtrace_start_probes(void)
{
int err;
+ err = register_trace_alloc_pages(kmemtrace_alloc_pages);
+ if (err)
+ return err;
+
err = register_trace_kmalloc(kmemtrace_kmalloc);
if (err)
return err;
@@ -214,8 +241,9 @@ static void kmemtrace_headers(struct seq_file *s)
* plus the origin CPU, since reordering occurs in-kernel now.
*/
-#define KMEMTRACE_USER_ALLOC 0
-#define KMEMTRACE_USER_FREE 1
+#define KMEMTRACE_USER_ALLOC 0
+#define KMEMTRACE_USER_FREE 1
+#define KMEMTRACE_USER_PAGE_ALLOC 2
struct kmemtrace_user_event {
u8 event_id;
@@ -227,6 +255,14 @@ struct kmemtrace_user_event {
unsigned long ptr;
};
+#define NR_ENTRIES 8
+
+struct kmemtrace_user_event_page_alloc {
+ unsigned gfp_mask;
+ int order;
+ unsigned long entries[NR_ENTRIES];
+};
+
struct kmemtrace_user_event_alloc {
size_t bytes_req;
size_t bytes_alloc;
@@ -235,6 +271,44 @@ struct kmemtrace_user_event_alloc {
};
static enum print_line_t
+kmemtrace_print_page_alloc_user(struct trace_iterator *iter,
+ struct kmemtrace_page_alloc_entry *entry)
+{
+ struct kmemtrace_user_event_page_alloc *ev_alloc;
+ struct trace_seq *s = &iter->seq;
+ struct kmemtrace_user_event *ev;
+ struct stack_trace trace;
+
+ ev = trace_seq_reserve(s, sizeof(*ev));
+ if (!ev)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ ev->event_id = KMEMTRACE_USER_PAGE_ALLOC;
+ ev->type_id = entry->type_id;
+ ev->event_size = sizeof(*ev) + sizeof(*ev_alloc);
+ ev->cpu = iter->cpu;
+ ev->timestamp = iter->ts;
+ ev->call_site = 0ULL; /* FIXME */
+ ev->ptr = 0ULL; /* FIXME */
+
+ ev_alloc = trace_seq_reserve(s, sizeof(*ev_alloc));
+ if (!ev_alloc)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ ev_alloc->gfp_mask = entry->gfp_mask;
+ ev_alloc->order = entry->order;
+
+ trace.nr_entries = 0;
+ trace.max_entries = NR_ENTRIES;
+ trace.entries = ev_alloc->entries;
+ trace.skip = 2;
+
+ save_stack_trace(&trace);
+
+ return TRACE_TYPE_HANDLED;
+}
+
+static enum print_line_t
kmemtrace_print_alloc_user(struct trace_iterator *iter,
struct kmemtrace_alloc_entry *entry)
{
@@ -288,7 +362,49 @@ kmemtrace_print_free_user(struct trace_iterator *iter,
return TRACE_TYPE_HANDLED;
}
-/* The two other following provide a more minimalistic output */
+/* The three other following provide a more minimalistic output */
+static enum print_line_t
+kmemtrace_print_page_alloc_compress(struct trace_iterator *iter,
+ struct kmemtrace_page_alloc_entry *entry)
+{
+ struct trace_seq *s = &iter->seq;
+ int ret;
+
+ /* Alloc entry */
+ ret = trace_seq_printf(s, " + ");
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ /* Type */
+ switch (entry->type_id) {
+ case KMEMTRACE_TYPE_PAGES:
+ ret = trace_seq_printf(s, "P ");
+ break;
+ default:
+ ret = trace_seq_printf(s, "? ");
+ }
+
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ /* Flags
+ * TODO: would be better to see the name of the GFP flag names
+ */
+ ret = trace_seq_printf(s, "%08x ", entry->gfp_mask);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ /* Node */
+ ret = trace_seq_printf(s, "%4d ", entry->order);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ if (!trace_seq_printf(s, "\n"))
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ return TRACE_TYPE_HANDLED;
+}
+
static enum print_line_t
kmemtrace_print_alloc_compress(struct trace_iterator *iter,
struct kmemtrace_alloc_entry *entry)
@@ -418,6 +534,16 @@ static enum print_line_t kmemtrace_print_line(struct trace_iterator *iter)
struct trace_entry *entry = iter->ent;
switch (entry->type) {
+ case TRACE_KMEM_PAGE_ALLOC: {
+ struct kmemtrace_page_alloc_entry *field;
+
+ trace_assign_type(field, entry);
+ if (kmem_tracer_flags.val & TRACE_KMEM_OPT_MINIMAL)
+ return kmemtrace_print_page_alloc_compress(iter, field);
+ else
+ return kmemtrace_print_page_alloc_user(iter, field);
+ }
+
case TRACE_KMEM_ALLOC: {
struct kmemtrace_alloc_entry *field;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index cbc168f..3b1bfa5 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -35,6 +35,7 @@ enum trace_type {
TRACE_SYSCALL_EXIT,
TRACE_KMEM_ALLOC,
TRACE_KMEM_FREE,
+ TRACE_KMEM_PAGE_ALLOC,
TRACE_POWER,
TRACE_BLK,
@@ -188,6 +189,13 @@ enum kmemtrace_type_id {
KMEMTRACE_TYPE_PAGES, /* __get_free_pages() and friends. */
};
+struct kmemtrace_page_alloc_entry {
+ struct trace_entry ent;
+ enum kmemtrace_type_id type_id;
+ gfp_t gfp_mask;
+ int order;
+};
+
struct kmemtrace_alloc_entry {
struct trace_entry ent;
enum kmemtrace_type_id type_id;
@@ -328,6 +336,8 @@ extern void __ftrace_bad_type(void);
TRACE_GRAPH_RET); \
IF_ASSIGN(var, ent, struct hw_branch_entry, TRACE_HW_BRANCHES);\
IF_ASSIGN(var, ent, struct trace_power, TRACE_POWER); \
+ IF_ASSIGN(var, ent, struct kmemtrace_page_alloc_entry, \
+ TRACE_KMEM_PAGE_ALLOC); \
IF_ASSIGN(var, ent, struct kmemtrace_alloc_entry, \
TRACE_KMEM_ALLOC); \
IF_ASSIGN(var, ent, struct kmemtrace_free_entry, \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a3803ea..f939800 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1479,6 +1479,8 @@ __alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
unsigned long did_some_progress;
unsigned long pages_reclaimed = 0;
+ trace_alloc_pages(gfp_mask, order);
+
lockdep_trace_alloc(gfp_mask);
might_sleep_if(wait);
@@ -1676,6 +1678,8 @@ got_pg:
return page;
}
EXPORT_SYMBOL(__alloc_pages_internal);
+DEFINE_TRACE(alloc_pages);
+EXPORT_TRACEPOINT_SYMBOL(alloc_pages);
/*
* Common helper functions.
On Thu, Apr 02, 2009 at 10:12:11AM +0300, Pekka Enberg wrote:
> On Wed, 2009-04-01 at 17:22 +0200, Ingo Molnar wrote:
> > > +kmemtrace_print_page_alloc_user(struct trace_iterator *iter,
> > > + struct kmemtrace_page_alloc_entry *entry)
> > > +{
> > > + struct kmemtrace_user_event_page_alloc *ev_alloc;
> > > + struct trace_seq *s = &iter->seq;
> > > + struct kmemtrace_user_event *ev;
> > > +
> > > + ev = trace_seq_reserve(s, sizeof(*ev));
> > > + if (!ev)
> > > + return TRACE_TYPE_PARTIAL_LINE;
> > > +
> > > + ev->event_id = KMEMTRACE_USER_PAGE_ALLOC;
> > > + ev->type_id = entry->type_id;
> > > + ev->event_size = sizeof(*ev) + sizeof(*ev_alloc);
> > > + ev->cpu = iter->cpu;
> > > + ev->timestamp = iter->ts;
> > > + ev->call_site = 0ULL; /* FIXME */
> > > + ev->ptr = 0ULL; /* FIXME */
> >
> > Here we could call save_stack_trace(), in a way like this, to save
> > up to 8 entries of the allocation back-trace:
>
> The example code compiled as-is so here's an updated patch! :-) We
> should probably do stack traces for _all_ events like you suggested but
> that's a bigger ABI change so I did it only for the new page allocation
> event.
I've thought about exporting a stack trace instead of a call-site
pointer, sounds nice. But I figure we should leave ev->call_site filled
as before, it's useful in some cases to show who is the intended caller,
e.g. caller-tracking variants of kmalloc.
It could also work out-of-the-box with the existing kmemtrace, since we
consume and ignore the remaining ev->event_size bytes we don't know how to
interpret. Moreover, events can hold 2^16 bytes which is enough to
export more than 8 previous frames.
One thing I'm not sure about this patch is whether it manages to record
an allocation only once, i.e. does it log a single event when/if the slab
allocator requests pages? Some time ago I sent a patch adding GFP_NOTRACE
to gfp.h, but was rejected. Maybe this could be a way out of the mess.
(GFP_NOTRACE would also allow us to log "backend" allocations easily and
treat them separately, for the record, or simply filter them out.)
Cheers,
Eduard
* Eduard - Gabriel Munteanu <[email protected]> wrote:
> One thing I'm not sure about this patch is whether it manages to
> record an allocation only once, i.e. does it log a single event
> when/if the slab allocator requests pages? Some time ago I sent a
> patch adding GFP_NOTRACE to gfp.h, but was rejected. Maybe this
> could be a way out of the mess.
>
> (GFP_NOTRACE would also allow us to log "backend" allocations
> easily and treat them separately, for the record, or simply filter
> them out.)
makes a lot of sense IMO to annotate these via a GFP flag.
Ingo
Ingo Molnar wrote:
> * Eduard - Gabriel Munteanu <[email protected]> wrote:
>
>> One thing I'm not sure about this patch is whether it manages to
>> record an allocation only once, i.e. does it log a single event
>> when/if the slab allocator requests pages? Some time ago I sent a
>> patch adding GFP_NOTRACE to gfp.h, but was rejected. Maybe this
>> could be a way out of the mess.
>>
>> (GFP_NOTRACE would also allow us to log "backend" allocations
>> easily and treat them separately, for the record, or simply filter
>> them out.)
>
> makes a lot of sense IMO to annotate these via a GFP flag.
Yup, makes sense. I think I rejected the patch (did I?) because I wanted
to fix the slub/slab mess differently but here it makes perfect sense.
Pekka
* Pekka Enberg <[email protected]> wrote:
> Ingo Molnar wrote:
>> * Eduard - Gabriel Munteanu <[email protected]> wrote:
>>
>>> One thing I'm not sure about this patch is whether it manages to
>>> record an allocation only once, i.e. does it log a single event
>>> when/if the slab allocator requests pages? Some time ago I sent a
>>> patch adding GFP_NOTRACE to gfp.h, but was rejected. Maybe this
>>> could be a way out of the mess.
>>>
>>> (GFP_NOTRACE would also allow us to log "backend" allocations easily
>>> and treat them separately, for the record, or simply filter them
>>> out.)
>>
>> makes a lot of sense IMO to annotate these via a GFP flag.
>
> Yup, makes sense. I think I rejected the patch (did I?) because I
> wanted to fix the slub/slab mess differently but here it makes
> perfect sense.
I'm wondering how much could be shared with kmemcheck's
internal-allocation annotations. There's some overlap (although not
a full match) i suspect?
Ingo
On Fri, Apr 03, 2009 at 04:43:44PM +0200, Ingo Molnar wrote:
>
> * Pekka Enberg <[email protected]> wrote:
>
> > Ingo Molnar wrote:
> >> * Eduard - Gabriel Munteanu <[email protected]> wrote:
> >>
> >>> One thing I'm not sure about this patch is whether it manages to
> >>> record an allocation only once, i.e. does it log a single event
> >>> when/if the slab allocator requests pages? Some time ago I sent a
> >>> patch adding GFP_NOTRACE to gfp.h, but was rejected. Maybe this
> >>> could be a way out of the mess.
> >>>
> >>> (GFP_NOTRACE would also allow us to log "backend" allocations easily
> >>> and treat them separately, for the record, or simply filter them
> >>> out.)
> >>
> >> makes a lot of sense IMO to annotate these via a GFP flag.
> >
> > Yup, makes sense. I think I rejected the patch (did I?) because I
> > wanted to fix the slub/slab mess differently but here it makes
> > perfect sense.
Uh, I don't remember exactly, not sure who did it.
But to expand on the upside of doing it this way, we could expect to see
something like...
xxxxx CPU0 GFP_KERNEL kmalloc
xxxxx CPU0 GFP_KERNEL | GFP_NOTRACE __get_free_pages
xxxxx CPU0 GFP_KERNEL | GFP_NOTRACE kmalloc
xxxxx CPU0 GFP_KERNEL kmem_cache_alloc
So it's easy to see that allocations 2 and 3 are in fact part of the
fourth. It will get confused by IRQs, preemption and sleeping while
allocating, but we could use additional information (e.g. tracking
returned ptr's) to eliminate some/many such confusions in userspace.
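In the slab allocators' page-level backend the annotation itself would be a
one-liner along these lines (a sketch; __GFP_NOTRACE is the proposed flag,
it does not exist yet):
	/* mark this as an internal "backend" allocation so the page_alloc
	 * tracepoint can be filtered out or matched with the slab event */
	page = alloc_pages_node(node, flags | __GFP_NOTRACE, order);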
> I'm wondering how much could be shared with kmemcheck's
> internal-allocation annotations. There's some overlap (although not
> a full match) i suspect?
>
> Ingo
I had already suggested this to Vegard some time ago. Basically,
kmemcheck should hook to the same tracepoints. As far as I remember,
kmemtrace gets more info out of the allocators, so this should suit
kmemcheck as well. This would incur additional overhead for the
latter due to the extra parameters being passed, but kmemcheck is slow
anyway.
Eduard
Hi Ingo,
On Fri, 2009-04-03 at 16:43 +0200, Ingo Molnar wrote:
> * Pekka Enberg <[email protected]> wrote:
>
> > Ingo Molnar wrote:
> >> * Eduard - Gabriel Munteanu <[email protected]> wrote:
> >>
> >>> One thing I'm not sure about this patch is whether it manages to
> >>> record an allocation only once, i.e. does it log a single event
> >>> when/if the slab allocator requests pages? Some time ago I sent a
> >>> patch adding GFP_NOTRACE to gfp.h, but was rejected. Maybe this
> >>> could be a way out of the mess.
> >>>
> >>> (GFP_NOTRACE would also allow us to log "backend" allocations easily
> >>> and treat them separately, for the record, or simply filter them
> >>> out.)
> >>
> >> makes a lot of sense IMO to annotate these via a GFP flag.
> >
> > Yup, makes sense. I think I rejected the patch (did I?) because I
> > wanted to fix the slub/slab mess differently but here it makes
> > perfect sense.
>
> I'm wondering how much could be shared with kmemcheck's
> internal-allocation annotations. There's some overlap (although not
> a full match) i suspect?
I didn't check but I suspect it's not a perfect match. Kmemcheck wants
to know a lot more about the internal workings of an allocator than
kmemtrace. That is, we need to deal with constructor special cases for
initialization and debugging, for instance.
Pekka