2009-04-21 22:45:40

by Larry Woodman

Subject: [Patch] mm tracepoints update


I've cleaned up the mm tracepoints to track page allocation and
freeing, various types of pagefaults and unmaps, and critical page
reclamation routines. This is useful for debugging memory allocation
issues and system performance problems under heavy memory loads.


----------------------------------------------------------------------


# tracer: mm
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
pdflush-624 [004] 184.293169: wb_kupdate:
mm_pdflush_kupdate count=3e48
pdflush-624 [004] 184.293439: get_page_from_freelist:
mm_page_allocation pfn=447c27 zone_free=1940910
events/6-33 [006] 184.962879: free_hot_cold_page:
mm_page_free pfn=44bba9
irqbalance-8313 [001] 188.042951: unmap_vmas:
mm_anon_userfree mm=ffff88044a7300c0 address=7f9a2eb70000 pfn=24c29a
cat-9122 [005] 191.141173: filemap_fault:
mm_filemap_fault primary fault: mm=ffff88024c9d8f40 address=3cea2dd000
pfn=44d68e
cat-9122 [001] 191.143036: handle_mm_fault:
mm_anon_fault mm=ffff88024c8beb40 address=7fffbde99f94 pfn=24ce22
-------------------------------------------------------------------------

Signed-off-by: Larry Woodman <[email protected]>
Acked-by: Rik van Riel <[email protected]>


The patch applies to Ingo's latest tip tree:


Attachments:
0001-Merge-mm-tracepoints-into-upstream-tip-tree.patch (19.56 kB)

2009-04-22 01:00:43

by KOSAKI Motohiro

Subject: Re: [Patch] mm tracepoints update

>
> I've cleaned up the mm tracepoints to track page allocation and
> freeing, various types of pagefaults and unmaps, and critical page
> reclamation routines. This is useful for debugging memory allocation
> issues and system performance problems under heavy memory loads.

In a past thread, Andrew pointed out that a bare page tracer isn't useful.
Can you provide a good consumer?


>
>
> ----------------------------------------------------------------------
>
>
> # tracer: mm
> #
> # TASK-PID CPU# TIMESTAMP FUNCTION
> # | | | | |
> pdflush-624 [004] 184.293169: wb_kupdate:
> mm_pdflush_kupdate count=3e48
> pdflush-624 [004] 184.293439: get_page_from_freelist:
> mm_page_allocation pfn=447c27 zone_free=1940910
> events/6-33 [006] 184.962879: free_hot_cold_page:
> mm_page_free pfn=44bba9
> irqbalance-8313 [001] 188.042951: unmap_vmas:
> mm_anon_userfree mm=ffff88044a7300c0 address=7f9a2eb70000 pfn=24c29a
> cat-9122 [005] 191.141173: filemap_fault:
> mm_filemap_fault primary fault: mm=ffff88024c9d8f40 address=3cea2dd000
> pfn=44d68e
> cat-9122 [001] 191.143036: handle_mm_fault:
> mm_anon_fault mm=ffff88024c8beb40 address=7fffbde99f94 pfn=24ce22
> -------------------------------------------------------------------------
>
> Signed-off-by: Larry Woodman <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
>
>
> The patch applies to ingo's latest tip tree:


2009-04-22 09:58:12

by Ingo Molnar

Subject: Re: [Patch] mm tracepoints update


* KOSAKI Motohiro <[email protected]> wrote:

> > I've cleaned up the mm tracepoints to track page allocation and
> > freeing, various types of pagefaults and unmaps, and critical
> > page reclamation routines. This is useful for debugging memory
> > allocation issues and system performance problems under heavy
> > memory loads.
>
> In a past thread, Andrew pointed out that a bare page tracer isn't useful.

(do you have a link to that mail?)

> Can you provide a good consumer?

These MM tracepoints would be automatically seen by the
ftrace-analyzer GUI tool for example:

git://git.kernel.org/pub/scm/utils/kernel/ftrace/ftrace.git

They could also be seen by other tools such as kmemtrace, and of
course they can be embedded in function tracer output.

Here's the list of advantages of the types of tracepoints Larry is
proposing:

- zero-copy and per-cpu splice() based tracing
- binary tracing without printf overhead
- structured logging records exposed under /debug/tracing/events
- trace events embedded in function tracer output and other plugins
- user-defined, per tracepoint filter expressions

I think the main review question is: are they properly structured
and do they expose essential information to analyze behavioral
details of the kernel in this area?

Ingo
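
For illustration, the per-event control files mentioned in the list
above could be driven from a small C helper. This is only a sketch
under stated assumptions: debugfs is mounted at /debug, each event has
a directory events/<system>/<event>/ holding "enable" and "filter"
files, and the helper names event_ctl_path() and event_ctl_write() are
hypothetical. None of it is taken from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<tracing_dir>/events/<system>/<event>/<file>" into buf.
 * Returns 0 on success, -1 if a name contains '/' or the result
 * would not fit. Hypothetical helper, for illustration only.
 */
int event_ctl_path(char *buf, size_t len, const char *tracing_dir,
		   const char *system, const char *event, const char *file)
{
	int n;

	assert(buf && tracing_dir && system && event && file);

	/* Refuse names that could escape the events/ directory. */
	if (strchr(system, '/') || strchr(event, '/') || strchr(file, '/'))
		return -1;

	n = snprintf(buf, len, "%s/events/%s/%s/%s",
		     tracing_dir, system, event, file);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write a value (for example "1", "0" or a filter expression) to one
 * control file of one event. Returns 0 on success, -1 on any error.
 */
int event_ctl_write(const char *tracing_dir, const char *system,
		    const char *event, const char *file, const char *value)
{
	char path[256];
	FILE *f;
	int ok;

	if (event_ctl_path(path, sizeof(path), tracing_dir,
			   system, event, file))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs(value, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller might enable a single event with
event_ctl_write("/debug/tracing", "mm", "mm_kswapd_ran", "enable", "1").
The event name is borrowed from the patch posted later in this thread,
and which filter expressions are accepted depends on the kernel version.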

2009-04-22 12:11:32

by Larry Woodman

Subject: Re: [Patch] mm tracepoints update

On Wed, 2009-04-22 at 11:57 +0200, Ingo Molnar wrote:
> * KOSAKI Motohiro <[email protected]> wrote:
>
> > > I've cleaned up the mm tracepoints to track page allocation and
> > > freeing, various types of pagefaults and unmaps, and critical
> > > page reclamation routines. This is useful for debugging memory
> > > allocation issues and system performance problems under heavy
> > > memory loads.
> >
> > In a past thread, Andrew pointed out that a bare page tracer isn't useful.
>
> (do you have a link to that mail?)
>
> > Can you provide a good consumer?

I will work up some good examples of what these are useful for. I
frequently use the mm tracepoint data in the debugfs trace buffer to
locate customer performance problems associated with memory allocation,
deallocation, paging and swapping, especially on large systems.

Larry

>
> These MM tracepoints would be automatically seen by the
> ftrace-analyzer GUI tool for example:
>
> git://git.kernel.org/pub/scm/utils/kernel/ftrace/ftrace.git
>
> They could also be seen by other tools such as kmemtrace, and of
> course they can be embedded in function tracer output.
>
> Here's the list of advantages of the types of tracepoints Larry is
> proposing:
>
> - zero-copy and per-cpu splice() based tracing
> - binary tracing without printf overhead
> - structured logging records exposed under /debug/tracing/events
> - trace events embedded in function tracer output and other plugins
> - user-defined, per tracepoint filter expressions
>
> I think the main review question is: are they properly structured
> and do they expose essential information to analyze behavioral
> details of the kernel in this area?
>
> Ingo
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2009-04-22 19:23:19

by Larry Woodman

Subject: Re: [Patch] mm tracepoints update - use case.

On Wed, 2009-04-22 at 08:07 -0400, Larry Woodman wrote:
> On Wed, 2009-04-22 at 11:57 +0200, Ingo Molnar wrote:
> > * KOSAKI Motohiro <[email protected]> wrote:

> > > In a past thread, Andrew pointed out that a bare page tracer isn't useful.
> >
> > (do you have a link to that mail?)
> >
> > > Can you provide a good consumer?
>
> I will work up some good examples of what these are useful for. I
> frequently use the mm tracepoint data in the debugfs trace buffer to
> locate customer performance problems associated with memory allocation,
> deallocation, paging and swapping, especially on large systems.
>
> Larry

Attached is an example of what the mm tracepoints can be used for:



Attachments:
usecase (7.10 kB)

2009-04-23 00:48:36

by KOSAKI Motohiro

Subject: Re: [Patch] mm tracepoints update - use case.

> On Wed, 2009-04-22 at 08:07 -0400, Larry Woodman wrote:
> > On Wed, 2009-04-22 at 11:57 +0200, Ingo Molnar wrote:
> > > * KOSAKI Motohiro <[email protected]> wrote:
>
> > > > In a past thread, Andrew pointed out that a bare page tracer isn't useful.
> > >
> > > (do you have a link to that mail?)
> > >
> > > > Can you provide a good consumer?
> >
> > I will work up some good examples of what these are useful for. I
> > frequently use the mm tracepoint data in the debugfs trace buffer to
> > locate customer performance problems associated with memory allocation,
> > deallocation, paging and swapping, especially on large systems.
> >
> > Larry
>
> Attached is an example of what the mm tracepoints can be used for:

I have some comments.

1. Yes, the current zone_reclaim has strange behavior. I plan to fix
some of the bug-like behavior.
2. Your scenario only uses the information that zone_reclaim was called.
The function tracer already provides that.
3. But yes, you are heading in the right direction. We definitely need
some fine-grained tracepoints in this area, and your work is welcome.
However, my personal feeling is that your tracepoints carry many
arguments of little value. We need more useful information.
I think I can help you in this area. I hope we can work together.






2009-04-23 04:58:28

by Andrew Morton

Subject: Re: [Patch] mm tracepoints update - use case.

On Thu, 23 Apr 2009 09:48:04 +0900 (JST) KOSAKI Motohiro <[email protected]> wrote:

> > On Wed, 2009-04-22 at 08:07 -0400, Larry Woodman wrote:
> > > On Wed, 2009-04-22 at 11:57 +0200, Ingo Molnar wrote:
> > > > * KOSAKI Motohiro <[email protected]> wrote:
> >
> > > > > In a past thread, Andrew pointed out that a bare page tracer isn't useful.
> > > >
> > > > (do you have a link to that mail?)

http://lkml.indiana.edu/hypermail/linux/kernel/0903.0/02674.html

And Larry's example use case here tends to reinforce what I said then. Look:

: In addition I could see that the priority was decremented to zero and
: that 12342 pages had been reclaimed rather than just enough to satisfy
: the page allocation request.
:
: -----------------------------------------------------------------------------
: # tracer: nop
: #
: # TASK-PID CPU# TIMESTAMP FUNCTION
: # | | | | |
: <mem>-10723 [005] 6976.285610: mm_directreclaim_reclaimzone: reclaimed=12342, priority=0

and

: -----------------------------------------------------------------------------
: # tracer: nop
: #
: # TASK-PID CPU# TIMESTAMP FUNCTION
: # | | | | |
: <mem>-10723 [005] 282.776271: mm_pagereclaim_shrinkzone: reclaimed=12342
: <mem>-10723 [005] 282.781209: mm_pagereclaim_shrinkzone: reclaimed=3540
: <mem>-10723 [005] 282.801194: mm_pagereclaim_shrinkzone: reclaimed=7528
: -----------------------------------------------------------------------------

This diagnosis was successful because the "reclaimed" number was weird.
By sheer happy coincidence, page-reclaim is already generating the
aggregated numbers for us, and the tracer just prints it out.

If some other problem is being worked on and if there _isn't_ some
convenient already-present aggregated result for the tracer to print,
the problem won't be solved. Unless a vast number of trace events are
emitted and problem-specific userspace code is written to aggregate
them into something which the developer can use.
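
A minimal sketch of the kind of problem-specific userspace aggregation
described above, assuming trace lines shaped like the
mm_pagereclaim_shrinkzone samples quoted in this mail. The struct and
function names (reclaim_stats, account_line, summarize) are hypothetical
and not part of any existing tool.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Running totals for one event name (hypothetical helper type). */
struct reclaim_stats {
	unsigned long events;		/* matching trace lines seen */
	unsigned long reclaimed;	/* sum of all reclaimed= values */
	unsigned long max;		/* largest single reclaimed= value */
};

/*
 * If 'line' contains "<event>:" followed by "reclaimed=<decimal>",
 * fold the value into 'st' and return 1. Otherwise return 0 and
 * leave 'st' untouched.
 */
int account_line(const char *line, const char *event,
		 struct reclaim_stats *st)
{
	const char *p, *v;
	char *end;
	unsigned long n;

	assert(line && event && st);

	p = strstr(line, event);
	if (!p || p[strlen(event)] != ':')
		return 0;

	v = strstr(p, "reclaimed=");
	if (!v)
		return 0;
	v += strlen("reclaimed=");
	if (*v < '0' || *v > '9')
		return 0;

	n = strtoul(v, &end, 10);
	if (end == v)
		return 0;

	st->events++;
	st->reclaimed += n;
	if (n > st->max)
		st->max = n;
	return 1;
}

/* Read trace text from 'in' and print a one-line summary for 'event'. */
void summarize(FILE *in, const char *event)
{
	struct reclaim_stats st = { 0, 0, 0 };
	char line[512];

	while (fgets(line, sizeof(line), in))
		account_line(line, event, &st);

	printf("%s: events=%lu reclaimed=%lu max=%lu\n",
	       event, st.events, st.reclaimed, st.max);
}
```

Calling summarize(stdin, "mm_pagereclaim_shrinkzone") on a saved trace
would reduce many per-call events to one total and one maximum, which
is the sort of aggregated number the diagnosis above relied on.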

2009-04-23 08:43:35

by Ingo Molnar

Subject: Re: [Patch] mm tracepoints update - use case.


* Andrew Morton <[email protected]> wrote:

> On Thu, 23 Apr 2009 09:48:04 +0900 (JST) KOSAKI Motohiro <[email protected]> wrote:
>
> > > On Wed, 2009-04-22 at 08:07 -0400, Larry Woodman wrote:
> > > > On Wed, 2009-04-22 at 11:57 +0200, Ingo Molnar wrote:
> > > > > * KOSAKI Motohiro <[email protected]> wrote:
> > >
> > > > > > In a past thread, Andrew pointed out that a bare page tracer isn't useful.
> > > > >
> > > > > (do you have a link to that mail?)
>
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.0/02674.html
>
> And Larry's example use case here tends to reinforce what I said then. Look:
>
> : In addition I could see that the priority was decremented to zero and
> : that 12342 pages had been reclaimed rather than just enough to satisfy
> : the page allocation request.
> :
> : -----------------------------------------------------------------------------
> : # tracer: nop
> : #
> : # TASK-PID CPU# TIMESTAMP FUNCTION
> : # | | | | |
> : <mem>-10723 [005] 6976.285610: mm_directreclaim_reclaimzone: reclaimed=12342, priority=0
>
> and
>
> : -----------------------------------------------------------------------------
> : # tracer: nop
> : #
> : # TASK-PID CPU# TIMESTAMP FUNCTION
> : # | | | | |
> : <mem>-10723 [005] 282.776271: mm_pagereclaim_shrinkzone: reclaimed=12342
> : <mem>-10723 [005] 282.781209: mm_pagereclaim_shrinkzone: reclaimed=3540
> : <mem>-10723 [005] 282.801194: mm_pagereclaim_shrinkzone: reclaimed=7528
> : -----------------------------------------------------------------------------
>
> This diagnosis was successful because the "reclaimed" number was
> weird. By sheer happy coincidence, page-reclaim is already
> generating the aggregated numbers for us, and the tracer just
> prints it out.
>
> If some other problem is being worked on and if there _isn't_ some
> convenient already-present aggregated result for the tracer to
> print, the problem won't be solved. Unless a vast number of trace
> events are emitted and problem-specific userspace code is written
> to aggregate them into something which the developer can use.

Not so in the use cases where i made use of tracers. The key is not to
trace everything, but to have a few key _concepts_ traced
pervasively. Having a dynamic notion of per-event changes is also
obviously good. In a fast changing workload you cannot just tell
based on summary statistics whether rapid changes are the product of
the inherent entropy of the workload, or the result of the MM being
confused.

/proc/ statistics versus good tracing is like the difference
between a magnifying glass and an electron microscope. Both have
their strengths, and they are best if used together.

One such conceptual thing in the scheduler is the lifetime of a
task, its schedule, deschedule and wakeup events. It can already
show a massive amount of badness in practice, and it only takes a
few tracepoints to do.

Same goes for the MM IMHO. Number of pages reclaimed is obviously a
key metric to follow. Larry is an expert who fixed a _lot_ of MM
crap in the last 5-10 years at Red Hat, so if he says that these
tracepoints are useful to him, we shouldn't just dismiss that
experience like that. I wish Larry spent some of his energies on
fixing the upstream MM too ;-)

A balanced number of MM tracepoints, showing the concepts and the
inner dynamics of the MM would be useful. We don't need every little
detail traced (we have the function tracer for that), but a few key
aspects would be nice to capture ...

pagefaults, allocations, cache-misses, cache flushes and how pages
shift between various queues in the MM would be a good start IMHO.

Anyway, i suspect your answer means a NAK :-( Would be nice if you
would suggest a path out of that NAK.

Ingo

2009-04-23 11:52:18

by Larry Woodman

Subject: Re: [Patch] mm tracepoints update - use case.

On Thu, 2009-04-23 at 10:42 +0200, Ingo Molnar wrote:

>
> Not so in the use cases where i made use of tracers. The key is not to
> trace everything, but to have a few key _concepts_ traced
> pervasively. Having a dynamic notion of per-event changes is also
> obviously good. In a fast changing workload you cannot just tell
> based on summary statistics whether rapid changes are the product of
> the inherent entropy of the workload, or the result of the MM being
> confused.
>
> /proc/ statistics versus good tracing is like the difference
> between a magnifying glass and an electron microscope. Both have
> their strengths, and they are best if used together.
>
> One such conceptual thing in the scheduler is the lifetime of a
> task, its schedule, deschedule and wakeup events. It can already
> show a massive amount of badness in practice, and it only takes a
> few tracepoints to do.
>
> Same goes for the MM IMHO. Number of pages reclaimed is obviously a
> key metric to follow. Larry is an expert who fixed a _lot_ of MM
> crap in the last 5-10 years at Red Hat, so if he says that these
> tracepoints are useful to him, we shouldn't just dismiss that
> experience like that. I wish Larry spent some of his energies on
> fixing the upstream MM too ;-)
>
> A balanced number of MM tracepoints, showing the concepts and the
> inner dynamics of the MM would be useful. We don't need every little
> detail traced (we have the function tracer for that), but a few key
> aspects would be nice to capture ...

I hear you. There is a lot of data coming out of these mm tracepoints, as
well as most of the other tracepoints I've played around with, so we have
to filter them. I added them in locations that would allow us to debug
a variety of real running systems, such as a Wall St. trading server
during the heaviest period of the day, without rebooting into a debug
kernel. We can collect whatever is needed to figure out what's happening,
then turn it all off when we've collected enough. We've seen systems
experiencing performance problems caused by the innards of the page
reclaim code, memory leak problems caused by applications, excessive COW
faults caused by applications that mmap() gigs of files and then fork, and
applications that rely on the kernel to flush out every modified page of
those gigs of mmap()'d file data every 30 seconds via kupdate because
other kernels do. The list goes on and on... These tracepoints are in
the same locations where we've placed debug code in debug kernels in the
past.

Larry



>
> pagefaults, allocations, cache-misses, cache flushes and how pages
> shift between various queues in the MM would be a good start IMHO.
>
> Anyway, i suspect your answer means a NAK :-( Would be nice if you
> would suggest a path out of that NAK.
>
> Ingo

2009-04-24 20:50:30

by Larry Woodman

Subject: Re: [Patch] mm tracepoints update - use case.

On Thu, 2009-04-23 at 07:47 -0400, Larry Woodman wrote:
> On Thu, 2009-04-23 at 10:42 +0200, Ingo Molnar wrote:

> >
> > A balanced number of MM tracepoints, showing the concepts and the
> > inner dynamics of the MM would be useful. We don't need every little
> > detail traced (we have the function tracer for that), but a few key
> > aspects would be nice to capture ...
>
> I hear you. There is a lot of data coming out of these mm tracepoints, as
> well as most of the other tracepoints I've played around with, so we have
> to filter them. I added them in locations that would allow us to debug
> a variety of real running systems, such as a Wall St. trading server
> during the heaviest period of the day, without rebooting into a debug
> kernel. We can collect whatever is needed to figure out what's happening,
> then turn it all off when we've collected enough. We've seen systems
> experiencing performance problems caused by the innards of the page
> reclaim code, memory leak problems caused by applications, excessive COW
> faults caused by applications that mmap() gigs of files and then fork, and
> applications that rely on the kernel to flush out every modified page of
> those gigs of mmap()'d file data every 30 seconds via kupdate because
> other kernels do. The list goes on and on... These tracepoints are in
> the same locations where we've placed debug code in debug kernels in the
> past.
>
> Larry
>

> >
> > pagefaults, allocations, cache-misses, cache flushes and how pages
> > shift between various queues in the MM would be a good start IMHO.
> >
> > Anyway, i suspect your answer means a NAK :-( Would be nice if you
> > would suggest a path out of that NAK.
> >
> > Ingo
>


I've overhauled the patch so that all page-level tracing has been
removed unless it directly causes page reclamation. At this point we
trace individual pagefaults, unmaps and pageouts. However, for all page
reclaim paths and writeback paths we now trace the quantities of pages
activated, deactivated, written, reclaimed, etc. Also, we now only
trace individual page allocations that cause further page reclamation to
occur. This still provides the necessary microscopic level of detail
without tracing the movement of all the pageframes:




Attachments:
mm-tracepoints.patch (17.34 kB)

2009-06-15 18:27:40

by Rik van Riel

Subject: Re: [Patch] mm tracepoints update - use case.

KOSAKI Motohiro wrote:
>> On Wed, 2009-04-22 at 08:07 -0400, Larry Woodman wrote:

>> Attached is an example of what the mm tracepoints can be used for:
>
> I have some comments.
>
> 1. Yes, the current zone_reclaim has strange behavior. I plan to fix
> some of the bug-like behavior.
> 2. Your scenario only uses the information that zone_reclaim was called.
> The function tracer already provides that.
> 3. But yes, you are heading in the right direction. We definitely need
> some fine-grained tracepoints in this area, and your work is welcome.
> However, my personal feeling is that your tracepoints carry many
> arguments of little value. We need more useful information.
> I think I can help you in this area. I hope we can work together.

Sorry I am replying to a really old email, but exactly
what information do you believe would be more useful to
extract from vmscan.c with tracepoints?

What are the kinds of problems that customer systems
(which cannot be rebooted into experimental kernels)
run into, that can be tracked down with tracepoints?

I can think of a few:
- excessive CPU use in page reclaim code
- excessive reclaim latency in page reclaim code
- unbalanced memory allocation between zones/nodes
- strange balance problems between reclaiming of page
cache and swapping out process pages

I suspect we would need fairly fine grained tracepoints
to track down these kinds of problems, with filtering
and/or interpretation in userspace, but I am always
interested in easier ways of tracking down these kinds
of problems :)

What kinds of tracepoints do you believe we would need?

Or, using Larry's patch as a starting point, what do you
believe should be changed?

--
All rights reversed.
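
As one possible form of the userspace interpretation mentioned above,
the sketch below turns a single reclaim event into a scanned-versus-
reclaimed percentage, which could hint at excessive CPU use in page
reclaim when many pages are scanned for few reclaimed. It assumes the
"scanned=..., reclaimed=..., priority=..." text layout that Larry's
patch in this thread prints for mm_pagereclaim_shrinkinactive. The
function names parse_shrinkinactive() and reclaim_pct() are
hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one trace line for the mm_pagereclaim_shrinkinactive event.
 * Returns 1 and fills the three outputs when the line matches,
 * 0 otherwise (the outputs may then be partially written).
 */
int parse_shrinkinactive(const char *line, unsigned long *scanned,
			 unsigned long *reclaimed, int *priority)
{
	static const char tag[] = "mm_pagereclaim_shrinkinactive: ";
	const char *p;

	assert(line && scanned && reclaimed && priority);

	p = strstr(line, tag);
	if (!p)
		return 0;

	/* Skip the event name, then match the printed field layout. */
	return sscanf(p + sizeof(tag) - 1,
		      "scanned=%lu, reclaimed=%lu, priority=%d",
		      scanned, reclaimed, priority) == 3;
}

/* Percentage of scanned pages that were reclaimed; 0 if none scanned. */
unsigned long reclaim_pct(unsigned long scanned, unsigned long reclaimed)
{
	return scanned ? (reclaimed * 100) / scanned : 0;
}
```

Feeding each line of a captured trace through parse_shrinkinactive()
and tracking the lowest reclaim_pct() per priority level would be one
way to spot scans that do a lot of work for little gain.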

2009-06-17 14:13:50

by Larry Woodman

Subject: Re: [Patch] mm tracepoints update - use case.

diff --git a/include/trace/events/mm.h b/include/trace/events/mm.h
new file mode 100644
index 0000000..1d888a4
--- /dev/null
+++ b/include/trace/events/mm.h
@@ -0,0 +1,436 @@
+#if !defined(_TRACE_MM_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MM_H
+
+#include <linux/mm.h>
+#include <linux/tracepoint.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mm
+
+TRACE_EVENT(mm_anon_fault,
+
+ TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+ TP_ARGS(mm, address),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, address)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->address = address;
+ ),
+
+ TP_printk("mm=%lx address=%lx",
+ (unsigned long)__entry->mm, __entry->address)
+);
+
+TRACE_EVENT(mm_anon_pgin,
+
+ TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+ TP_ARGS(mm, address),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, address)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->address = address;
+ ),
+
+ TP_printk("mm=%lx address=%lx",
+ (unsigned long)__entry->mm, __entry->address)
+ );
+
+TRACE_EVENT(mm_anon_cow,
+
+ TP_PROTO(struct mm_struct *mm,
+ unsigned long address),
+
+ TP_ARGS(mm, address),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, address)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->address = address;
+ ),
+
+ TP_printk("mm=%lx address=%lx",
+ (unsigned long)__entry->mm, __entry->address)
+ );
+
+TRACE_EVENT(mm_anon_userfree,
+
+ TP_PROTO(struct mm_struct *mm,
+ unsigned long address),
+
+ TP_ARGS(mm, address),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, address)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->address = address;
+ ),
+
+ TP_printk("mm=%lx address=%lx",
+ (unsigned long)__entry->mm, __entry->address)
+ );
+
+TRACE_EVENT(mm_anon_unmap,
+
+ TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+ TP_ARGS(mm, address),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, address)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->address = address;
+ ),
+
+ TP_printk("mm=%lx address=%lx",
+ (unsigned long)__entry->mm, __entry->address)
+ );
+
+TRACE_EVENT(mm_filemap_fault,
+
+ TP_PROTO(struct mm_struct *mm, unsigned long address, int flag),
+ TP_ARGS(mm, address, flag),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, address)
+ __field(int, flag)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->address = address;
+ __entry->flag = flag;
+ ),
+
+ TP_printk("%s: mm=%lx address=%lx",
+ __entry->flag ? "pagein" : "primary fault",
+ (unsigned long)__entry->mm, __entry->address)
+ );
+
+TRACE_EVENT(mm_filemap_cow,
+
+ TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+ TP_ARGS(mm, address),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, address)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->address = address;
+ ),
+
+ TP_printk("mm=%lx address=%lx",
+ (unsigned long)__entry->mm, __entry->address)
+ );
+
+TRACE_EVENT(mm_filemap_unmap,
+
+ TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+ TP_ARGS(mm, address),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, address)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->address = address;
+ ),
+
+ TP_printk("mm=%lx address=%lx",
+ (unsigned long)__entry->mm, __entry->address)
+ );
+
+TRACE_EVENT(mm_filemap_userunmap,
+
+ TP_PROTO(struct mm_struct *mm, unsigned long address),
+
+ TP_ARGS(mm, address),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, address)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->address = address;
+ ),
+
+ TP_printk("mm=%lx address=%lx",
+ (unsigned long)__entry->mm, __entry->address)
+ );
+
+TRACE_EVENT(mm_pagereclaim_pgout,
+
+ TP_PROTO(struct address_space *mapping, unsigned long offset, int anon),
+
+ TP_ARGS(mapping, offset, anon),
+
+ TP_STRUCT__entry(
+ __field(struct address_space *, mapping)
+ __field(unsigned long, offset)
+ __field(int, anon)
+ ),
+
+ TP_fast_assign(
+ __entry->mapping = mapping;
+ __entry->offset = offset;
+ __entry->anon = anon;
+ ),
+
+ TP_printk("mapping=%lx, offset=%lx %s",
+ (unsigned long)__entry->mapping, __entry->offset,
+ __entry->anon ? "anonymous" : "pagecache")
+ );
+
+TRACE_EVENT(mm_pagereclaim_free,
+
+ TP_PROTO(unsigned long nr_reclaimed),
+
+ TP_ARGS(nr_reclaimed),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, nr_reclaimed)
+ ),
+
+ TP_fast_assign(
+ __entry->nr_reclaimed = nr_reclaimed;
+ ),
+
+ TP_printk("freed=%ld", __entry->nr_reclaimed)
+ );
+
+TRACE_EVENT(mm_pdflush_bgwriteout,
+
+ TP_PROTO(unsigned long written),
+
+ TP_ARGS(written),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, written)
+ ),
+
+ TP_fast_assign(
+ __entry->written = written;
+ ),
+
+ TP_printk("written=%ld", __entry->written)
+ );
+
+TRACE_EVENT(mm_pdflush_kupdate,
+
+ TP_PROTO(unsigned long writes),
+
+ TP_ARGS(writes),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, writes)
+ ),
+
+ TP_fast_assign(
+ __entry->writes = writes;
+ ),
+
+ TP_printk("writes=%ld", __entry->writes)
+ );
+
+TRACE_EVENT(mm_balance_dirty,
+
+ TP_PROTO(unsigned long written),
+
+ TP_ARGS(written),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, written)
+ ),
+
+ TP_fast_assign(
+ __entry->written = written;
+ ),
+
+ TP_printk("written=%ld", __entry->written)
+ );
+
+TRACE_EVENT(mm_page_allocation,
+
+ TP_PROTO(unsigned long free),
+
+ TP_ARGS(free),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, free)
+ ),
+
+ TP_fast_assign(
+ __entry->free = free;
+ ),
+
+ TP_printk("zone_free=%ld", __entry->free)
+ );
+
+TRACE_EVENT(mm_kswapd_ran,
+
+ TP_PROTO(struct pglist_data *pgdat, unsigned long reclaimed),
+
+ TP_ARGS(pgdat, reclaimed),
+
+ TP_STRUCT__entry(
+ __field(struct pglist_data *, pgdat)
+ __field(int, node_id)
+ __field(unsigned long, reclaimed)
+ ),
+
+ TP_fast_assign(
+ __entry->pgdat = pgdat;
+ __entry->node_id = pgdat->node_id;
+ __entry->reclaimed = reclaimed;
+ ),
+
+ TP_printk("node=%d reclaimed=%ld", __entry->node_id, __entry->reclaimed)
+ );
+
+TRACE_EVENT(mm_directreclaim_reclaimall,
+
+ TP_PROTO(int node, unsigned long reclaimed, unsigned long priority),
+
+ TP_ARGS(node, reclaimed, priority),
+
+ TP_STRUCT__entry(
+ __field(int, node)
+ __field(unsigned long, reclaimed)
+ __field(unsigned long, priority)
+ ),
+
+ TP_fast_assign(
+ __entry->node = node;
+ __entry->reclaimed = reclaimed;
+ __entry->priority = priority;
+ ),
+
+ TP_printk("node=%d reclaimed=%ld priority=%ld", __entry->node, __entry->reclaimed,
+ __entry->priority)
+ );
+
+TRACE_EVENT(mm_directreclaim_reclaimzone,
+
+ TP_PROTO(int node, unsigned long reclaimed, unsigned long priority),
+
+ TP_ARGS(node, reclaimed, priority),
+
+ TP_STRUCT__entry(
+ __field(int, node)
+ __field(unsigned long, reclaimed)
+ __field(unsigned long, priority)
+ ),
+
+ TP_fast_assign(
+ __entry->node = node;
+ __entry->reclaimed = reclaimed;
+ __entry->priority = priority;
+ ),
+
+ TP_printk("node = %d reclaimed=%ld, priority=%ld",
+ __entry->node, __entry->reclaimed, __entry->priority)
+ );
+TRACE_EVENT(mm_pagereclaim_shrinkzone,
+
+ TP_PROTO(unsigned long reclaimed, unsigned long priority),
+
+ TP_ARGS(reclaimed, priority),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, reclaimed)
+ __field(unsigned long, priority)
+ ),
+
+ TP_fast_assign(
+ __entry->reclaimed = reclaimed;
+ __entry->priority = priority;
+ ),
+
+ TP_printk("reclaimed=%ld priority=%ld",
+ __entry->reclaimed, __entry->priority)
+ );
+
+TRACE_EVENT(mm_pagereclaim_shrinkactive,
+
+ TP_PROTO(unsigned long scanned, int file, int priority),
+
+ TP_ARGS(scanned, file, priority),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, scanned)
+ __field(int, file)
+ __field(int, priority)
+ ),
+
+ TP_fast_assign(
+ __entry->scanned = scanned;
+ __entry->file = file;
+ __entry->priority = priority;
+ ),
+
+ TP_printk("scanned=%ld, %s, priority=%d",
+ __entry->scanned, __entry->file ? "pagecache" : "anonymous",
+ __entry->priority)
+ );
+
+TRACE_EVENT(mm_pagereclaim_shrinkinactive,
+
+ TP_PROTO(unsigned long scanned, unsigned long reclaimed,
+ int priority),
+
+ TP_ARGS(scanned, reclaimed, priority),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, scanned)
+ __field(unsigned long, reclaimed)
+ __field(int, priority)
+ ),
+
+ TP_fast_assign(
+ __entry->scanned = scanned;
+ __entry->reclaimed = reclaimed;
+ __entry->priority = priority;
+ ),
+
+ TP_printk("scanned=%ld, reclaimed=%ld, priority=%d",
+ __entry->scanned, __entry->reclaimed,
+ __entry->priority)
+ );
+
+#endif /* _TRACE_MM_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/filemap.c b/mm/filemap.c
index 1b60f30..af4a964 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -34,6 +34,7 @@
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
#include <linux/mm_inline.h> /* for page_is_file_cache() */
+#include <trace/events/mm.h>
#include "internal.h"

/*
@@ -1568,6 +1569,8 @@ retry_find:
*/
ra->prev_pos = (loff_t)page->index << PAGE_CACHE_SHIFT;
vmf->page = page;
+ trace_mm_filemap_fault(vma->vm_mm, (unsigned long)vmf->virtual_address,
+ vmf->flags&FAULT_FLAG_NONLINEAR);
return ret | VM_FAULT_LOCKED;

no_cached_page:
diff --git a/mm/memory.c b/mm/memory.c
index 4126dd1..a4a580c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -61,6 +61,7 @@
#include <asm/tlb.h>
#include <asm/tlbflush.h>
#include <asm/pgtable.h>
+#include <trace/events/mm.h>

#include "internal.h"

@@ -812,15 +813,17 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
addr) != page->index)
set_pte_at(mm, addr, pte,
pgoff_to_pte(page->index));
- if (PageAnon(page))
+ if (PageAnon(page)) {
anon_rss--;
- else {
+ trace_mm_anon_userfree(mm, addr);
+ } else {
if (pte_dirty(ptent))
set_page_dirty(page);
if (pte_young(ptent) &&
likely(!VM_SequentialReadHint(vma)))
mark_page_accessed(page);
file_rss--;
+ trace_mm_filemap_userunmap(mm, addr);
}
page_remove_rmap(page);
if (unlikely(page_mapcount(page) < 0))
@@ -1896,7 +1899,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *page_table, pmd_t *pmd,
spinlock_t *ptl, pte_t orig_pte)
{
- struct page *old_page, *new_page;
+ struct page *old_page, *new_page = NULL;
pte_t entry;
int reuse = 0, ret = 0;
int page_mkwrite = 0;
@@ -2050,9 +2053,12 @@ gotten:
if (!PageAnon(old_page)) {
dec_mm_counter(mm, file_rss);
inc_mm_counter(mm, anon_rss);
+ trace_mm_filemap_cow(mm, address);
}
- } else
+ } else {
inc_mm_counter(mm, anon_rss);
+ trace_mm_anon_cow(mm, address);
+ }
flush_cache_page(vma, address, pte_pfn(orig_pte));
entry = mk_pte(new_page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -2449,7 +2455,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
int write_access, pte_t orig_pte)
{
spinlock_t *ptl;
- struct page *page;
+ struct page *page = NULL;
swp_entry_t entry;
pte_t pte;
struct mem_cgroup *ptr = NULL;
@@ -2549,6 +2555,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
unlock:
pte_unmap_unlock(page_table, ptl);
out:
+ trace_mm_anon_pgin(mm, address);
return ret;
out_nomap:
mem_cgroup_cancel_charge_swapin(ptr);
@@ -2582,6 +2589,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto oom;
__SetPageUptodate(page);

+ trace_mm_anon_fault(mm, address);
if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
goto oom_free_page;

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bb553c3..ef92a97 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -34,6 +34,7 @@
#include <linux/syscalls.h>
#include <linux/buffer_head.h>
#include <linux/pagevec.h>
+#include <trace/events/mm.h>

/*
* The maximum number of pages to writeout in a single bdflush/kupdate
@@ -574,6 +575,7 @@ static void balance_dirty_pages(struct address_space *mapping)
congestion_wait(WRITE, HZ/10);
}

+ trace_mm_balance_dirty(pages_written);
if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
bdi->dirty_exceeded)
bdi->dirty_exceeded = 0;
@@ -716,6 +718,7 @@ static void background_writeout(unsigned long _min_pages)
break;
}
}
+ trace_mm_pdflush_bgwriteout(_min_pages);
}

/*
@@ -776,6 +779,7 @@ static void wb_kupdate(unsigned long arg)
nr_to_write = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS) +
(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+ trace_mm_pdflush_kupdate(nr_to_write);
while (nr_to_write > 0) {
wbc.more_io = 0;
wbc.encountered_congestion = 0;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0727896..ca9355e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -48,6 +48,7 @@
#include <linux/page_cgroup.h>
#include <linux/debugobjects.h>
#include <linux/kmemleak.h>
+#include <trace/events/mm.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -1440,6 +1441,7 @@ zonelist_scan:
mark = zone->pages_high;
if (!zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags)) {
+ trace_mm_page_allocation(zone_page_state(zone, NR_FREE_PAGES));
if (!zone_reclaim_mode ||
!zone_reclaim(zone, gfp_mask, order))
goto this_zone_full;
diff --git a/mm/rmap.c b/mm/rmap.c
index 23122af..f2156ca 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -50,6 +50,7 @@
#include <linux/memcontrol.h>
#include <linux/mmu_notifier.h>
#include <linux/migrate.h>
+#include <trace/events/mm.h>

#include <asm/tlbflush.h>

@@ -1025,6 +1026,7 @@ static int try_to_unmap_anon(struct page *page, int unlock, int migration)
if (mlocked)
break; /* stop if actually mlocked page */
}
+ trace_mm_anon_unmap(vma->vm_mm, vma->vm_start+page->index);
}

page_unlock_anon_vma(anon_vma);
@@ -1152,6 +1154,7 @@ static int try_to_unmap_file(struct page *page, int unlock, int migration)
goto out;
}
vma->vm_private_data = (void *) max_nl_cursor;
+ trace_mm_filemap_unmap(vma->vm_mm, vma->vm_start+page->index);
}
cond_resched_lock(&mapping->i_mmap_lock);
max_nl_cursor += CLUSTER_SIZE;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 95c08a8..bed7125 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -40,6 +40,8 @@
#include <linux/memcontrol.h>
#include <linux/delayacct.h>
#include <linux/sysctl.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/mm.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -417,6 +419,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
ClearPageReclaim(page);
}
inc_zone_page_state(page, NR_VMSCAN_WRITE);
+ trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT,
+ PageAnon(page));
return PAGE_SUCCESS;
}

@@ -796,6 +800,7 @@ keep:
if (pagevec_count(&freed_pvec))
__pagevec_free(&freed_pvec);
count_vm_events(PGACTIVATE, pgactivate);
+ trace_mm_pagereclaim_free(nr_reclaimed);
return nr_reclaimed;
}

@@ -1182,6 +1187,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
done:
local_irq_enable();
pagevec_release(&pvec);
+ trace_mm_pagereclaim_shrinkinactive(nr_scanned, nr_reclaimed,
+ priority);
return nr_reclaimed;
}

@@ -1316,6 +1323,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
if (buffer_heads_over_limit)
pagevec_strip(&pvec);
pagevec_release(&pvec);
+ trace_mm_pagereclaim_shrinkactive(pgscanned, file, priority);
}

static int inactive_anon_is_low_global(struct zone *zone)
@@ -1516,6 +1524,7 @@ static void shrink_zone(int priority, struct zone *zone,
}

sc->nr_reclaimed = nr_reclaimed;
+ trace_mm_pagereclaim_shrinkzone(nr_reclaimed, priority);

/*
* Even if we did not try to evict anon pages at all, we want to
@@ -1678,6 +1687,8 @@ out:
if (priority < 0)
priority = 0;

+ trace_mm_directreclaim_reclaimall(zonelist[0]._zonerefs->zone->node,
+ sc->nr_reclaimed, priority);
if (scanning_global_lru(sc)) {
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {

@@ -1947,6 +1958,7 @@ out:
goto loop_again;
}

+ trace_mm_kswapd_ran(pgdat, sc.nr_reclaimed);
return sc.nr_reclaimed;
}

@@ -2299,7 +2311,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
const unsigned long nr_pages = 1 << order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
- int priority;
+ int priority = ZONE_RECLAIM_PRIORITY;
struct scan_control sc = {
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2366,6 +2378,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)

p->reclaim_state = NULL;
current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
+ trace_mm_directreclaim_reclaimzone(zone->node,
+ sc.nr_reclaimed, priority);
return sc.nr_reclaimed >= nr_pages;
}


Attachments:
mmtracepoints-617.diff (16.84 kB)

2009-06-18 07:58:04

by KOSAKI Motohiro

Subject: Re: [Patch] mm tracepoints update - use case.

Hi,

Sorry for the delay in replying.
Your questions are always difficult...


> KOSAKI Motohiro wrote:
> >> On Wed, 2009-04-22 at 08:07 -0400, Larry Woodman wrote:
>
> >> Attached is an example of what the mm tracepoints can be used for:
> >
> > I have some comments.
> >
> > 1. Yes, the current zone_reclaim has strange behavior. I plan to fix
> > some of the bug-like behavior.
> > 2. Your scenario only uses the information that "zone_reclaim was called".
> > The function tracer already provides that.
> > 3. But yes, you are going in the right direction. We definitely need
> > some fine-grained tracepoints in this area, and you are welcome here.
> > But in my personal opinion, many of your tracepoint arguments are
> > worthless. We need better information.
> > I think I can help you in this area. I hope we can work together.
>
> Sorry I am replying to a really old email, but exactly
> what information do you believe would be more useful to
> extract from vmscan.c with tracepoints?
>
> What are the kinds of problems that customer systems
> (which cannot be rebooted into experimental kernels)
> run into, that can be tracked down with tracepoints?
>
> I can think of a few:
> - excessive CPU use in page reclaim code
> - excessive reclaim latency in page reclaim code
> - unbalanced memory allocation between zones/nodes
> - strange balance problems between reclaiming of page
> cache and swapping out process pages
>
> I suspect we would need fairly fine grained tracepoints
> to track down these kinds of problems, with filtering
> and/or interpretation in userspace, but I am always
> interested in easier ways of tracking down these kinds
> of problems :)
>
> What kinds of tracepoints do you believe we would need?
>
> Or, using Larry's patch as a starting point, what do you
> believe should be changed?

OK, I recognize that we need more use-case discussion.
The following scenarios are the issues I receive most frequently.
(Perhaps there are other issues I have not written down, but I don't recall them now.)

Scenario 1. The OOM killer was invoked. Why? And what caused it?
Scenario 2. Page allocation failure caused by memory fragmentation.
Scenario 3. try_to_free_pages() has very long latency. Why?
Scenario 4. sar output shows that free memory dropped dramatically 10 minutes ago and
has already recovered. What happened?

- suspects
- kernel memory leak
- userland memory leak
- a stupid driver using too much memory
- a userland application suddenly starting to use a lot of memory

- what information is valuable?
- slab usage information (kmemtrace already provides this)
- page allocator usage information
- rss of all processes when the OOM happened
- why can't recent try_to_free_pages() calls reclaim any pages?
- recent syscall history
- buddy fragmentation info


Plus, some other requirements:
1. trace page refault distance (like Rik's past /proc/refault patch)

2. file cache visualizer - which files use a lot of page cache?
- afaik, Wu Fengguang is working on this issue.


--------------------------------------------
And here are my review comments on his patch.
btw, I haven't fully reviewed it yet, so I may be overlooking something.


First, some general review comments.

- Please don't display mm and/or other raw kernel pointers.
If we assume a non-stop system, we can't take a kernel dump, so logging kernel
pointers is not very useful.
No userland tool can parse them. (/proc/kcore doesn't help in this situation;
the object might be freed before it is parsed.)
- Please make a patch series. One big patch is harder to review.
- Please write a patch description and use case.
- Please consider how this feature works with mem-cgroup.
(IOW, please don't ignore the many "if (scanning_global_lru())" checks)
- The tracepoint caller shouldn't make any assumptions about the displayed representation.
e.g.
wrong) trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT, PageAnon(page));
good) trace_mm_pagereclaim_pgout(mapping, page)
That is the general, good practice for callbacks and/or hooks.
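
The last point can be illustrated with a small user-space model. This is only a sketch: `model_page`, `pgout_event`, `MODEL_PAGE_SHIFT` and both functions are hypothetical stand-ins, not kernel definitions and not part of the posted patch. The hook receives the page itself, so the shift and the anonymous/pagecache decision are made in one place instead of at every call site.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for PAGE_SHIFT; 12 assumes 4 kB pages. */
#define MODEL_PAGE_SHIFT 12

/* Hypothetical, simplified model of a page; not the kernel's struct page. */
struct model_page {
	unsigned long index;	/* offset within the mapping, in pages */
	int anon;		/* non-zero for an anonymous page */
};

/* What the event stores; the display format is decided later. */
struct pgout_event {
	unsigned long offset;	/* byte offset derived from the page */
	int anon;
};

/* The hook takes the object; callers do not pre-compute display values. */
static struct pgout_event model_trace_pgout(const struct model_page *page)
{
	struct pgout_event ev;

	assert(page != NULL);
	ev.offset = page->index << MODEL_PAGE_SHIFT;
	ev.anon = (page->anon != 0);
	return ev;
}

/* Formatting is kept separate from recording, as TP_printk is in the patch. */
static int model_format_pgout(const struct pgout_event *ev, char *buf, size_t len)
{
	assert(ev != NULL && buf != NULL);
	return snprintf(buf, len, "offset=%lx %s", ev->offset,
			ev->anon ? "anonymous" : "pagecache");
}
```

If the displayed fields change later, only the two functions above need to change; the call sites keep passing the page.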




> diff --git a/include/trace/events/mm.h b/include/trace/events/mm.h
> new file mode 100644
> index 0000000..1d888a4
> --- /dev/null
> +++ b/include/trace/events/mm.h
> @@ -0,0 +1,436 @@
> +#if !defined(_TRACE_MM_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_MM_H
> +
> +#include <linux/mm.h>
> +#include <linux/tracepoint.h>
> +
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM mm
> +
> +TRACE_EVENT(mm_anon_fault,
> +
> + TP_PROTO(struct mm_struct *mm, unsigned long address),
> +
> + TP_ARGS(mm, address),
> +
> + TP_STRUCT__entry(
> + __field(struct mm_struct *, mm)
> + __field(unsigned long, address)
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + __entry->address = address;
> + ),
> +
> + TP_printk("mm=%lx address=%lx",
> + (unsigned long)__entry->mm, __entry->address)
> +);
> +
> +TRACE_EVENT(mm_anon_pgin,
> +
> + TP_PROTO(struct mm_struct *mm, unsigned long address),
> +
> + TP_ARGS(mm, address),
> +
> + TP_STRUCT__entry(
> + __field(struct mm_struct *, mm)
> + __field(unsigned long, address)
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + __entry->address = address;
> + ),
> +
> + TP_printk("mm=%lx address=%lx",
> + (unsigned long)__entry->mm, __entry->address)
> + );
> +
> +TRACE_EVENT(mm_anon_cow,
> +
> + TP_PROTO(struct mm_struct *mm,
> + unsigned long address),
> +
> + TP_ARGS(mm, address),
> +
> + TP_STRUCT__entry(
> + __field(struct mm_struct *, mm)
> + __field(unsigned long, address)
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + __entry->address = address;
> + ),
> +
> + TP_printk("mm=%lx address=%lx",
> + (unsigned long)__entry->mm, __entry->address)
> + );
> +
> +TRACE_EVENT(mm_anon_userfree,
> +
> + TP_PROTO(struct mm_struct *mm,
> + unsigned long address),
> +
> + TP_ARGS(mm, address),
> +
> + TP_STRUCT__entry(
> + __field(struct mm_struct *, mm)
> + __field(unsigned long, address)
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + __entry->address = address;
> + ),
> +
> + TP_printk("mm=%lx address=%lx",
> + (unsigned long)__entry->mm, __entry->address)
> + );
> +
> +TRACE_EVENT(mm_anon_unmap,
> +
> + TP_PROTO(struct mm_struct *mm, unsigned long address),
> +
> + TP_ARGS(mm, address),
> +
> + TP_STRUCT__entry(
> + __field(struct mm_struct *, mm)
> + __field(unsigned long, address)
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + __entry->address = address;
> + ),
> +
> + TP_printk("mm=%lx address=%lx",
> + (unsigned long)__entry->mm, __entry->address)
> + );
> +
> +TRACE_EVENT(mm_filemap_fault,
> +
> + TP_PROTO(struct mm_struct *mm, unsigned long address, int flag),
> + TP_ARGS(mm, address, flag),
> +
> + TP_STRUCT__entry(
> + __field(struct mm_struct *, mm)
> + __field(unsigned long, address)
> + __field(int, flag)
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + __entry->address = address;
> + __entry->flag = flag;
> + ),
> +
> + TP_printk("%s: mm=%lx address=%lx",
> + __entry->flag ? "pagein" : "primary fault",
> + (unsigned long)__entry->mm, __entry->address)
> + );
> +
> +TRACE_EVENT(mm_filemap_cow,
> +
> + TP_PROTO(struct mm_struct *mm, unsigned long address),
> +
> + TP_ARGS(mm, address),
> +
> + TP_STRUCT__entry(
> + __field(struct mm_struct *, mm)
> + __field(unsigned long, address)
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + __entry->address = address;
> + ),
> +
> + TP_printk("mm=%lx address=%lx",
> + (unsigned long)__entry->mm, __entry->address)
> + );
> +
> +TRACE_EVENT(mm_filemap_unmap,
> +
> + TP_PROTO(struct mm_struct *mm, unsigned long address),
> +
> + TP_ARGS(mm, address),
> +
> + TP_STRUCT__entry(
> + __field(struct mm_struct *, mm)
> + __field(unsigned long, address)
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + __entry->address = address;
> + ),
> +
> + TP_printk("mm=%lx address=%lx",
> + (unsigned long)__entry->mm, __entry->address)
> + );
> +
> +TRACE_EVENT(mm_filemap_userunmap,
> +
> + TP_PROTO(struct mm_struct *mm, unsigned long address),
> +
> + TP_ARGS(mm, address),
> +
> + TP_STRUCT__entry(
> + __field(struct mm_struct *, mm)
> + __field(unsigned long, address)
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + __entry->address = address;
> + ),
> +
> + TP_printk("mm=%lx address=%lx",
> + (unsigned long)__entry->mm, __entry->address)
> + );
> +
> +TRACE_EVENT(mm_pagereclaim_pgout,
> +
> + TP_PROTO(struct address_space *mapping, unsigned long offset, int anon),
> +
> + TP_ARGS(mapping, offset, anon),
> +
> + TP_STRUCT__entry(
> + __field(struct address_space *, mapping)
> + __field(unsigned long, offset)
> + __field(int, anon)
> + ),
> +
> + TP_fast_assign(
> + __entry->mapping = mapping;
> + __entry->offset = offset;
> + __entry->anon = anon;
> + ),
> +
> + TP_printk("mapping=%lx, offset=%lx %s",
> + (unsigned long)__entry->mapping, __entry->offset,
> + __entry->anon ? "anonymous" : "pagecache")
> + );
> +
> +TRACE_EVENT(mm_pagereclaim_free,
> +
> + TP_PROTO(unsigned long nr_reclaimed),
> +
> + TP_ARGS(nr_reclaimed),
> +
> + TP_STRUCT__entry(
> + __field(unsigned long, nr_reclaimed)
> + ),
> +
> + TP_fast_assign(
> + __entry->nr_reclaimed = nr_reclaimed;
> + ),
> +
> + TP_printk("freed=%ld", __entry->nr_reclaimed)
> + );
> +
> +TRACE_EVENT(mm_pdflush_bgwriteout,
> +
> + TP_PROTO(unsigned long written),
> +
> + TP_ARGS(written),
> +
> + TP_STRUCT__entry(
> + __field(unsigned long, written)
> + ),
> +
> + TP_fast_assign(
> + __entry->written = written;
> + ),
> +
> + TP_printk("written=%ld", __entry->written)
> + );
> +
> +TRACE_EVENT(mm_pdflush_kupdate,
> +
> + TP_PROTO(unsigned long writes),
> +
> + TP_ARGS(writes),
> +
> + TP_STRUCT__entry(
> + __field(unsigned long, writes)
> + ),
> +
> + TP_fast_assign(
> + __entry->writes = writes;
> + ),
> +
> + TP_printk("writes=%ld", __entry->writes)
> + );
> +
> +TRACE_EVENT(mm_balance_dirty,
> +
> + TP_PROTO(unsigned long written),
> +
> + TP_ARGS(written),
> +
> + TP_STRUCT__entry(
> + __field(unsigned long, written)
> + ),
> +
> + TP_fast_assign(
> + __entry->written = written;
> + ),
> +
> + TP_printk("written=%ld", __entry->written)
> + );
> +
> +TRACE_EVENT(mm_page_allocation,
> +
> + TP_PROTO(unsigned long free),
> +
> + TP_ARGS(free),
> +
> + TP_STRUCT__entry(
> + __field(unsigned long, free)
> + ),
> +
> + TP_fast_assign(
> + __entry->free = free;
> + ),
> +
> + TP_printk("zone_free=%ld", __entry->free)
> + );
> +
> +TRACE_EVENT(mm_kswapd_ran,
> +
> + TP_PROTO(struct pglist_data *pgdat, unsigned long reclaimed),
> +
> + TP_ARGS(pgdat, reclaimed),
> +
> + TP_STRUCT__entry(
> + __field(struct pglist_data *, pgdat)
> + __field(int, node_id)
> + __field(unsigned long, reclaimed)
> + ),
> +
> + TP_fast_assign(
> + __entry->pgdat = pgdat;
> + __entry->node_id = pgdat->node_id;
> + __entry->reclaimed = reclaimed;
> + ),
> +
> + TP_printk("node=%d reclaimed=%ld", __entry->node_id, __entry->reclaimed)
> + );
> +
> +TRACE_EVENT(mm_directreclaim_reclaimall,
> +
> + TP_PROTO(int node, unsigned long reclaimed, unsigned long priority),
> +
> + TP_ARGS(node, reclaimed, priority),
> +
> + TP_STRUCT__entry(
> + __field(int, node)
> + __field(unsigned long, reclaimed)
> + __field(unsigned long, priority)
> + ),
> +
> + TP_fast_assign(
> + __entry->node = node;
> + __entry->reclaimed = reclaimed;
> + __entry->priority = priority;
> + ),
> +
> + TP_printk("node=%d reclaimed=%ld priority=%ld", __entry->node, __entry->reclaimed,
> + __entry->priority)
> + );
> +
> +TRACE_EVENT(mm_directreclaim_reclaimzone,
> +
> + TP_PROTO(int node, unsigned long reclaimed, unsigned long priority),
> +
> + TP_ARGS(node, reclaimed, priority),
> +
> + TP_STRUCT__entry(
> + __field(int, node)
> + __field(unsigned long, reclaimed)
> + __field(unsigned long, priority)
> + ),
> +
> + TP_fast_assign(
> + __entry->node = node;
> + __entry->reclaimed = reclaimed;
> + __entry->priority = priority;
> + ),
> +
> + TP_printk("node = %d reclaimed=%ld, priority=%ld",
> + __entry->node, __entry->reclaimed, __entry->priority)
> + );
> +TRACE_EVENT(mm_pagereclaim_shrinkzone,
> +
> + TP_PROTO(unsigned long reclaimed, unsigned long priority),
> +
> + TP_ARGS(reclaimed, priority),
> +
> + TP_STRUCT__entry(
> + __field(unsigned long, reclaimed)
> + __field(unsigned long, priority)
> + ),
> +
> + TP_fast_assign(
> + __entry->reclaimed = reclaimed;
> + __entry->priority = priority;
> + ),
> +
> + TP_printk("reclaimed=%ld priority=%ld",
> + __entry->reclaimed, __entry->priority)
> + );
> +
> +TRACE_EVENT(mm_pagereclaim_shrinkactive,
> +
> + TP_PROTO(unsigned long scanned, int file, int priority),
> +
> + TP_ARGS(scanned, file, priority),
> +
> + TP_STRUCT__entry(
> + __field(unsigned long, scanned)
> + __field(int, file)
> + __field(int, priority)
> + ),
> +
> + TP_fast_assign(
> + __entry->scanned = scanned;
> + __entry->file = file;
> + __entry->priority = priority;
> + ),
> +
> + TP_printk("scanned=%ld, %s, priority=%d",
> + __entry->scanned, __entry->file ? "pagecache" : "anonymous",
> + __entry->priority)
> + );
> +
> +TRACE_EVENT(mm_pagereclaim_shrinkinactive,
> +
> + TP_PROTO(unsigned long scanned, unsigned long reclaimed,
> + int priority),
> +
> + TP_ARGS(scanned, reclaimed, priority),
> +
> + TP_STRUCT__entry(
> + __field(unsigned long, scanned)
> + __field(unsigned long, reclaimed)
> + __field(int, priority)
> + ),
> +
> + TP_fast_assign(
> + __entry->scanned = scanned;
> + __entry->reclaimed = reclaimed;
> + __entry->priority = priority;
> + ),
> +
> + TP_printk("scanned=%ld, reclaimed=%ld, priority=%d",
> + __entry->scanned, __entry->reclaimed,
> + __entry->priority)
> + );
> +
> +#endif /* _TRACE_MM_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1b60f30..af4a964 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -34,6 +34,7 @@
> #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
> #include <linux/memcontrol.h>
> #include <linux/mm_inline.h> /* for page_is_file_cache() */
> +#include <trace/events/mm.h>
> #include "internal.h"
>
> /*
> @@ -1568,6 +1569,8 @@ retry_find:
> */
> ra->prev_pos = (loff_t)page->index << PAGE_CACHE_SHIFT;
> vmf->page = page;
> + trace_mm_filemap_fault(vma->vm_mm, (unsigned long)vmf->virtual_address,
> + vmf->flags&FAULT_FLAG_NONLINEAR);
> return ret | VM_FAULT_LOCKED;
>
> no_cached_page:
> diff --git a/mm/memory.c b/mm/memory.c
> index 4126dd1..a4a580c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -61,6 +61,7 @@
> #include <asm/tlb.h>
> #include <asm/tlbflush.h>
> #include <asm/pgtable.h>
> +#include <trace/events/mm.h>
>
> #include "internal.h"
>
> @@ -812,15 +813,17 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> addr) != page->index)
> set_pte_at(mm, addr, pte,
> pgoff_to_pte(page->index));
> - if (PageAnon(page))
> + if (PageAnon(page)) {
> anon_rss--;
> - else {
> + trace_mm_anon_userfree(mm, addr);
> + } else {
> if (pte_dirty(ptent))
> set_page_dirty(page);
> if (pte_young(ptent) &&
> likely(!VM_SequentialReadHint(vma)))
> mark_page_accessed(page);
> file_rss--;
> + trace_mm_filemap_userunmap(mm, addr);
> }
> page_remove_rmap(page);
> if (unlikely(page_mapcount(page) < 0))
> @@ -1896,7 +1899,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, pte_t *page_table, pmd_t *pmd,
> spinlock_t *ptl, pte_t orig_pte)
> {
> - struct page *old_page, *new_page;
> + struct page *old_page, *new_page = NULL;
> pte_t entry;
> int reuse = 0, ret = 0;
> int page_mkwrite = 0;
> @@ -2050,9 +2053,12 @@ gotten:
> if (!PageAnon(old_page)) {
> dec_mm_counter(mm, file_rss);
> inc_mm_counter(mm, anon_rss);
> + trace_mm_filemap_cow(mm, address);
> }
> - } else
> + } else {
> inc_mm_counter(mm, anon_rss);
> + trace_mm_anon_cow(mm, address);
> + }
> flush_cache_page(vma, address, pte_pfn(orig_pte));
> entry = mk_pte(new_page, vma->vm_page_prot);
> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> @@ -2449,7 +2455,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> int write_access, pte_t orig_pte)
> {
> spinlock_t *ptl;
> - struct page *page;
> + struct page *page = NULL;
> swp_entry_t entry;
> pte_t pte;
> struct mem_cgroup *ptr = NULL;
> @@ -2549,6 +2555,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> unlock:
> pte_unmap_unlock(page_table, ptl);
> out:
> + trace_mm_anon_pgin(mm, address);
> return ret;
> out_nomap:
> mem_cgroup_cancel_charge_swapin(ptr);

In swap-in, you trace "mm" and "virtual address", but in swap-out, you trace "mapping" and
"virtual address".

Oh well, we can't compare the swap-in and swap-out logs. Please consider making input and output symmetric.


> @@ -2582,6 +2589,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
> goto oom;
> __SetPageUptodate(page);
>
> + trace_mm_anon_fault(mm, address);
> if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
> goto oom_free_page;
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index bb553c3..ef92a97 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -34,6 +34,7 @@
> #include <linux/syscalls.h>
> #include <linux/buffer_head.h>
> #include <linux/pagevec.h>
> +#include <trace/events/mm.h>
>
> /*
> * The maximum number of pages to writeout in a single bdflush/kupdate
> @@ -574,6 +575,7 @@ static void balance_dirty_pages(struct address_space *mapping)
> congestion_wait(WRITE, HZ/10);
> }
>
> + trace_mm_balance_dirty(pages_written);

Perhaps you need to explain why this tracepoint is useful.
I haven't used this kind of log in my past debugging.

If you only need the number of written pages, perhaps a new vmstat field
would be more useful?


> if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> bdi->dirty_exceeded)
> bdi->dirty_exceeded = 0;
> @@ -716,6 +718,7 @@ static void background_writeout(unsigned long _min_pages)
> break;
> }
> }
> + trace_mm_pdflush_bgwriteout(_min_pages);
> }

ditto.


>
> /*
> @@ -776,6 +779,7 @@ static void wb_kupdate(unsigned long arg)
> nr_to_write = global_page_state(NR_FILE_DIRTY) +
> global_page_state(NR_UNSTABLE_NFS) +
> (inodes_stat.nr_inodes - inodes_stat.nr_unused);
> + trace_mm_pdflush_kupdate(nr_to_write);
> while (nr_to_write > 0) {
> wbc.more_io = 0;
> wbc.encountered_congestion = 0;

ditto.


> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0727896..ca9355e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -48,6 +48,7 @@
> #include <linux/page_cgroup.h>
> #include <linux/debugobjects.h>
> #include <linux/kmemleak.h>
> +#include <trace/events/mm.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -1440,6 +1441,7 @@ zonelist_scan:
> mark = zone->pages_high;
> if (!zone_watermark_ok(zone, order, mark,
> classzone_idx, alloc_flags)) {
> + trace_mm_page_allocation(zone_page_state(zone, NR_FREE_PAGES));
> if (!zone_reclaim_mode ||
> !zone_reclaim(zone, gfp_mask, order))
> goto this_zone_full;

Bad name.
It is not a notification of an allocation.

Plus, this is the wrong place too. It doesn't mean allocation failure.

It only means that a zone doesn't have sufficient memory.
However, this tracepoint doesn't have a zone argument, so it is not useful at all.

Plus, NR_FREE_PAGES is not sufficient information. The most common reason
for allocation failure is not low NR_FREE_PAGES; it is buddy fragmentation.
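
To make the fragmentation point concrete, here is a minimal user-space sketch. It assumes an array of free-block counts per order, such as one zone's row in /proc/buddyinfo; the function names are hypothetical, and watermarks and migrate types are deliberately ignored. It shows that a large free-page total says nothing about whether a higher-order request can be satisfied.

```c
#include <assert.h>
#include <stddef.h>

/*
 * nr_free[o] is the number of free blocks of 2^o pages.
 * Returns the total number of free pages across all orders.
 */
static unsigned long buddy_free_pages(const unsigned long *nr_free, size_t nr_orders)
{
	unsigned long total = 0;
	size_t o;

	assert(nr_free != NULL);
	for (o = 0; o < nr_orders; o++)
		total += nr_free[o] << o;
	return total;
}

/*
 * Returns 1 if at least one free block of the requested order or larger
 * exists, 0 otherwise. Watermark checks are not modelled here.
 */
static int buddy_has_block(const unsigned long *nr_free, size_t nr_orders, size_t order)
{
	size_t o;

	assert(nr_free != NULL);
	for (o = order; o < nr_orders; o++)
		if (nr_free[o] > 0)
			return 1;
	return 0;
}
```

With 4096 order-0 blocks and 512 order-1 blocks, the total is 5120 free pages, yet an order-2 request finds no block. That is the case a free-page count alone cannot show.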




> diff --git a/mm/rmap.c b/mm/rmap.c
> index 23122af..f2156ca 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -50,6 +50,7 @@
> #include <linux/memcontrol.h>
> #include <linux/mmu_notifier.h>
> #include <linux/migrate.h>
> +#include <trace/events/mm.h>
>
> #include <asm/tlbflush.h>
>
> @@ -1025,6 +1026,7 @@ static int try_to_unmap_anon(struct page *page, int unlock, int migration)
> if (mlocked)
> break; /* stop if actually mlocked page */
> }
> + trace_mm_anon_unmap(vma->vm_mm, vma->vm_start+page->index);
> }
>
> page_unlock_anon_vma(anon_vma);
> @@ -1152,6 +1154,7 @@ static int try_to_unmap_file(struct page *page, int unlock, int migration)
> goto out;
> }
> vma->vm_private_data = (void *) max_nl_cursor;
> + trace_mm_filemap_unmap(vma->vm_mm, vma->vm_start+page->index);
> }
> cond_resched_lock(&mapping->i_mmap_lock);
> max_nl_cursor += CLUSTER_SIZE;

try_to_unmap() and try_to_unlock() are quite different.
Maybe we only need the try_to_unmap() case?




> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 95c08a8..bed7125 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -40,6 +40,8 @@
> #include <linux/memcontrol.h>
> #include <linux/delayacct.h>
> #include <linux/sysctl.h>
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/mm.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -417,6 +419,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
> ClearPageReclaim(page);
> }
> inc_zone_page_state(page, NR_VMSCAN_WRITE);
> + trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT,
> + PageAnon(page));

I don't think this is useful information.

for file-mapped)
a [mapping, offset] pair represents which portion of the file this cache page points to.
for swap-backed)
[process, virtual_address] ..


Plus, I have one question. How do we combine this information with blktrace?
If we can't see the relationship to I/O activity, this is not really useful.


> return PAGE_SUCCESS;
> }
>
> @@ -796,6 +800,7 @@ keep:
> if (pagevec_count(&freed_pvec))
> __pagevec_free(&freed_pvec);
> count_vm_events(PGACTIVATE, pgactivate);
> + trace_mm_pagereclaim_free(nr_reclaimed);
> return nr_reclaimed;
> }

No.
If the administrator only needs the number of free pages,
/proc/meminfo and /proc/vmstat already provide it.

But I don't think that is sufficient information.
May I ask when you use this tracepoint, and why?



>
> @@ -1182,6 +1187,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
> done:
> local_irq_enable();
> pagevec_release(&pvec);
> + trace_mm_pagereclaim_shrinkinactive(nr_scanned, nr_reclaimed,
> + priority);
> return nr_reclaimed;
> }
>
> @@ -1316,6 +1323,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
> if (buffer_heads_over_limit)
> pagevec_strip(&pvec);
> pagevec_release(&pvec);
> + trace_mm_pagereclaim_shrinkactive(pgscanned, file, priority);
> }
>
> static int inactive_anon_is_low_global(struct zone *zone)
> @@ -1516,6 +1524,7 @@ static void shrink_zone(int priority, struct zone *zone,
> }
>
> sc->nr_reclaimed = nr_reclaimed;
> + trace_mm_pagereclaim_shrinkzone(nr_reclaimed, priority);
>
> /*
> * Even if we did not try to evict anon pages at all, we want to
> @@ -1678,6 +1687,8 @@ out:
> if (priority < 0)
> priority = 0;
>
> + trace_mm_directreclaim_reclaimall(zonelist[0]._zonerefs->zone->node,
> + sc->nr_reclaimed, priority);
> if (scanning_global_lru(sc)) {
> for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>

Why do you want to log the node? Why not the zone itself?

Plus, why do you ignore try_to_free_pages() latency?



> @@ -1947,6 +1958,7 @@ out:
> goto loop_again;
> }
>
> + trace_mm_kswapd_ran(pgdat, sc.nr_reclaimed);
> return sc.nr_reclaimed;
> }
>

Is this equal to the kswapd_steal field in /proc/vmstat?
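
For comparison, a counter of this kind can be read from a snapshot of /proc/vmstat, which is laid out as one "name value" pair per line. The following is a rough sketch only: `vmstat_lookup` is a hypothetical helper, it works on text already read into memory, and the counter name is taken from the question above and may differ on other kernel versions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find "key value" in vmstat-style text (one pair per line).
 * Returns 0 and stores the value on success, -1 if the key is absent
 * or its value cannot be parsed.
 */
static int vmstat_lookup(const char *text, const char *key, unsigned long long *value)
{
	const char *line = text;
	size_t klen;

	assert(text != NULL && key != NULL && value != NULL);
	klen = strlen(key);

	while (*line != '\0') {
		const char *eol = strchr(line, '\n');

		/* Match the whole name, not just a prefix of a longer one. */
		if (strncmp(line, key, klen) == 0 && line[klen] == ' ')
			return sscanf(line + klen, "%llu", value) == 1 ? 0 : -1;

		if (eol == NULL)
			break;
		line = eol + 1;
	}
	return -1;
}
```

Taking two snapshots and subtracting the values gives the pages reclaimed by kswapd over the interval, which is the aggregate the tracepoint would otherwise report per run.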


> @@ -2299,7 +2311,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> const unsigned long nr_pages = 1 << order;
> struct task_struct *p = current;
> struct reclaim_state reclaim_state;
> - int priority;
> + int priority = ZONE_RECLAIM_PRIORITY;
> struct scan_control sc = {
> .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
> .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
> @@ -2366,6 +2378,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>
> p->reclaim_state = NULL;
> current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
> + trace_mm_directreclaim_reclaimzone(zone->node,
> + sc.nr_reclaimed, priority);
> return sc.nr_reclaimed >= nr_pages;
> }

This is _zone_ reclaim, but the code passes the node.
Plus, if we are going to log page allocation and reclaim, we shouldn't ignore
gfp_mask.

It changes a lot of allocation/reclaim behavior.


----
My current conclusion is that nobody uses this patch on their own system.
The patch has many tracepoints whose usefulness is unclear.

At least, the patch needs to be split up for a productive discussion.
e.g.
- reclaim IO activity tracing
- memory fragmentation visualizer
- per-inode page cache visualizer (like Wu's filecache patch)
- reclaim failure reason tracing and an aggregating ftrace plugin
- reclaim latency tracing


I'd be glad if Larry resubmits this effort.


2009-06-18 19:23:19

by Larry Woodman

Subject: Re: [Patch] mm tracepoints update - use case.

On Thu, 2009-06-18 at 16:57 +0900, KOSAKI Motohiro wrote:

Thanks for the feedback Kosaki!


> Scenario 1. The OOM killer was invoked. Why? And what caused it?

Doesn't the showmem() and stack trace printed to the console when the OOM kill
occurred show enough in the majority of cases? I realize that direct
alloc_pages() calls are not accounted for here, but that can be really
invasive.

> Scenario 2. Page allocation failure caused by memory fragmentation.

Are you talking about order>0 allocation failures here? Most of the
slabs are single page allocations now.

> Scenario 3. try_to_free_pages() has very long latency. Why?

This is available in the mm tracepoints; they all include timestamps.
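
As a rough illustration of deriving a latency from those timestamps, the sketch below parses the "TASK-PID [CPU] TIMESTAMP:" prefix shown in the sample trace output and converts it to microseconds. The helper name is hypothetical, and it assumes the fractional part always has six digits and that two trace lines bracket the interval of interest.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse a line such as
 *   "pdflush-624 [004] 184.293169: wb_kupdate:"
 * and store the timestamp in microseconds.
 * Returns 0 on success, -1 if the line does not match the expected layout.
 * Assumes the task name contains no spaces.
 */
static int trace_line_usecs(const char *line, unsigned long long *usecs)
{
	unsigned long long sec, frac;

	assert(line != NULL && usecs != NULL);
	if (sscanf(line, "%*s [%*d] %llu.%6llu:", &sec, &frac) != 2)
		return -1;
	*usecs = sec * 1000000ULL + frac;
	return 0;
}
```

Subtracting the value for one event from the value for a later event gives the elapsed time between them. Measuring the latency of a single call this way needs an event at both entry and exit.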

> Scenario 4. sar output shows that free memory dropped dramatically 10 minutes ago and
> has already recovered. What happened?

Is this really important? It would take buffering lots of data to
figure out what happened in the past.

>
> - suspects
> - kernel memory leak

Other than direct callers to the page allocator, isn't that covered by
the kmemtrace stuff?

> - userland memory leak

The mm tracepoints track all user space allocations and frees (perhaps
too many?).

> - a stupid driver using too much memory

Hopefully kmemtrace will catch this?

> - a userland application suddenly starting to use a lot of memory

The mm tracepoints track all user space allocations and frees.

>
> - what information are valuable?
> - slab usage information (kmemtrace already does)
> - page allocator usage information
> - rss of all processes at oom happend
> - why recent try_to_free_pages() can't reclaim any page?

The counters in the mm tracepoints do give counts but not the reasons
that the pagereclaim code fails.

> - recent sycall history
> - buddy fragmentation info
>
>
> Plus, another requirement here
> 1. trace page refault distance (likes past Rik's /proc/refault patch)
>
> 2. file cache visualizer - Which file use many page-cache?
> - afaik, Wu Fengguang is working on this issue.
>
>
> --------------------------------------------
> And, here is my reviewing comment to his patch.
> btw, I haven't full review it yet. perhaps I might be overlooking something.
>
>
> First, this is general review comment.
>
> - Please don't display mm and/or another kernel raw pointer.
> if we assume non stop system, we can't use kernel-dump. Thus kernel pointer
> logging is not so useful.

OK, I just don't know how valuable the trace output is without some raw
data like the mm_struct.

> Any userland tools can't parse it. (/proc/kcore don't help this situation,
> the pointer might be freed before parsing)
> - Please makes patch series. one big patch is harder review.

OK.

> - Please write patch description and use-case.

OK.

> - Please consider how do this feature works on mem-cgroup.
> (IOW, please don't ignore many "if (scanning_global_lru())")
> - tracepoint caller shouldn't have any assumption of displaying representation.
> e.g.
> wrong) trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT, PageAnon(page));
> good) trace_mm_pagereclaim_pgout(mapping, page)

OK.

> that's general and good callback and/or hook manner.
>
>
>

2009-06-18 19:41:28

by Rik van Riel

[permalink] [raw]
Subject: Re: [Patch] mm tracepoints update - use case.

Larry Woodman wrote:

>> - Please don't display mm and/or another kernel raw pointer.
>> if we assume non stop system, we can't use kernel-dump. Thus kernel pointer
>> logging is not so useful.
>
> OK, I just dont know how valuable the trace output is with out some raw
> data like the mm_struct.

I believe that we do want something like the mm_struct in
the trace info, so we can figure out which process was
allocating pages, etc...

>> - Please consider how do this feature works on mem-cgroup.
>> (IOW, please don't ignore many "if (scanning_global_lru())")

Good point, we want to trace cgroup vs non-cgroup reclaims,
too.

>> - tracepoint caller shouldn't have any assumption of displaying representation.
>> e.g.
>> wrong) trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT, PageAnon(page));
>> good) trace_mm_pagereclaim_pgout(mapping, page)
>
> OK.
>
>> that's general and good callback and/or hook manner.

How do we figure those out from the page pointer at the time
the tracepoint triggers?

I believe that it would be useful to export that info in the
trace point, since we cannot expect the userspace trace tool
to figure out these things from the struct page address.

Or did I overlook something here?

--
All rights reversed.

2009-06-22 03:38:09

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [Patch] mm tracepoints update - use case.

> Larry Woodman wrote:
>
> >> - Please don't display mm and/or another kernel raw pointer.
> >> if we assume non stop system, we can't use kernel-dump. Thus kernel pointer
> >> logging is not so useful.
> >
> > OK, I just dont know how valuable the trace output is with out some raw
> > data like the mm_struct.
>
> I believe that we do want something like the mm_struct in
> the trace info, so we can figure out which process was
> allocating pages, etc...

Yes.
I think we need to print the tgid; for that, CONFIG_MM_OWNER needs to be improved.
Currently the CONFIG_MM_OWNER back-pointer points to a semi-random task_struct.


> >> - Please consider how do this feature works on mem-cgroup.
> >> (IOW, please don't ignore many "if (scanning_global_lru())")
>
> Good point, we want to trace cgroup vs non-cgroup reclaims,
> too.

Thank you.

>
> >> - tracepoint caller shouldn't have any assumption of displaying representation.
> >> e.g.
> >> wrong) trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT, PageAnon(page));
> >> good) trace_mm_pagereclaim_pgout(mapping, page)
> >
> > OK.
> >
> >> that's general and good callback and/or hook manner.
>
> How do we figure those out from the page pointer at the time
> the tracepoint triggers?
>
> I believe that it would be useful to export that info in the
> trace point, since we cannot expect the userspace trace tool
> to figure out these things from the struct page address.
>
> Or did I overlook something here?

Currently, TRACE_EVENT has a two-step information transformation.

- step1 - TP_fast_assign()
it is called directly from the tracepoint. It builds the ring-buffer representation.
- step2 - TP_printk
it is called when reading the debug/tracing/trace file. It builds the printable
representation from the ring-buffer data.

example:

trace_sched_switch() has three arguments: rq, prev, next.

--------------------------------------------------
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next)
{
(snip)
	trace_sched_switch(rq, prev, next);

-------------------------------------------------

TP_fast_assign extracts data from the argument pointers.
-----------------------------------------------------
TP_fast_assign(
	memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
	__entry->prev_pid = prev->pid;
	__entry->prev_prio = prev->prio;
	__entry->prev_state = prev->state;
	memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
	__entry->next_pid = next->pid;
	__entry->next_prio = next->prio;
),
-----------------------------------------------------


I think the mm tracepoints can do it the same way.
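
The two-step split described above can be illustrated outside the kernel.
The following is only a user-space sketch, not the real TRACE_EVENT macros:
struct sketch_page, struct pgout_entry, SKETCH_PAGE_SHIFT and both function
names are hypothetical stand-ins invented for this illustration. The caller
passes just the page pointer; the record step copies plain values out of it,
and the print step works only on the copied record.

```c
#include <assert.h>
#include <stdio.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4 KiB pages, for illustration only */

/* Hypothetical stand-in for the struct page fields used here. */
struct sketch_page {
	unsigned long index;	/* offset within the mapping, in pages */
	int anon;		/* nonzero if the page is anonymous */
};

/* Ring-buffer style record: plain values only, no live pointers. */
struct pgout_entry {
	unsigned long offset;	/* byte offset, computed once at record time */
	int anon;
};

/* Step 1, in the role of TP_fast_assign(): runs at the tracepoint site. */
void pgout_fast_assign(struct pgout_entry *entry, const struct sketch_page *page)
{
	assert(entry && page);
	entry->offset = page->index << SKETCH_PAGE_SHIFT;
	entry->anon = page->anon != 0;
}

/* Step 2, in the role of TP_printk(): runs later, when the trace is read. */
int pgout_print(char *buf, size_t len, const struct pgout_entry *entry)
{
	assert(buf && entry);
	return snprintf(buf, len, "offset=%lx anon=%d", entry->offset, entry->anon);
}
```

Because the record holds copied values, the page can change or be freed
between the two steps without affecting what is printed, and the call site
does not need to know how the event will be displayed.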



2009-06-22 03:38:26

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [Patch] mm tracepoints update - use case.

Hi

> Thanks for the feedback Kosaki!
>
>
> > Scenario 1. OOM killer happend. why? and who bring it?
>
> Doesnt the showmem() and stack trace to the console when the OOM kill
> occurred show enough in the majority of cases? I realize that direct
> alloc_pages() calls are not accounted for here but that can be really
> invasive.

showmem() displays the _result_ of memory usage and fragmentation,
but the administrator often needs to know the _reason_.

Plus, kmemtrace already traces slab allocate/free activity.
Do you mean you think this is really invasive?


> > Scenario 2. page allocation failure by memory fragmentation
>
> Are you talking about order>0 allocation failures here? Most of the
> slabs are single page allocations now.

Yes, order>0.
But I'm confused. Why do you talk about slab, not page alloc?

Note, non-x86 architectures frequently use order-1 allocations for
the stack.



> > Scenario 3. try_to_free_pages() makes very long latency. why?
>
> This is available in the mm tracepoints, they all include timestamps.

Perhaps not.
The administrator needs to know the reason, not the accumulated time. The time is only the result.

We can guess some reasons:
- IO congestion
- memory is consumed faster than it is reclaimed
- memory fragmentation

But these are only guesses. We often need to get data.


> > Scenario 4. sar output that free memory dramatically reduced at 10 minute ago, and
> > it already recover now. What's happen?
>
> Is this really important? It would take buffering lots of data to
> figure out what happened in the past.

OK, my scenario description was a bit wrong.

If a userland process explicitly consumes memory or explicitly writes
a lot of data, that is true.

Is this more appropriate?

"A userland process takes the same action periodically, but free memory
dropped only 10 minutes ago. Why?"



> > - suspects
> > - kernel memory leak
>
> Other than direct callers to the page allocator isnt that covered with
> the kmemtrace stuff?

Yeah.
Perhaps enhancing kmemtrace to cover the page allocator is a good approach.


> > - userland memory leak
>
> The mm tracepoints track all user space allocations and frees(perhaps
> too many?).

hmhm.


>
> > - stupid driver use too much memory
>
> hopefully kmemtrace will catch this?

Ditto.
I agree that enhancing kmemtrace is a good idea.

>
> > - userland application suddenly start to use much memory
>
> The mm tracepoints track all user space allocations and frees.

ok.


> > - what information are valuable?
> > - slab usage information (kmemtrace already does)
> > - page allocator usage information
> > - rss of all processes at oom happend
> > - why recent try_to_free_pages() can't reclaim any page?
>
> The counters in the mm tracepoints do give counts but not the reasons
> that the pagereclaim code fails.

That's a very important key point. Please don't ignore it.


2009-06-22 15:05:19

by Larry Woodman

[permalink] [raw]
Subject: Re: [Patch] mm tracepoints update - use case.

On Mon, 2009-06-22 at 12:37 +0900, KOSAKI Motohiro wrote:

Thanks for the feedback KOSAKI!


> > Larry Woodman wrote:
> >
> > >> - Please don't display mm and/or another kernel raw pointer.
> > >> if we assume non stop system, we can't use kernel-dump. Thus kernel pointer
> > >> logging is not so useful.
> > >
> > > OK, I just dont know how valuable the trace output is with out some raw
> > > data like the mm_struct.
> >
> > I believe that we do want something like the mm_struct in
> > the trace info, so we can figure out which process was
> > allocating pages, etc...
>
> Yes.
> I think we need to print tgid, it is needed to imporove CONFIG_MM_OWNER.
> current CONFIG_MM_OWNER back-pointer point to semi-random task_struct.

All of the tracepoints contain command, pid, CPU, timestamp and
tracepoint name information. Are you saying I should capture more
information, like the tgid, in specific mm tracepoints? If the answer
is yes, what would we need this for?


cat-10962 [005] 1877.984589: mm_anon_fault:
cat-10962 [005] 1877.984638: mm_anon_fault:
cat-10962 [005] 1877.984658: sched_switch:
cat-10962 [005] 1877.988359: sched_switch:

>
>
> > >> - Please consider how do this feature works on mem-cgroup.
> > >> (IOW, please don't ignore many "if (scanning_global_lru())")
> >
> > Good point, we want to trace cgroup vs non-cgroup reclaims,
> > too.
>
> thank you.

All of the mm tracepoints are located above the cgroup specific calls.
This means that they will capture exactly the same data regardless of
whether cgroups are used or not. Are you saying I should capture
whether the data was specific to a cgroup or it was from the global
LRUs?


>
> >
> > >> - tracepoint caller shouldn't have any assumption of displaying representation.
> > >> e.g.
> > >> wrong) trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT, PageAnon(page));
> > >> good) trace_mm_pagereclaim_pgout(mapping, page)
> > >
> > > OK.
> > >
> > >> that's general and good callback and/or hook manner.
> >
> > How do we figure those out from the page pointer at the time
> > the tracepoint triggers?
> >
> > I believe that it would be useful to export that info in the
> > trace point, since we cannot expect the userspace trace tool
> > to figure out these things from the struct page address.
> >
> > Or did I overlook something here?
>
> current, TRACE_EVENT have two step information trasformation.
>
> - step1 - TP_fast_assign()
> it is called from tracepoint directly. it makes ring-buffer representaion.
> - step2 - TP_printk
> it is called when reading debug/tracing/trace file. it makes printable
> representation from ring-buffer data.
>
> example:
>
> trace_sched_switch() has three argument, rq, prev, next.
>
> --------------------------------------------------
> static inline void
> context_switch(struct rq *rq, struct task_struct *prev,
> struct task_struct *next)
> {
> (snip)
> trace_sched_switch(rq, prev, next);
>
> -------------------------------------------------
>
> TP_fast_assing extract data from argument pointer.
> -----------------------------------------------------
> TP_fast_assign(
> memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> __entry->prev_pid = prev->pid;
> __entry->prev_prio = prev->prio;
> __entry->prev_state = prev->state;
> memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
> __entry->next_pid = next->pid;
> __entry->next_prio = next->prio;
> ),
> -----------------------------------------------------
>
>
> I think mm tracepoint can do the same way.

The sched_switch tracepoint tells us the name of the outgoing and
incoming process during a context switch so this information is very
significant to that tracepoint. To which mm tracepoint would I need to add
such information without it being redundant?

Thanks, Larry Woodman

>
>
>
>

2009-06-22 15:28:41

by Larry Woodman

[permalink] [raw]
Subject: Re: [Patch] mm tracepoints update - use case.

On Mon, 2009-06-22 at 12:37 +0900, KOSAKI Motohiro wrote:

Thanks for the feedback Kosaki!

> Hi
>
> > Thanks for the feedback Kosaki!
> >
> >
> > > Scenario 1. OOM killer happend. why? and who bring it?
> >
> > Doesnt the showmem() and stack trace to the console when the OOM kill
> > occurred show enough in the majority of cases? I realize that direct
> > alloc_pages() calls are not accounted for here but that can be really
> > invasive.
>
> showmem() display _result_ of memory usage and fragmentation.
> but Administrator often need to know the _reason_.

Right, that's why I have mm tracepoints in locations like shrink_zone,
shrink_active and shrink_inactive so we can drill down into exactly what
happened when either kswapd ran or a direct reclaim occurred out of the
page allocator. Since we will know the timestamps and the number of
pages scanned and reclaimed we can tell the reason the page reclamation
did not supply enough pages and therefore the OOM occurred.

Do you think this is enough information or do you think we need more?

>
> Plus, kmemtrace already trace slab allocate/free activity.
> You mean you think this is really invasive?
>
>
> > > Scenario 2. page allocation failure by memory fragmentation
> >
> > Are you talking about order>0 allocation failures here? Most of the
> > slabs are single page allocations now.
>
> Yes, order>0.
> but I confused. Why do you talk about slab, not page alloc?
>
> Note, non-x86 architecture freqently use order-1 allocation for
> making stack.

OK, I can add a tracepoint in the lumpy reclaim logic when it fails to
get enough contiguous memory to satisfy a high order allocation.

>
>
>
> > > Scenario 3. try_to_free_pages() makes very long latency. why?
> >
> > This is available in the mm tracepoints, they all include timestamps.
>
> perhaps, no.
> Administrator need to know the reason. not accumulated time. it's the result.
>
> We can guess some reason
> - IO congestion

This can be seen when the number of page scans is significantly greater
than the number of page frees and pageouts. Do you think we need to
combine these tracepoints or add one to throttle_vm_writeout() when it
needs to stall?
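
As a rough illustration of that comparison, a post-processing tool could
total the counts over some interval and flag intervals where scanning far
outruns progress. This is only a sketch: struct reclaim_counts, the function
name and the threshold factor are all hypothetical, and nothing here is part
of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-interval totals gathered from the reclaim tracepoints. */
struct reclaim_counts {
	unsigned long scanned;	/* pages scanned */
	unsigned long freed;	/* pages freed */
	unsigned long pageouts;	/* pages queued for writeout */
};

/*
 * Returns true when at least `factor` pages were scanned for every page
 * freed or paged out. The factor is an arbitrary tuning knob.
 */
bool reclaim_looks_congested(const struct reclaim_counts *c, unsigned long factor)
{
	unsigned long progress;

	assert(c && factor > 0);
	if (c->scanned == 0)
		return false;	/* nothing scanned, nothing to judge */
	progress = c->freed + c->pageouts;
	if (progress == 0)
		return true;	/* scanning with no progress at all */
	return c->scanned / progress >= factor;
}
```

A high ratio alone does not prove IO congestion; it only marks intervals
worth a closer look.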

> - memory eating speed is fast than reclaim speed

The anonymous and filemapped tracepoints combined with the reclaim
tracepoints will tell us this, do you think we need more tracepoints to
pinpoint when allocations outpace reclamations?

> - memory fragmentation

Would adding the order to the page_allocation tracepoint satisfy this?
Currently this tracepoint only triggers when the allocation fails and we
need to reclaim memory. Another option would be to include the order
information in the direct reclaim tracepoint so we can tell if it was
triggered due to memory fragmentation. Sorry but I haven't seen many
cases in which fragmented memory caused failures.

>
> but it's only guess. we often need to get data.
>
>
> > > Scenario 4. sar output that free memory dramatically reduced at 10 minute ago, and
> > > it already recover now. What's happen?
> >
> > Is this really important? It would take buffering lots of data to
> > figure out what happened in the past.
>
> ok, my scenario description is a bit wrong.
>
> if userland process explicitly consume memory or explicitely write
> many data, it is true.
>
> Is this more appropriate?
>
> "userland process take the same action periodically, but only 10 minute ago
> free memory reduced, why?"
>
We could have a user space script that enabled specific tracepoints only
when it noticed something like the free pages fell below some threshold
and disabled it when free pages climbed back up above some other
threshold. Would this help?
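
The two-threshold idea can be sketched as a small hysteresis gate. This is
only an illustration under assumptions: struct trace_gate and
trace_gate_update() are hypothetical names, the free-memory figure is
assumed to be sampled by the caller (for example from /proc/meminfo), and
the actual switching of tracepoints through the tracing control files is
left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical gate state for a user space monitoring tool. */
struct trace_gate {
	unsigned long low_kb;	/* enable tracing when free memory drops below this */
	unsigned long high_kb;	/* disable tracing when free memory climbs above this */
	bool enabled;		/* current state of the tracepoints */
};

/*
 * Feed one free-memory sample into the gate.
 * Returns true when the state changed, i.e. when the caller should
 * actually enable or disable the tracepoints.
 */
bool trace_gate_update(struct trace_gate *gate, unsigned long free_kb)
{
	assert(gate && gate->low_kb < gate->high_kb);
	if (!gate->enabled && free_kb < gate->low_kb) {
		gate->enabled = true;
		return true;
	}
	if (gate->enabled && free_kb > gate->high_kb) {
		gate->enabled = false;
		return true;
	}
	return false;	/* inside the band, or already in the right state */
}
```

Using two thresholds instead of one keeps the tracepoints from flapping on
and off while free memory hovers around a single limit.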

>
>
> > > - suspects
> > > - kernel memory leak
> >
> > Other than direct callers to the page allocator isnt that covered with
> > the kmemtrace stuff?
>
> Yeah.
> perhaps, kmemtrace enhance to cover page allocator is good approach.
>
>
> > > - userland memory leak
> >
> > The mm tracepoints track all user space allocations and frees(perhaps
> > too many?).
>
> hmhm.

Is this a yes? Would the user space script described above help?

>
>
> >
> > > - stupid driver use too much memory
> >
> > hopefully kmemtrace will catch this?
>
> ditto.
> I agree with kmemtrace enhancement is good idea.
>
> >
> > > - userland application suddenly start to use much memory
> >
> > The mm tracepoints track all user space allocations and frees.
>
> ok.
>
>
> > > - what information are valuable?
> > > - slab usage information (kmemtrace already does)
> > > - page allocator usage information
> > > - rss of all processes at oom happend
> > > - why recent try_to_free_pages() can't reclaim any page?
> >
> > The counters in the mm tracepoints do give counts but not the reasons
> > that the pagereclaim code fails.
>
> That's very important key point. please don't ignore.

OK, would you suggest changing the code to count failures or simply
adding a tracepoint to the failure path which would potentially capture
lots more data?

>
>
>

2009-06-23 05:52:33

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [Patch] mm tracepoints update - use case.

> On Mon, 2009-06-22 at 12:37 +0900, KOSAKI Motohiro wrote:
>
> Thanks for the feedback KOSAKI!
>
>
> > > Larry Woodman wrote:
> > >
> > > >> - Please don't display mm and/or another kernel raw pointer.
> > > >> if we assume non stop system, we can't use kernel-dump. Thus kernel pointer
> > > >> logging is not so useful.
> > > >
> > > > OK, I just dont know how valuable the trace output is with out some raw
> > > > data like the mm_struct.
> > >
> > > I believe that we do want something like the mm_struct in
> > > the trace info, so we can figure out which process was
> > > allocating pages, etc...
> >
> > Yes.
> > I think we need to print tgid, it is needed to imporove CONFIG_MM_OWNER.
> > current CONFIG_MM_OWNER back-pointer point to semi-random task_struct.
>
> All of the tracepoints contain command, pid, CPU and timestamp and
> tracepoint name information. Are you saying I should capture more
> information in specific mm tracepoints like the tgid and if the answer
> is yes, what would we need this for?
>
>
> cat-10962 [005] 1877.984589: mm_anon_fault:
> cat-10962 [005] 1877.984638: mm_anon_fault:
> cat-10962 [005] 1877.984658: sched_switch:
> cat-10962 [005] 1877.988359: sched_switch:

This is sufficient in most cases, but there are a few exceptions.

The ftrace common header logs current->pid, but kswapd steals pages
from other processes. We are interested in the victim process, not the kswapd pid.
(e.g. please see your trace_mm_anon_unmap())


> > > >> - Please consider how do this feature works on mem-cgroup.
> > > >> (IOW, please don't ignore many "if (scanning_global_lru())")
> > >
> > > Good point, we want to trace cgroup vs non-cgroup reclaims,
> > > too.
> >
> > thank you.
>
> All of the mm tracepoints are located above the cgroup specific calls.
> This means that they will capture the same exact data reguardless of
> whether cgroups are used or not. Are you saying I should capture
> whether the data was specific to a cgroup or it was from the global
> LRUs?

Yes and no.

For example, if cgroup reclaim occurs frequently, it means the administrator
set the memory limit wrongly.
But if global reclaim occurs frequently, it means we need to add physical
memory.

I mean, whether a reclaim is cgroup or not is a major piece of information for analysis,
and perhaps the cgroup path name is also useful.
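
To make the analysis concrete, a consumer of the trace could tally reclaim
events by scope and turn the tally into a first hint. This is only a sketch
of the reasoning above: the enums, struct reclaim_tally and both functions
are hypothetical, and the simple majority rule is an assumption, not a
proposal for the patch.

```c
#include <assert.h>

/* Hypothetical scope flag that a reclaim tracepoint would have to record. */
enum reclaim_scope { RECLAIM_GLOBAL, RECLAIM_CGROUP };

enum reclaim_hint {
	HINT_INCONCLUSIVE,
	HINT_CHECK_CGROUP_LIMIT,	/* cgroup reclaim dominates */
	HINT_ADD_MEMORY			/* global reclaim dominates */
};

struct reclaim_tally {
	unsigned long global;
	unsigned long cgroup;
};

/* Count one reclaim event under its scope. */
void reclaim_tally_add(struct reclaim_tally *t, enum reclaim_scope scope)
{
	assert(t);
	if (scope == RECLAIM_CGROUP)
		t->cgroup++;
	else
		t->global++;
}

/* Simple majority rule; a real tool would want rates and thresholds. */
enum reclaim_hint reclaim_tally_hint(const struct reclaim_tally *t)
{
	assert(t);
	if (t->cgroup > t->global)
		return HINT_CHECK_CGROUP_LIMIT;
	if (t->global > t->cgroup)
		return HINT_ADD_MEMORY;
	return HINT_INCONCLUSIVE;
}
```

Without the scope recorded in the event, both cases look identical in the
trace and the two remedies cannot be told apart.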



> > > >> - tracepoint caller shouldn't have any assumption of displaying representation.
> > > >> e.g.
> > > >> wrong) trace_mm_pagereclaim_pgout(mapping, page->index<<PAGE_SHIFT, PageAnon(page));
> > > >> good) trace_mm_pagereclaim_pgout(mapping, page)
> > > >
> > > > OK.
> > > >
> > > >> that's general and good callback and/or hook manner.
> > >
> > > How do we figure those out from the page pointer at the time
> > > the tracepoint triggers?
> > >
> > > I believe that it would be useful to export that info in the
> > > trace point, since we cannot expect the userspace trace tool
> > > to figure out these things from the struct page address.
> > >
> > > Or did I overlook something here?
> >
> > current, TRACE_EVENT have two step information trasformation.
> >
> > - step1 - TP_fast_assign()
> > it is called from tracepoint directly. it makes ring-buffer representaion.
> > - step2 - TP_printk
> > it is called when reading debug/tracing/trace file. it makes printable
> > representation from ring-buffer data.
> >
> > example:
> >
> > trace_sched_switch() has three argument, rq, prev, next.
> >
> > --------------------------------------------------
> > static inline void
> > context_switch(struct rq *rq, struct task_struct *prev,
> > struct task_struct *next)
> > {
> > (snip)
> > trace_sched_switch(rq, prev, next);
> >
> > -------------------------------------------------
> >
> > TP_fast_assing extract data from argument pointer.
> > -----------------------------------------------------
> > TP_fast_assign(
> > memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> > __entry->prev_pid = prev->pid;
> > __entry->prev_prio = prev->prio;
> > __entry->prev_state = prev->state;
> > memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
> > __entry->next_pid = next->pid;
> > __entry->next_prio = next->prio;
> > ),
> > -----------------------------------------------------
> >
> >
> > I think mm tracepoint can do the same way.
>
> The sched_switch tracepoint tells us the name of the outgoing and
> incomming process during a context switch so this information is very
> significant to that tracepoint. What mm tracepoint would I need to add
> such information without it being redundant?

Perhaps I misunderstood what you mean.
I only pointed out that the mm tracepoints can reduce the number of arguments.

I'm not saying to increase or decrease the displayed information.


Maybe my explanation was wrong. My English is very poor, sorry.