2013-05-09 06:08:34

by Francis Deslauriers

Subject: [page fault tracepoint 1/2] Add page fault trace event definitions

Add page_fault_entry and page_fault_exit event definitions. It will
allow each architecture to instrument their page faults.

Signed-off-by: Francis Deslauriers <[email protected]>
Reviewed-by: Raphaël Beamonte <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
---
include/trace/events/fault.h | 51 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
create mode 100644 include/trace/events/fault.h

diff --git a/include/trace/events/fault.h b/include/trace/events/fault.h
new file mode 100644
index 0000000..522ddee
--- /dev/null
+++ b/include/trace/events/fault.h
@@ -0,0 +1,51 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM fault
+
+#if !defined(_TRACE_FAULT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_FAULT_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(page_fault_entry,
+
+ TP_PROTO(struct pt_regs *regs, unsigned long address,
+ int write_access),
+
+ TP_ARGS(regs, address, write_access),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, ip )
+ __field( unsigned long, addr )
+ __field( uint8_t, write )
+ ),
+
+ TP_fast_assign(
+ __entry->ip = regs ? instruction_pointer(regs) : 0UL;
+ __entry->addr = address;
+ __entry->write = !!write_access;
+ ),
+
+ TP_printk("ip=%lu addr=%lu write_access=%d",
+ __entry->ip, __entry->addr, __entry->write)
+);
+
+TRACE_EVENT(page_fault_exit,
+
+ TP_PROTO(int result),
+
+ TP_ARGS(result),
+
+ TP_STRUCT__entry(
+ __field( int, res )
+ ),
+
+ TP_fast_assign(
+ __entry->res = result;
+ ),
+
+ TP_printk("result=%d", __entry->res)
+);
+
+#endif /* _TRACE_FAULT_H */
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--
1.7.10.4
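
The TP_printk format of page_fault_entry above is simple enough to read back from user space. The following is only a minimal sketch of that idea, not part of the patch: the struct and function names (pf_entry_record, pf_entry_parse) are hypothetical, and the function assumes it is handed just the event payload, with the task/CPU/timestamp prefix that the trace output normally carries already stripped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical user-space mirror of the page_fault_entry fields. */
struct pf_entry_record {
	unsigned long ip;	/* instruction pointer, 0 if regs was NULL */
	unsigned long addr;	/* faulting address */
	unsigned int write;	/* 1 for a write access, 0 otherwise */
};

/*
 * Parse the payload produced by
 *   TP_printk("ip=%lu addr=%lu write_access=%d", ...)
 * Returns true only if all three fields were converted.
 */
bool pf_entry_parse(const char *text, struct pf_entry_record *rec)
{
	assert(text != NULL && rec != NULL);
	return sscanf(text, "ip=%lu addr=%lu write_access=%u",
		      &rec->ip, &rec->addr, &rec->write) == 3;
}
```

A caller would pass each payload string to pf_entry_parse() and skip any line for which it returns false, such as a page_fault_exit line.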


2013-05-09 06:06:36

by Francis Deslauriers

Subject: [page fault tracepoint 2/2] x86: Instrument page faults with trace events

Signed-off-by: Francis Deslauriers <[email protected]>
Reviewed-by: Raphaël Beamonte <[email protected]>
---
arch/x86/mm/fault.c | 11 +++++++++++
mm/memory.c | 5 +++++
2 files changed, 16 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 654be4a..e227828 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -20,6 +20,9 @@
#include <asm/kmemcheck.h> /* kmemcheck_*(), ... */
#include <asm/fixmap.h> /* VSYSCALL_START */

+#define CREATE_TRACE_POINTS
+#include <trace/events/fault.h> /* trace_page_fault_*(), ... */
+
/*
* Page fault error code bits:
*
@@ -756,12 +759,18 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,

if (likely(show_unhandled_signals))
show_signal_msg(regs, error_code, address, tsk);
+ trace_page_fault_entry(regs, address, error_code & PF_WRITE);

tsk->thread.cr2 = address;
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_PF;

force_sig_info_fault(SIGSEGV, si_code, address, tsk, 0);
+ /*
+ * Using -1 here, since there is no VM_FAULT flag to identify
+ * user accesses triggering SIGSEGV.
+ */
+ trace_page_fault_exit(-1);

return;
}
@@ -1185,7 +1194,9 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault:
*/
+ trace_page_fault_entry(regs, address, write);
fault = handle_mm_fault(mm, vma, address, flags);
+ trace_page_fault_exit(fault);

if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
if (mm_fault_error(regs, error_code, address, fault))
diff --git a/mm/memory.c b/mm/memory.c
index 6dc1882..0bd86f8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -67,6 +67,8 @@
#include <asm/tlbflush.h>
#include <asm/pgtable.h>

+#include <trace/events/fault.h>
+
#include "internal.h"

#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
@@ -1829,8 +1831,11 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
if (foll_flags & FOLL_NOWAIT)
fault_flags |= (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT);

+ trace_page_fault_entry(NULL, start,
+ foll_flags & FOLL_WRITE);
ret = handle_mm_fault(mm, vma, start,
fault_flags);
+ trace_page_fault_exit(ret);

if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
--
1.7.10.4
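
The result field of page_fault_exit carries either the value returned by handle_mm_fault() or the -1 sentinel used in __bad_area_nosemaphore(). A minimal user-space sketch of decoding it follows. The flag values in the table are an assumption, copied from memory of the VM_FAULT_* definitions in include/linux/mm.h for kernels of this period, and should be checked against the tree actually being traced; the names pf_flag_names and pf_result_describe are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct pf_flag_name {
	int bit;
	const char *name;
};

/* Assumed VM_FAULT_* values; verify against include/linux/mm.h. */
static const struct pf_flag_name pf_flag_names[] = {
	{ 0x0001, "OOM" },
	{ 0x0002, "SIGBUS" },
	{ 0x0004, "MAJOR" },
	{ 0x0008, "WRITE" },
	{ 0x0010, "HWPOISON" },
	{ 0x0020, "HWPOISON_LARGE" },
	{ 0x0100, "NOPAGE" },
	{ 0x0200, "LOCKED" },
	{ 0x0400, "RETRY" },
};

/*
 * Write a '|'-separated description of res into buf and return buf.
 * -1 is reported as "SIGSEGV" and 0 as "0"; bits that are not in the
 * table are ignored, so buf may end up empty.
 */
const char *pf_result_describe(int res, char *buf, size_t len)
{
	size_t used = 0;
	size_t i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	if (res == -1) {	/* must be tested before any bit test */
		snprintf(buf, len, "SIGSEGV");
		return buf;
	}
	if (res == 0) {
		snprintf(buf, len, "0");
		return buf;
	}
	for (i = 0; i < sizeof(pf_flag_names) / sizeof(pf_flag_names[0]); i++) {
		int n;

		if (!(res & pf_flag_names[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", pf_flag_names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output truncated; stop appending */
		used += (size_t)n;
	}
	return buf;
}
```

The -1 case is checked first because, as a bit pattern, -1 would otherwise match every flag in the table.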

2013-05-09 06:49:50

by zhangwei(Jovi)

Subject: Re: [page fault tracepoint 1/2] Add page fault trace event definitions

On 2013/5/9 14:05, Francis Deslauriers wrote:
> Add page_fault_entry and page_fault_exit event definitions. It will
> allow each architecture to instrument their page faults.

I'm wondering whether this tracepoint could also handle other page faults,
such as faults in kernel memory (vmalloc, kmmio, etc.).

If we decide to support those faults, adding a type annotation to TP_printk
would be very helpful, letting users know what type of page fault happened.
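
To make the suggestion concrete, here is a minimal user-space sketch of what a type-annotated entry line could look like. Everything in it is hypothetical: the enum pf_kind, its members, and the "type=" prefix are illustrations only and do not exist in the patch. In the kernel itself the mapping from value to string would more likely be done with a helper such as __print_symbolic() inside TP_printk.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical classification of the fault being traced. */
enum pf_kind {
	PF_KIND_USER,
	PF_KIND_VMALLOC,
	PF_KIND_KMMIO,
	PF_KIND_COUNT
};

/* Map a fault kind to a short label, "unknown" if out of range. */
const char *pf_kind_name(enum pf_kind kind)
{
	static const char *const names[PF_KIND_COUNT] = {
		"user", "vmalloc", "kmmio",
	};

	if ((unsigned int)kind >= PF_KIND_COUNT)
		return "unknown";
	return names[kind];
}

/*
 * Format an entry line with the type label placed in front of the
 * fields the patch already prints. Returns the snprintf() result.
 */
int pf_entry_format(char *buf, size_t len, enum pf_kind kind,
		    unsigned long ip, unsigned long addr, int write)
{
	assert(buf != NULL);
	return snprintf(buf, len, "type=%s ip=%lu addr=%lu write_access=%d",
			pf_kind_name(kind), ip, addr, !!write);
}
```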

Thanks.
>
> Signed-off-by: Francis Deslauriers <[email protected]>
> Reviewed-by: Raphaël Beamonte <[email protected]>
> Reviewed-by: Mathieu Desnoyers <[email protected]>
> ---
> include/trace/events/fault.h | 51 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 51 insertions(+)
> create mode 100644 include/trace/events/fault.h
>
> diff --git a/include/trace/events/fault.h b/include/trace/events/fault.h
> new file mode 100644
> index 0000000..522ddee
> --- /dev/null
> +++ b/include/trace/events/fault.h
> @@ -0,0 +1,51 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM fault
> +
> +#if !defined(_TRACE_FAULT_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_FAULT_H
> +
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(page_fault_entry,
> +
> + TP_PROTO(struct pt_regs *regs, unsigned long address,
> + int write_access),
> +
> + TP_ARGS(regs, address, write_access),
> +
> + TP_STRUCT__entry(
> + __field( unsigned long, ip )
> + __field( unsigned long, addr )
> + __field( uint8_t, write )
> + ),
> +
> + TP_fast_assign(
> + __entry->ip = regs ? instruction_pointer(regs) : 0UL;
> + __entry->addr = address;
> + __entry->write = !!write_access;
> + ),
> +
> + TP_printk("ip=%lu addr=%lu write_access=%d",
> + __entry->ip, __entry->addr, __entry->write)
> +);
> +
> +TRACE_EVENT(page_fault_exit,
> +
> + TP_PROTO(int result),
> +
> + TP_ARGS(result),
> +
> + TP_STRUCT__entry(
> + __field( int, res )
> + ),
> +
> + TP_fast_assign(
> + __entry->res = result;
> + ),
> +
> + TP_printk("result=%d", __entry->res)
> +);
> +
> +#endif /* _TRACE_FAULT_H */
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
>

2013-05-09 13:49:30

by H. Peter Anvin

Subject: Re: [page fault tracepoint 1/2] Add page fault trace event definitions

On 05/08/2013 11:46 PM, zhangwei(Jovi) wrote:
> On 2013/5/9 14:05, Francis Deslauriers wrote:
>> Add page_fault_entry and page_fault_exit event definitions. It will
>> allow each architecture to instrument their page faults.
>
> I'm wondering whether this tracepoint could also handle other page faults,
> such as faults in kernel memory (vmalloc, kmmio, etc.).
>
> If we decide to support those faults, adding a type annotation to TP_printk
> would be very helpful, letting users know what type of page fault happened.
>

The plan for x86 was to switch the IDT so that any exception could get a
trace event without any overhead in normal operation. This has been in
progress for quite some time, but it looks like it is getting very close.

-hpa

2013-05-13 11:21:39

by Mathieu Desnoyers

Subject: Re: [page fault tracepoint 1/2] Add page fault trace event definitions

* H. Peter Anvin ([email protected]) wrote:
> On 05/08/2013 11:46 PM, zhangwei(Jovi) wrote:
> > On 2013/5/9 14:05, Francis Deslauriers wrote:
> >> Add page_fault_entry and page_fault_exit event definitions. It will
> >> allow each architecture to instrument their page faults.
> >
> > I'm wondering whether this tracepoint could also handle other page faults,
> > such as faults in kernel memory (vmalloc, kmmio, etc.).
> >
> > If we decide to support those faults, adding a type annotation to TP_printk
> > would be very helpful, letting users know what type of page fault happened.
> >
>
> The plan for x86 was to switch the IDT so that any exception could get a
> trace event without any overhead in normal operation. This has been in
> progress for quite some time, but it looks like it is getting very close.

Hi Peter,

Who is leading this IDT instrumentation effort?

Since we have tracepoints in interrupt handlers nowadays, I wonder what
makes traps so much more special than interrupts that they require the
arch-specific complexity of the IDT switcheroo trick. If I had to
guess, the reason would be the page fault handler, which is
called way too frequently for its own good. The number of page faults
triggered by COW on process fork has been impressively high for the past
couple of years.

IMHO, this should be one extra reason for quickly allowing people to
trace those page faults, so they can get an idea of their tremendous
performance impact. This could speed up the efforts on transparent huge
pages, which seems to be a viable long-term solution to this page-size
scalability issue.

By default, my 3.5 Linux kernel (Debian) has:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
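
The sysfs file lists every available mode and marks the active one with square brackets. A minimal sketch of extracting that active mode from such a line is shown below; the function name thp_selected_mode is hypothetical, and the code assumes the line has already been read into memory (for example with fopen() and fgets() on the path above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed word of a line such as "always [madvise] never"
 * into out. Returns false if no "[...]" pair is found or if out is
 * too small to hold the word and its terminating NUL.
 */
bool thp_selected_mode(const char *line, char *out, size_t len)
{
	const char *open;
	const char *close;
	size_t n;

	assert(line != NULL && out != NULL);

	open = strchr(line, '[');
	if (open == NULL)
		return false;
	close = strchr(open + 1, ']');
	if (close == NULL)
		return false;

	n = (size_t)(close - open - 1);
	if (n + 1 > len)
		return false;

	memcpy(out, open + 1, n);
	out[n] = '\0';
	return true;
}
```

For the output quoted above, the function would fill out with "madvise".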

I think transparent huge pages will become generally useful once they are
enabled by default and handle the page cache in addition to
anonymous pages.[1]

Thanks,

Mathieu

[1] Documentation/vm/transhuge.txt

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2013-05-13 15:08:24

by Steven Rostedt

Subject: Re: [page fault tracepoint 1/2] Add page fault trace event definitions

On Mon, 2013-05-13 at 07:21 -0400, Mathieu Desnoyers wrote:
> * H. Peter Anvin ([email protected]) wrote:

> Who is leading this IDT instrumentation effort?
>

Seiji has been doing most of the work. I've just been busy doing other
things but I need to start getting this tidied up, and hopefully this
can get into 3.11.

https://lkml.org/lkml/2013/4/5/401

-- Steve