2006-11-21 16:06:33

by Mathieu Desnoyers

[permalink] [raw]
Subject: LTTng do_page_fault vs handle_mm_fault instrumentation

Hi Ingo,

I would like to discuss your suggestion of moving the do_page_fault
instrumentation to handle_mm_fault. On one side, it helps removing architecture
dependant instrumentation, but on the other hand :

1- We cannot access the struct pt_regs in all cases (there may be an invalid
current task struct).
2- We cannot distinguish between calls to handle_mm_fault from the page fault
handler or from get_user_pages.
3- Some people complain about not having enough information about the cause of
the page fault (see the forward below).

So instead of staying between my users who ask for those feature and kernel
developers who wish to reduce the intrusiveness of instrumentation (which is a
nice goal : moving the syscall entry/exit instrumentation do do_syscall_trace
has helped simplifying the instrumentation), I prefer to open the discussion
about it.

Ideas/comments are welcome.

Regards,

Mathieu


----- Forwarded message from Sergei Shtylyov <[email protected]> -----

Date: Tue, 21 Nov 2006 18:55:27 +0300
To: [email protected], [email protected]
Cc: [email protected], [email protected]
Organization: MontaVista Software Inc.
User-Agent: KMail/1.5
From: Sergei Shtylyov <[email protected]>
Subject: [PATCH] PowerPC: rearrange/add trap markers in do_page_fault()

Rearrange existing trap markers in PowerPC do_page_fault() to avoid duplicate
trap reporting in a certain case, and add the markers to some code paths that
were missed...

Signed-off-by: Sergei Shtylyov <[email protected]>

---
MIPS needs something along these lines as well. Removing the trap tracepoints
from do_page_fault() was a bad move in the first place...

arch/powerpc/mm/fault.c | 17 +++++++++++------
1 files changed, 11 insertions(+), 6 deletions(-)

Index: linux-2.6-lttng/arch/powerpc/mm/fault.c
===================================================================
--- linux-2.6-lttng.orig/arch/powerpc/mm/fault.c
+++ linux-2.6-lttng/arch/powerpc/mm/fault.c
@@ -199,12 +199,15 @@ int __kprobes do_page_fault(struct pt_re
if (in_atomic() || mm == NULL) {
if (!user_mode(regs))
return SIGSEGV;
+ MARK(kernel_trap_entry, "%ld struct pt_regs %p",
+ regs->trap, regs);
/* in_atomic() in user mode is really bad,
as is current->mm == NULL. */
printk(KERN_EMERG "Page fault in user mode with"
"in_atomic() = %d mm = %p\n", in_atomic(), mm);
printk(KERN_EMERG "NIP = %lx MSR = %lx\n",
regs->nip, regs->msr);
+ MARK(kernel_trap_exit, MARK_NOARGS);
die("Weird page fault", regs, SIGSEGV);
}

@@ -311,6 +314,8 @@ good_area:
if (pte_present(*ptep)) {
struct page *page = pte_page(*ptep);

+ MARK(kernel_trap_entry, "%ld struct pt_regs %p",
+ regs->trap, regs);
if (!test_bit(PG_arch_1, &page->flags)) {
flush_dcache_icache_page(page);
set_bit(PG_arch_1, &page->flags);
@@ -319,6 +324,7 @@ good_area:
_tlbie(address);
pte_unmap_unlock(ptep, ptl);
up_read(&mm->mmap_sem);
+ MARK(kernel_trap_exit, MARK_NOARGS);
return 0;
}
pte_unmap_unlock(ptep, ptl);
@@ -342,30 +348,26 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
+ MARK(kernel_trap_entry, "%ld struct pt_regs %p", regs->trap, regs);
survive:
- MARK(kernel_trap_entry, "%ld struct pt_regs %p",
- regs->trap, regs);
switch (handle_mm_fault(mm, vma, address, is_write)) {

case VM_FAULT_MINOR:
- MARK(kernel_trap_exit, MARK_NOARGS);
current->min_flt++;
break;
case VM_FAULT_MAJOR:
- MARK(kernel_trap_exit, MARK_NOARGS);
current->maj_flt++;
break;
case VM_FAULT_SIGBUS:
- MARK(kernel_trap_exit, MARK_NOARGS);
goto do_sigbus;
case VM_FAULT_OOM:
- MARK(kernel_trap_exit, MARK_NOARGS);
goto out_of_memory;
default:
BUG();
}

up_read(&mm->mmap_sem);
+ MARK(kernel_trap_exit, MARK_NOARGS);
return 0;

bad_area:
@@ -398,6 +400,7 @@ out_of_memory:
goto survive;
}
printk("VM: killing process %s\n", current->comm);
+ MARK(kernel_trap_exit, MARK_NOARGS);
if (user_mode(regs))
do_exit(SIGKILL);
return SIGKILL;
@@ -410,8 +413,10 @@ do_sigbus:
info.si_code = BUS_ADRERR;
info.si_addr = (void __user *)address;
force_sig_info(SIGBUS, &info, current);
+ MARK(kernel_trap_exit, MARK_NOARGS);
return 0;
}
+ MARK(kernel_trap_exit, MARK_NOARGS);
return SIGBUS;
}



----- End forwarded message -----
OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68


2006-11-21 16:24:33

by Sergei Shtylyov

[permalink] [raw]
Subject: Re: LTTng do_page_fault vs handle_mm_fault instrumentation

Hello.

Mathieu Desnoyers wrote:

> I would like to discuss your suggestion of moving the do_page_fault
> instrumentation to handle_mm_fault. On one side, it helps removing architecture
> dependant instrumentation, but on the other hand :

> 1- We cannot access the struct pt_regs in all cases (there may be an invalid
> current task struct).
> 2- We cannot distinguish between calls to handle_mm_fault from the page fault
> handler or from get_user_pages.
> 3- Some people complain about not having enough information about the cause of
> the page fault (see the forward below).
>
> So instead of staying between my users who ask for those feature and kernel
> developers who wish to reduce the intrusiveness of instrumentation (which is a
> nice goal : moving the syscall entry/exit instrumentation do do_syscall_trace
> has helped simplifying the instrumentation), I prefer to open the discussion
> about it.

It seems I've missed the whole story behind this move.
For me, it was more a question of consistency: if we're trying to trace
all trap handlers, why not page fault one? So, I just wanted my old LTT
tracepoints back. :-)

> Ideas/comments are welcome.

> Regards,

> Mathieu

WBR, Sergei

2006-11-21 17:03:22

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: LTTng do_page_fault vs handle_mm_fault instrumentation

* Sergei Shtylyov ([email protected]) wrote:
> Hello.
>
> Mathieu Desnoyers wrote:
>
> >I would like to discuss your suggestion of moving the do_page_fault
> >instrumentation to handle_mm_fault. On one side, it helps removing
> >architecture
> >dependant instrumentation, but on the other hand :
>
> >1- We cannot access the struct pt_regs in all cases (there may be an
> >invalid
> > current task struct).
> >2- We cannot distinguish between calls to handle_mm_fault from the page
> >fault
> > handler or from get_user_pages.
> >3- Some people complain about not having enough information about the
> >cause of
> > the page fault (see the forward below).
> >
> >So instead of staying between my users who ask for those feature and kernel
> >developers who wish to reduce the intrusiveness of instrumentation (which
> >is a
> >nice goal : moving the syscall entry/exit instrumentation do
> >do_syscall_trace
> >has helped simplifying the instrumentation), I prefer to open the
> >discussion
> >about it.
>
> It seems I've missed the whole story behind this move.
> For me, it was more a question of consistency: if we're trying to trace
> all trap handlers, why not page fault one? So, I just wanted my old LTT
> tracepoints back. :-)
>

This topic brings the question about how near must be the instrumentation from
the hardware events.

We have to take into account that the tracer uses the hardware to trace :
mainly, it writes an event into vmalloc'd buffer, which will generate vmalloc
faults on some architectures. Therefore, reentrancy with the "hardware" events
becomes part of the problem. Because of this, we can't simply instrument the
page fault handler at the beginning and end, as it would end up doing
infinite recursive calls.

I see two trends : on one side, the developers and people interested into
measuring performance want instrumentation as close to the hardware events as
possible. Some wants to know the exact execution flow, including the error
conditions in the trap handlers. People doing performance measurements wants
to account correctly the amount of time spent in those handlers.

The other side is the kernel developer who does not want to clutter the code
with too much instrumentation.

Instrumentation around the handle_mm_fault handler call, inside do_page_fault,
looked to me as a good compromise : it can access the struct pt_regs, it will
never be called from either a vmalloc fault or an erroneous page fault caused by
the tracer itself (which of course, never happens, but who knows...). It won't,
however, give information about some error paths in the page fault handler,
mainly related to kernel faults. It is also a little farther from the page
fault handler "real" entry and exit points, but I consider it a minor impact
compared to the cost of entering the trap on currently existing architectures.
Note that I plan to create a "calibration" module someday so we could know how
much time to add to the duration of the page faults, knowing the time required
to enter in and return from the fault.

Mathieu

> >Ideas/comments are welcome.
>
> >Regards,
>
> >Mathieu
>
> WBR, Sergei
>
OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2006-11-21 17:10:29

by Sergei Shtylyov

[permalink] [raw]
Subject: Re: LTTng do_page_fault vs handle_mm_fault instrumentation

Hello.

Mathieu Desnoyers wrote:

> Instrumentation around the handle_mm_fault handler call, inside do_page_fault,
> looked to me as a good compromise : it can access the struct pt_regs, it will
> never be called from either a vmalloc fault or an erroneous page fault caused by
> the tracer itself (which of course, never happens, but who knows...). It won't,
> however, give information about some error paths in the page fault handler,
> mainly related to kernel faults. It is also a little farther from the page
> fault handler "real" entry and exit points, but I consider it a minor impact
> compared to the cost of entering the trap on currently existing architectures.

I kept this approach mostly (note it would have been hard to change it due
to this particular handler's structure itself) but just cleaned it up. If you
consider that adding more markers was a bit too much, I can remove them. :-)

WBR, Sergei