LinuxLists.cc - RFC - Approaches to user-space probes

2006-03-27 06:54:58

Subject: RFC - Approaches to user-space probes

Hi All,

As Andrew Morton suggested, here is a document on user-space probes
discussing known approaches and design issues.

Please provide your comments and suggestions.

Thanks
Prasanna
----

The basic need is to provide infrastructure for user-space dynamic
instrumentation. As with kprobes, there is no need to recompile the
applications or restart them (under a debugger, for instance) in order
to instrument them.

Some of the use-cases are:

- To find memory leaks dynamically, just by inserting probes on the
malloc and free library routines.
- To identify resource contention bottlenecks.
- To do performance measurements in real time.
- To log and change registers and global data structures.

This document also discusses Christoph's suggested approach.

Method Used:

1. Using breakpoint instruction and executing the instrumentation
code from within the breakpoint handler in interrupt context.

The advantages of this approach are listed below

- A single tool providing data capture in a consistent manner
eases the problems of correlation of events across multiple tools
(for kernel and user space)
- The dynamic aspect allows ad-hoc probepoints to be inserted where
no existing instrumentation is provided (emergency debug scenario
for example).
- Low overhead: the user can have thousands of active probes on the
system and detect every instance of a probe being hit, including
probes on shared libraries etc.

Design Issues:

==============================
BREAKPOINT VS JUMP INSTRUCTION
==============================

- The breakpoint instruction is the smallest instruction, so it can
replace any other instruction, and with less overhead (for details,
please refer to the issues discussed with methods 1 and 2 below).
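
To make the size argument concrete, here is a minimal sketch in Python
(a model only, not kernel code). It assumes x86, where the int3
breakpoint is the single byte 0xCC and a near jump with a 32-bit
displacement is 5 bytes. The helper name bytes_clobbered is made up for
this illustration.

```python
# Assumption: x86 encodings. int3 is one byte; "jmp rel32" is opcode 0xE9
# followed by a 4-byte displacement, so 5 bytes in total.
INT3 = b"\xcc"
JMP_REL32_LEN = 5


def bytes_clobbered(patch_len, insn_len):
    """Return how many bytes of the *following* instructions a patch of
    patch_len bytes overwrites when written over an instruction of
    insn_len bytes. Zero means the patch fits inside the instruction."""
    return max(0, patch_len - insn_len)


if __name__ == "__main__":
    # A one-byte breakpoint fits over even the shortest instruction.
    print(bytes_clobbered(len(INT3), 1))
    # A 5-byte jump written over a 2-byte instruction spills into what follows.
    print(bytes_clobbered(JMP_REL32_LEN, 2))
```

A one-byte patch never spills into the next instruction, which is the
property the breakpoint approach relies on.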

===========================
UNIQUE PROBE IDENTIFICATION
===========================

- Probes are tracked by an (inode, offset) tuple rather than by
virtual address, so that they can be shared across all processes
mapping the executable/library, even at different virtual addresses.
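
The idea can be sketched in user space as follows. This is a Python
model under stated assumptions, not the data structure used by the
patch: the class ProbeTable and its methods are hypothetical, and the
key also carries the device number (st_dev), since inode numbers are
only unique within one filesystem.

```python
import os
from collections import defaultdict


class ProbeTable:
    """Hypothetical model: probes keyed by (device, inode, file offset)."""

    def __init__(self):
        self._probes = defaultdict(list)

    def register(self, path, offset, handler):
        """Register a handler for the byte at 'offset' in the file at 'path'."""
        st = os.stat(path)
        key = (st.st_dev, st.st_ino, offset)
        self._probes[key].append(handler)
        return key

    def lookup(self, dev, ino, map_start, map_file_offset, vaddr):
        """Translate a virtual address inside one mapping of the file back
        to a file offset and return the handlers registered there."""
        offset = vaddr - map_start + map_file_offset
        return self._probes.get((dev, ino, offset), [])
```

Two processes that map the same file at different base addresses
translate their own virtual addresses to the same file offset, so both
find the same probe without the table knowing anything about either
process.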

===========================================================
LOCAL PROBES(PER PROCESS) VS GLOBAL PROBES(EXECUTABLE FILE)
===========================================================

- All processes take a trap, since the same executable file
gets mapped into different address_spaces.

- Compare this with ptrace breakpoints (hence strace and gdb), where
tracepoints and breakpoints are localized to a specified set of
processes. To support local probes, the text pages are privatized
for that process. Global user-probes do not have the side effects
(privatization of pages) that ptrace has.

- Global probes do not require the executable pages to be present
in memory just to place a probe on them (hence zero overhead for
probes which are very unlikely to be hit).

- Global probes do not add restrictions on evicting a page with a
probe on it from memory.

- Global probes do not require pages to be marked copy-on-write.

- Global probes are visible even across fork() syscalls.

- In the case of global probes, per-process instrumentation data can
still be obtained easily by logging & filtering based on pid/process
name.
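
As a rough illustration of the last point, the following Python sketch
filters a log of probe hits after the fact. The record layout (a dict
with "pid", "comm" and "offset" keys) and both function names are
assumptions made for this example; the patch does not define a log
format here.

```python
from collections import Counter


def filter_hits(records, pid=None, comm=None):
    """Keep only the hit records matching the given pid and/or process name.
    A criterion left as None is not applied."""
    return [
        r for r in records
        if (pid is None or r["pid"] == pid)
        and (comm is None or r["comm"] == comm)
    ]


def hits_per_pid(records):
    """Count probe hits per pid."""
    return Counter(r["pid"] for r in records)


if __name__ == "__main__":
    # Hypothetical log of hits on one global probe.
    log = [
        {"pid": 101, "comm": "httpd", "offset": 0x1040},
        {"pid": 202, "comm": "cron", "offset": 0x1040},
        {"pid": 101, "comm": "httpd", "offset": 0x1040},
    ]
    print(filter_hits(log, comm="httpd"))
    print(hits_per_pid(log))
```

The probe itself stays global; only the reporting is narrowed to the
processes of interest.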

=====================================
PROBES ON EXECUTABLE MAPPED WRITEABLE
=====================================

- Probes can be inserted into all the vmas that map the same
executable.

===================================
PROBES ON YET TO START APPLICATIONS
===================================

- User probes also support registering probepoints before the
probed code is loaded. This clearly has advantages for catching
initialization problems. It involves modifying the readpage() and
readpages() pointers of the probed application's address_space.
The overhead of changing the address_space readpage/s() pointers
is limited to the probed application, and lasts only until all
probes are removed from that application.
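
What such a read hook has to do can be modelled in a few lines of
Python. This is a user-space sketch only: the function patch_page, the
4096-byte page size and the use of the x86 int3 byte are assumptions
for illustration, not the prototype's actual readpage() wrapper.

```python
PAGE_SIZE = 4096  # assumed page size
INT3 = 0xCC       # x86 breakpoint byte, assumed architecture


def patch_page(page_index, data, probe_offsets, page_size=PAGE_SIZE):
    """Model of a read hook: given the bytes of one freshly read page,
    plant a breakpoint at every registered file offset that falls inside
    that page. Returns (patched_bytes, saved), where 'saved' maps each
    patched file offset to the original byte so it can be restored."""
    patched = bytearray(data)
    saved = {}
    base = page_index * page_size
    for off in probe_offsets:
        rel = off - base
        if 0 <= rel < len(patched):
            saved[off] = patched[rel]
            patched[rel] = INT3
    return bytes(patched), saved
```

Because the patching happens whenever a page is read in, a probe can be
registered before the code is ever loaded, and it is re-applied if the
page is evicted and read again.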

=========================================
NEED FOR A KERNEL MODULE TO INSERT PROBES
=========================================

1. Probes can be applied on a system-wide basis.
2. Low overhead of executing the handler in kernel mode.
3. Executing the handler in user mode requires an additional
application / daemon to share its address_space, containing the
instrumentation code, with the probed application.

===========
LIMITATIONS
===========

1. Probes are visible if a copy of the probed executable is made
while probes are applied.

2. Can only dump the data present in memory when the probe was
hit.

3. Can only run the handler in kernel mode.

4. Debuggers and probes cannot coexist at the same "address", even
though they can have breakpoints elsewhere in the same executable
mapped in memory.

An initial prototype of the above approach has been posted on lkml.
http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/1186.html

Some issues were pointed out during review and those will be fixed
based on the design consensus.

Other possible approaches which were looked at:

1. Attaching or loading the application into a trace tool.

In this method the user application must be loaded into a trace tool,
or the trace tool attached to an already running application. Before
instrumenting an application, the user must decide what that
instrumentation will consist of. Dynaprof uses such a mechanism.

http://www.dyninst.org/tools.html

2. Using a "jump" instruction to a trampoline and trampoline executing
the instrumented code in user-space.

Eg: Paradyn tool. (http://www.paradyn.org/ and
http://www.paradyn.org/tracetool.html)

Issues with methods 1 and 2 are:

- Induces Intel erratum E49 where the other processors see stale data
while one processor replaces the jump instruction.
- Instruction can only be replaced atomically if the size of the jump
instruction is greater than or equal to the original instruction.
- Other processors need to be stopped if the "jump" instruction size
is less than the original instruction.

3. Christoph's approach of providing a ptrace-like syscall interface
to insert/remove probes

I'd like to ask Christoph for more details on the approach.

Questions with this approach are:

1. Should this support per-process probes or per-executable-file
probes?
2. Should the handler be executed in kernel mode or user mode?
3. If kernel mode, how are the handlers inserted into the kernel?
4. If user mode, where should the handler exist?
5. If user mode, should it follow the ptrace way of giving control to
the handler?

Some of these questions may well be answered once more details of
this approach are worked out.

Limitations:

1. Large memory overhead if a per-process copy of text pages is made.

2. Ptrace has the overhead of making a syscall for each probe hit to
access/modify the data.

Ptrace already allows the user to access and modify data from
user mode.

=====
TODO:
=====
- evaluate suggestions about approach
- update the existing patchset based on the comments received
or work on approach agreed upon.
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-51776329

2006-03-27 07:38:03

by Arjan van de Ven

Subject: Re: RFC - Approaches to user-space probes

On Mon, 2006-03-27 at 12:24 +0530, Prasanna S Panchamukhi wrote:
> Hi All,
>
> As Andrew Morton suggested, here is a document on user-space probes
> discussing known approaches and design issues.
>
> Please provide your comments and suggestions.
>
> Thanks
> Prasanna
> ----
>
> The basic need is to provide infrastructure for user-space dynamic
> instrumentation. As with kprobes, there is no need to recompile or
> restart the applications for instrumentation, under a debugger for
> instance.
>
> Some of the use-cases are:
>
> - To find out the memory leaks dynamically just by inserting probes on
> malloc and free library routines.

for that you do need to do that from the start of an application, at
which point perfectly good tooling exists already; leak tracking without
full state is, well, not something that'll work too well I suspect.
Also I don't see really why this needs kernel help :)

> - Low overhead and user can have thousands of active probes on the
> system and detect any instance when the probe was hit including
> probes on shared library etc.

I suspect this is the only reason for doing it inside the kernel;
anything else still really shouts "do it in userspace via ptrace" to me.

> ===========================================================
> LOCAL PROBES(PER PROCESS) VS GLOBAL PROBES(EXECUTABLE FILE)
> ===========================================================
>
> - All processes take a trap since the same executable file
> gets mapped into different address_space.

is that true for breakpoints inserted after start?
The reason I ask is... what if half the processes took a COW on the
page with the instruction you want to trap on. Are you going to edit all
those COW'd pages?

Also you no longer have the option to only do it on a selected subset of
processes (eg the workload vs the system)

> - Compare this with ptrace breakpoints (hence strace and gdb) where
> tracepoints and breakpoints are localized to a specified set of
> processes. To support local probes the text pages are privatized
> for that process. Global user-probes does not have the side effects
> (privatization of pages) that ptrace has.

No instead it gets to "deal" with that already having happened ;)

Overall I see only one possible reason to do this in the kernel:
performance. Anything else really suggests that a userspace approach is
more than reasonable to me. (It might not always be super easy, but
on the other hand you gain a lot back by doing that; for example, you
have a lot better backtrace and debug information there, so that you can
do far more advanced probes like "probe only if the caller is ..." etc.)

2006-03-27 10:00:04

by S. P. Prasanna

Subject: Re: RFC - Approaches to user-space probes

On Mon, Mar 27, 2006 at 09:37:48AM +0200, Arjan van de Ven wrote:
> On Mon, 2006-03-27 at 12:24 +0530, Prasanna S Panchamukhi wrote:
>
> > - Low overhead and user can have thousands of active probes on the
> > system and detect any instance when the probe was hit including
> > probes on shared library etc.
>
> I suspect this is the only reason for doing it inside the kernel;
> anything else still really shouts "do it in userspace via ptrace" to me.
>

Other reasons would be:

- to view some privileged data, such as system regs, while you are
debugging in user-space
- to view many arbitrary process address spaces that use a common set
of modules - user or kernel space

>
> > ===========================================================
> > LOCAL PROBES(PER PROCESS) VS GLOBAL PROBES(EXECUTABLE FILE)
> > ===========================================================
> >
> > - All processes take a trap since the same executable file
> > gets mapped into different address_space.
>
> is that true for breakpoints inserted after start?

Yes, insertion of the breakpoint happens at the physical
page level and it gets written back to the disc.

> The reason I ask is... what if half the processes took a COW on the
> page with the instruction you want to trap on. Are you going to edit all
> those COW'd pages?

The current prototype does not insert probes on COW pages, but yes, eventually
we will support inserting probes on COW'd pages too.

Thanks
Prasanna
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-51776329

2006-03-27 20:03:44

by Arjan van de Ven

Subject: Re: RFC - Approaches to user-space probes

On Mon, 2006-03-27 at 15:30 +0530, Prasanna S Panchamukhi wrote:
> On Mon, Mar 27, 2006 at 09:37:48AM +0200, Arjan van de Ven wrote:
> > On Mon, 2006-03-27 at 12:24 +0530, Prasanna S Panchamukhi wrote:
> >
> > > - Low overhead and user can have thousands of active probes on the
> > > system and detect any instance when the probe was hit including
> > > probes on shared library etc.
> >
> > I suspect this is the only reason for doing it inside the kernel;
> > anything else still really shouts "do it in userspace via ptrace" to me.
> >
>
> Other reasons would be:
>
> - to view some privileged data, such as system regs while you are
> debugging in user-space

root can do that anyway afaics

> - to view many arbitrary process address-space that use a common set
> of modules - user or kernel space

that's just a matter of userspace tooling.

> Yes, insertion of the breakpoint happens at the physical
> page level and it gets written back to the disc.

at which point you get to deal with tripwire and other intrusion
detection systems.... and you prevent doing this on binaries residing on
read-only mounts (which isn't as uncommon as it sounds, read only
shared /usr is quite common in enterprise)

2006-03-28 09:42:57

by Richard J Moore

Subject: Re: RFC - Approaches to user-space probes

I think the example of probes on malloc is possibly not a good one. And I
think also that one has to accept that there can be no tool that will be a
panacea for all problems. So, for me it comes down to whether one should
make life easier when dealing with:

1) a class of problems that are more easily tackled using a global or
module-orientated probing mechanism

Consider a server process to which work is queued asynchronously via a
shared library.
Here the culprit process or processes might well not be known at the start
of diagnosis.

2) system problems that have side effects in user-space.

Because probe-handlers operate in kernel space, access to all kernel and
CPU state data is available when the probe fires as well as user space
data.
I sometimes hear the argument that because user-space can't or shouldn't
cause the kernel to die then there's no need to support system-wide
debugging tools.
To me that's limiting the problem space to kernel crashes, but more
significantly doesn't permit the existence of a problem where user-space
could significantly affect the kernel in an adverse way. The converse
however is always true - user-space can easily be adversely affected by
kernel space - a rogue driver for example.

3) User problems that have side-effects in the kernel or device interface.
Problems that are IO related come to mind: correlation of user events and
user data with driver device events and data is more easily handled through
a tool that can easily get at all these items of data.

It will always be the case that there are harder ways to tackle global
problems using process-orientated tools and conversely simpler ways to
tackle process-specific problems than using system-orientated tools. But I
do think that ptrace won't scale to a global or system-wide level without
switching its internal mechanism to something along the lines of kprobes.
There are two reasons I can think of:

1) ptrace is orientated to debugging a specific process tree and a
nominated debug process. Having it operate on arbitrary processes would
require kernel extensions to achieve that but would also have a major
impact on performance if each event were to result in a context switch to
the debugger process.

2) ptrace operates by privatizing memory via COW, but kprobes doesn't. The
probes are fixed-up when a page is brought into memory by using an alias
r/w virtual address. Using the existing ptrace mechanism across all, or
most, processes could have a significant effect on swapfile and paging
rate. And that has to be bad news when investigating performance and
race-condition problems.

If we want to make life easier for debugging the types of problems
indicated above, then it seems very reasonable to ask whether ptrace can
be extended to use the (user) kprobes mechanism.

Richard J Moore

Arjan van de Ven <[email protected]> wrote on 27/03/2006 20:03:14:

> On Mon, 2006-03-27 at 15:30 +0530, Prasanna S Panchamukhi wrote:
> > On Mon, Mar 27, 2006 at 09:37:48AM +0200, Arjan van de Ven wrote:
> > > On Mon, 2006-03-27 at 12:24 +0530, Prasanna S Panchamukhi wrote:
> > >
> > > > - Low overhead and user can have thousands of active probes on the
> > > > system and detect any instance when the probe was hit including
> > > > probes on shared library etc.
> > >
> > > I suspect this is the only reason for doing it inside the kernel;
> > > > anything else still really shouts "do it in userspace via ptrace" to me.
> > >
> >
> > Other reasons would be:
> >
> > > - to view some privileged data, such as system regs while you are
> > debugging in user-space
>
> root can do that anyway afaics
>
> > - to view many arbitrary process address-space that use a common set
> > of modules - user or kernel space
>
> that's just a matter of userspace tooling.
>
> > Yes, insertion of the breakpoint happens at the physical
> > page level and it gets written back to the disc.
>
> at which point you get to deal with tripwire and other intrusion
> detection systems.... and you prevent doing this on binaries residing on
> read-only mounts (which isn't as uncommon as it sounds, read only
> shared /usr is quite common in enterprise)
>

2006-03-28 09:48:20

by Andi Kleen

Subject: Re: RFC - Approaches to user-space probes

On Tuesday 28 March 2006 11:42, Richard J Moore wrote:

> 1) ptrace is orientated to debugging a specific process tree and a
> nominated debug process. Having it operate on arbitrary process would
> require kernel extensions to achieve that but would also have a major
> impact on performance if each event were to result in a context switch to
> the debugger process.

You can attach it to any pid

The problem is just finding new processes when they are created. And when
you trace all it will be quite inefficient.

>
> 2) ptrace operates by privatizing memory via COW, but kprobes doesn't. The
> probes are fixed-up when a page is brought into memory by using an alias
> r/w virtual address. Using existing the ptrace mechanism across all, or
> most, processes could have a significant affect on swapfile and paging
> rate. And that has to be bad news when investigating performance and race
> conditions problems.

The problem with ptrace is also that it is quite heavyweight - you have
to take over all signals at least etc. Some lighter weight probing
mechanism for user space would be probably a good idea.

> If we want to make life easier for debugging the types of problems
> indicated above, then it's seems very reasonable to ask whether ptrace can
> be extended to use the (user) kprobes mechanism.

It's a mystery to me why people hate ioctl and like ptrace.

ptrace already is far too complex and ugly. Clean new calls would be probably
preferable.

-Andi

2006-03-28 10:10:41

by Richard J Moore

Subject: Re: RFC - Approaches to user-space probes

Andi Kleen <[email protected]> wrote on 28/03/2006 09:47:57:

> On Tuesday 28 March 2006 11:42, Richard J Moore wrote:
>
> > 1) ptrace is orientated to debugging a specific process tree and a
> > nominated debug process. Having it operate on arbitrary process would
> > require kernel extensions to achieve that but would also have a major
> > impact on performance if each event were to result in a context switch to
> > the debugger process.
>
> You can attach it to any pid

Yes, of course. But you have to do some work to determine the PIDs,
especially if you want to catch new processes that aren't in the current
set of debuggee process trees.

>
> The problem is just finding new processes when they are created. And when
> you trace all it will be quite inefficient.
>
> >
> > 2) ptrace operates by privatizing memory via COW, but kprobes doesn't. The
> > probes are fixed-up when a page is brought into memory by using an alias
> > r/w virtual address. Using existing the ptrace mechanism across all, or
> > most, processes could have a significant affect on swapfile and paging
> > rate. And that has to be bad news when investigating performance and race
> > conditions problems.
>
> The problem with ptrace is also that it is quite heavyweight - you have
> to take over all signals at least etc. Some lighter weight probing
> mechanism for user space would be probably a good idea.
>
> > If we want to make life easier for debugging the types of problems
> > indicated above, then it's seems very reasonable to ask whether ptrace can
> > be extended to use the (user) kprobes mechanism.
>
> It's a mystery to me why people hate ioctl and like ptrace.

No doubt we'll find out soon ;-)

>
> ptrace already is far too complex and ugly. Clean new calls would be probably
> preferable.
>
> -Andi

2006-03-28 14:54:36

by S. P. Prasanna

Subject: Re: RFC - Approaches to user-space probes

On Mon, Mar 27, 2006 at 10:03:14PM +0200, Arjan van de Ven wrote:
> On Mon, 2006-03-27 at 15:30 +0530, Prasanna S Panchamukhi wrote:
> > On Mon, Mar 27, 2006 at 09:37:48AM +0200, Arjan van de Ven wrote:
> > > On Mon, 2006-03-27 at 12:24 +0530, Prasanna S Panchamukhi wrote:
> > >
> > > > - Low overhead and user can have thousands of active probes on the
> > > > system and detect any instance when the probe was hit including
> > > > probes on shared library etc.
> > >
> > > I suspect this is the only reason for doing it inside the kernel;
> > > anything else still really shouts "do it in userspace via ptrace" to me.
> > >
> >
> > Other reasons would be:
> >
> > > - to view some privileged data, such as system regs while you are
> > debugging in user-space
>
> root can do that anyway afaics
>
> > - to view many arbitrary process address-space that use a common set
> > of modules - user or kernel space
>
> that's just a matter of userspace tooling.
>
> > Yes, insertion of the breakpoint happens at the physical
> > page level and it gets written back to the disc.
>
> at which point you get to deal with tripwire and other intrusion
> detection systems.... and you prevent doing this on binaries residing on
> read-only mounts (which isn't as uncommon as it sounds, read only
> shared /usr is quite common in enterprise)

Arjan,

The probes are inserted at the physical page level, and if the
executable is mmaped private, the probes never get written to the disk.
The physical page mmaped will still have probes in it. If the page in
memory gets discarded due to a low-memory situation, then the probes
will get inserted into that page again when it is read in via the
readpage/s() hooks. The only thing that will be written to the disk is
the corresponding inode, since we increment and decrement its
i_writecount.

But, if we do an objdump on the probed executable, we are seeing the
breakpoints in the disassembly. We are looking into this issue.
Does this mean that breakpoints are written to the disk?

If the executable is mmaped shared, then those mappings will get written
back to the disk.

Writing to the disk is not a requirement for user-space probes; it is
just a side effect, and probes work whether or not they are written to
the disk.

User-space probes work even on "read-only" mounts, since
nothing is written back.
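
The difference between private and shared mappings can be seen from
user space with a small experiment. The sketch below uses Python's mmap
module, where ACCESS_COPY gives a private copy-on-write mapping and
ACCESS_WRITE a shared write-through one. It only models the write-back
behaviour of ordinary mappings and says nothing about how the prototype
itself modifies pages; the helper names and the 16-byte scratch file
are made up for illustration.

```python
import mmap
import os
import tempfile

# x86 int3 opcode, used here only as a recognisable marker byte.
BREAKPOINT = b"\xcc"


def make_fake_text(size=16):
    """Create a scratch file filled with 0x90 bytes standing in for text."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\x90" * size)
    return path


def patch_through_mapping(path, offset, access):
    """Write one marker byte at 'offset' through a mapping created with the
    given access mode, then return the file contents as read back from the
    file afterwards."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=access) as m:
            m[offset:offset + 1] = BREAKPOINT
            if access == mmap.ACCESS_WRITE:
                m.flush()  # push the shared mapping's change to the file
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    path = make_fake_text()
    # Private (copy-on-write) mapping: the file is left untouched.
    print(patch_through_mapping(path, 4, mmap.ACCESS_COPY))
    # Shared mapping: the marker byte ends up in the file.
    print(patch_through_mapping(path, 4, mmap.ACCESS_WRITE))
    os.remove(path)
```

The private mapping leaves the file unchanged, while the shared mapping
modifies it, which is the case that raises the concerns about
integrity checkers and corrupted executables.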

--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-51776329

2006-03-28 17:29:26

by Arjan van de Ven

Subject: Re: RFC - Approaches to user-space probes

> But, if we do a objdump on the probed executable, we are seeing the
> breakpoints in the disassembly. We are looking into this issue.
> Does this mean that breakpoints are written to the disk?

it means tripwire and similar tools will raise big alarms at least.

2006-03-28 20:36:37

by Frank Ch. Eigler

Subject: Re: RFC - Approaches to user-space probes

Prasanna S Panchamukhi <[email protected]> writes:

> [...]
> If the executable is mmaped shared, then those mappings will get written
> back to the disk.
> Writing to the disk is not the requirement for user-space probes, it is
> just the side effect [...]

It's pretty clear that writing the dirtied pages to disk is an
*undesirable* side-effect, and should be eliminated. (Among many
other scenarios, imagine a kernel shutting down without all the probes
being cleanly removed. Then the executables are irretrievably
corrupted.)

- FChE

2006-03-28 20:45:16

by Andi Kleen

Subject: Re: RFC - Approaches to user-space probes

On Tuesday 28 March 2006 22:35, Frank Ch. Eigler wrote:
> Prasanna S Panchamukhi <[email protected]> writes:
>
> > [...]
> > If the executable is mmaped shared, then those mappings will get written
> > back to the disk.
> > Writing to the disk is not the requirement for user-space probes, it is
> > just the side effect [...]
>
> It's pretty clear that writing the dirtied pages to disk is an
> *undesirable* side-effect, and should be eliminated. (Among many
> other scenarios, imagine a kernel shutting down without all the probes
> being cleanly removed. Then the executables are irretrievably
> corrupted.)

That's pretty hard unfortunately. There are plenty of operations
that just access the page cache directly for IO
(like sendfile or mmap or other IO)

And allocating bounce buffers is tricky because the IO paths
cannot allocate memory reliably without deadlocks (or rather
it would require mempool and other inefficient measures)

Doing forced COW like ptrace is probably more practical.

-Andi

2006-03-31 11:55:34

by S. P. Prasanna

Subject: Re: RFC - Approaches to user-space probes

On Tue, Mar 28, 2006 at 03:35:43PM -0500, Frank Ch. Eigler wrote:
> Prasanna S Panchamukhi <[email protected]> writes:
>
> > [...]
> > If the executable is mmaped shared, then those mappings will get written
> > back to the disk.
> > Writing to the disk is not the requirement for user-space probes, it is
> > just the side effect [...]
>
> It's pretty clear that writing the dirtied pages to disk is an
> *undesirable* side-effect, and should be eliminated. (Among many
> other scenarios, imagine a kernel shutting down without all the probes
> being cleanly removed. Then the executables are irretrievably
> corrupted.)

Frank,

What would be the typical situations where the text section in the
executable is mapped with 'MAP_SHARED'?
This information will help solve the problem easily.

Thanks
Prasanna
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-51776329

2006-03-31 16:25:55

by Frank Ch. Eigler

Subject: Re: RFC - Approaches to user-space probes

Hi -

On Fri, Mar 31, 2006 at 05:25:29PM +0530, Prasanna S Panchamukhi wrote:
> [...]
> > It's pretty clear that writing the dirtied pages to disk is an
> > *undesirable* side-effect, and should be eliminated. [...]
>
> What would the typical situations where the text section in the
> executable is mapped with 'MAP_SHARED'?

Even if such usage is not typical, if it is legal, it may open a
vulnerability. Imagine an unprivileged attacker doing just such an
mmap on some key shared library or executable, hoping that someone
else puts user-kprobes in there.

- FChE
