2008-06-16 08:20:30

by Renzo Davoli

Subject: [PATCH 0/1] ptrace_vm: let us simplify the code for ptrace and add useful features for VM

Proposal: let us simplify
PTRACE_SYSCALL/PTRACE_SINGLESTEP/PTRACE_SYSEMU/PTRACE_SYSEMU_SINGLESTEP,
and now PTRACE_BLOCKSTEP (which will require soon a PTRACE_SYSEMU_BLOCKSTEP),
my PTRACE_SYSVM...etc. etc.

Summary of the solution:
Use tags in the "addr" parameter of existing
PTRACE_SYSCALL/PTRACE_SINGLESTEP/PTRACE_CONT/PTRACE_BLOCKSTEP calls
to skip the current call (PTRACE_VM_SKIPCALL) or skip the second upcall to
the VM/debugger after the syscall execution (PTRACE_VM_SKIPEXIT).

Note:
The patch is against linux-2.6.26-rc6; it also applies to git2, with some
line offset warnings.

Motivation:

The ptrace tag PTRACE_SYSEMU is a feature mainly used for User-Mode Linux,
or at most for other virtual machines aiming to virtualize *all* the syscalls
(total virtual machines).

In fact:
ptrace(PTRACE_SYSEMU, pid, 0, 0)
means that the *next* system call will not be executed.
PTRACE_SYSEMU AFAIK has been implemented only for x86_32.

I already proposed some time ago a different tag: PTRACE_SYSVM
(and I maintain a patch for it) where:
ptrace(PTRACE_SYSVM, pid, XXX, 0)
1* is the same as PTRACE_SYSCALL when XXX==0,
2* skips the call (and stops before entering the next syscall) when
PTRACE_VM_SKIPCALL | PTRACE_VM_SKIPEXIT
3* skips the ptrace call after the system call if PTRACE_VM_SKIPEXIT.
PTRACE_SYSVM has been implemented for x86_32, powerpc_32, um+x86_32.
(x86_64 and ppc64 exist too, but are less tested).

The main difference between SYSEMU and SYSVM is that with SYSVM it is possible
to decide if *this* system call should be executed or not (instead of the next
one).
SYSVM can be used also for partial virtual machines (some syscalls get
virtualized and some others do not), like our umview.

PTRACE_SYSVM above can be used instead of PTRACE_SYSEMU in user-mode linux
and in all the other total virtual machines. In fact, provided user-mode linux
skips *all* the syscalls, it does not matter if the upcall happens just after
(SYSEMU) or just before (SYSVM) having skipped the syscall.

In short, I would like to unify SYSCALL, SYSEMU and SYSVM.
We don't need three different tags (and all their "variations",
SINGLESTEP->SYSEMU_SINGLESTEP etc).

We could keep PTRACE_SYSCALL, using the addr parameter as in PTRACE_SYSVM.
In this case all the code I have seen (user-mode linux, strace, umview,
and what I found googling around) uses 0 or 1 for addr (it is defined as
unused). By defining PTRACE_VM_SKIPCALL=4 and PTRACE_VM_SKIPEXIT=2 (i.e. by
ignoring the lsb), everything previously coded using PTRACE_SYSCALL should
continue to work.
In the same way PTRACE_SINGLESTEP, PTRACE_CONT and PTRACE_BLOCKSTEP can use
the same tags restarting after a SYSCALL.
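
The backward-compatibility argument above can be sketched in a few lines of
C. This is only an illustration: the two tag values are the ones given in
this message, while the mask macro and the helper name vm_tags_from_addr are
hypothetical and not taken from the patch.

```c
/* Tag values as proposed in this message; they are not defined by
 * mainline <sys/ptrace.h>. */
#define PTRACE_VM_SKIPEXIT 2UL
#define PTRACE_VM_SKIPCALL 4UL
#define PTRACE_VM_MASK     (PTRACE_VM_SKIPEXIT | PTRACE_VM_SKIPCALL)

/* Hypothetical helper: keep only the VM tags from "addr".
 * Bit 0 is ignored, so legacy callers passing 0 or 1 get no tags
 * and therefore the unchanged PTRACE_SYSCALL behavior. */
static unsigned long vm_tags_from_addr(unsigned long addr)
{
    return addr & PTRACE_VM_MASK;
}
```

With this masking, addr values 0 and 1 both decode to "no tags", which is
why existing tracers such as strace would not notice the change.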

This change would eventually simplify both the kernel code
(reducing tags and exceptions) and even user-mode linux and umview.

The skip-exit feature can be implemented in an arch-independent
manner, while skip-call needs some simple arch-specific changes
(the entry assembly code should process the return value of the syscall
tracing function call, as in arch/x86/kernel/entry_32.S).

Motivation summary:
1) (eventually) Reduce the number of PTRACE tags. The proposed patch
does not add any tag. On the contrary after a period of deprecation
SYSEMU* tags can be eliminated.
2) Backward compatible with existing software (existing UML kernels,
strace already tested). Only software using strange "addr" values
(currently ignored) could have portability problems.
3) (eventually) simplify kernel code. SYSEMU support is a bit messy and
x86/32 only. These new PTRACE_VM tags for the addr parameter will allow us to
get rid of SYSEMU code.
4) It is simple to port across architectures.
This patch already supports PTRACE_VM_SKIPEXIT for all architectures and
PTRACE_VM_SKIPCALL for x86_32/64 (incl. x86_64 emu32), powerpc32/64, UML.
(to be honest I have tested the code on all the architectures above but
powerpc64. It does not mean that it is bug free even on the others, but at
least I have tried some UML kernels and some umview runs).
5) It is more powerful than PTRACE_SYSEMU. It provides optimized support for
partial virtualization (some syscalls get virtualized, some others do
not) while keeping support for total virtualization à la UML.
6) Software currently using PTRACE_SYSEMU can be easily ported to this
new support. The porting for UML (client side) is already in the patch.
All the calls like:
ptrace(PTRACE_SYSEMU, pid, 0, 0)
can be converted into
ptrace(PTRACE_SYSCALL, pid, PTRACE_VM_SKIPCALL, 0)
(except the first PTRACE_SYSCALL, the one which starts up the emulation.
In practice it is possible to set PTRACE_VM_SKIPCALL for the first call,
too: the "addr" tag is ignored, as no syscall is pending).
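
As a rough sketch of the porting step in point 6, a tracer could wrap the
resume call as below. The tag value comes from this message; the helper names
total_vm_addr and resume_skipping_call are hypothetical. On a kernel without
the patch the addr argument of PTRACE_SYSCALL is ignored, so this degrades to
a plain PTRACE_SYSCALL.

```c
#include <sys/ptrace.h>
#include <sys/types.h>

/* Tag value as proposed in this message; not defined by mainline headers. */
#define PTRACE_VM_SKIPCALL 4UL

/* Hypothetical helper: the "addr" value a total VM passes on every resume. */
static unsigned long total_vm_addr(void)
{
    return PTRACE_VM_SKIPCALL;
}

/* Replacement for the old call
 *     ptrace(PTRACE_SYSEMU, pid, 0, 0);
 * The tracee is resumed and, with the patch applied, the syscall it is
 * stopped at is skipped. Returns -1 and sets errno on failure. */
static long resume_skipping_call(pid_t pid)
{
    return ptrace(PTRACE_SYSCALL, pid, (void *)total_vm_addr(), (void *)0);
}
```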

The same feature has been implemented also against the new ptrace running on
McGrath's utrace support. This specific patch can be found here:
http://view-os.svn.sourceforge.net/viewvc/view-os/trunk/kmview-kernel-module/kernel_patches/

renzo


2008-06-17 16:26:46

by Jeff Dike

Subject: Re: [PATCH 0/1] ptrace_vm: let us simplify the code for ptrace and add useful features for VM

On Mon, Jun 16, 2008 at 09:58:04AM +0200, Renzo Davoli wrote:
> Summary of the solution:
> Use tags in the "addr" parameter of existing
> PTRACE_SYSCALL/PTRACE_SINGLESTEP/PTRACE_CONT/PTRACE_BLOCKSTEP calls
> to skip the current call (PTRACE_VM_SKIPCALL) or skip the second upcall to
> the VM/debugger after the syscall execution (PTRACE_VM_SKIPEXIT).

On the whole, I'm in favor of generalizing ptrace, especially if it
also simplifies the interface and code. Some notes below...

> I already proposed some time ago a different tag: PTRACE_SYSVM
> (and I maintain a patch for it) where:
> ptrace(PTRACE_SYSVM, pid, XXX, 0)
> 1* is the same as PTRACE_SYSCALL when XXX==0,
> 2* skips the call (and stops before entering the next syscall) when
> PTRACE_VM_SKIPCALL | PTRACE_VM_SKIPEXIT

There's a symmetry implied in the PTRACE_VM_SKIPCALL and
PTRACE_VM_SKIPEXIT names which doesn't exist in reality. SKIPEXIT (as
you note later) merely omits the notification on system call return.
SKIPCALL keeps the notification, but omits the system call execution,
so the effects are very different from each other.

I think this is just a naming issue - we don't want the names to fake
people into assuming things which aren't true.

> SYSVM can be used also for partial virtual machines (some syscalls get
> virtualized and some others do not), like our umview.

BTW, if performance is the issue here (and I don't see any other
compelling reasons for it), there are other possibilities which
provide much better performance. Any PTRACE_* variant will have at
least one notification. While there is a noticeable gain over two
notifications, that's marginal compared to no notifications at all.
If you know ahead of time what system calls you want to trace, a
system call tracing mask lets you avoid those notifications totally.
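
The idea of a tracing mask can be sketched as a per-tracee bitmap indexed by
syscall number. This is a generic illustration, not the interface of the
patch mentioned below; the type, the function names and the 512-entry bound
are all assumptions.

```c
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define MAX_SYSCALLS 512  /* assumed upper bound on syscall numbers */
#define WORD_BITS    (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS   ((MAX_SYSCALLS + WORD_BITS - 1) / WORD_BITS)

/* Hypothetical mask: one bit per syscall number. */
struct syscall_mask {
    unsigned long bits[MASK_WORDS];
};

/* Start with no syscall traced. */
static void mask_clear(struct syscall_mask *m)
{
    memset(m->bits, 0, sizeof m->bits);
}

/* Ask for notifications on syscall number nr. */
static void mask_trace(struct syscall_mask *m, unsigned nr)
{
    if (nr < MAX_SYSCALLS)
        m->bits[nr / WORD_BITS] |= 1UL << (nr % WORD_BITS);
}

/* True if syscall nr should stop the tracee; untraced calls would run
 * with no notification at all. */
static bool mask_wants(const struct syscall_mask *m, unsigned nr)
{
    return nr < MAX_SYSCALLS &&
           ((m->bits[nr / WORD_BITS] >> (nr % WORD_BITS)) & 1UL);
}
```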

I wrote up a patch a couple of years ago -
http://marc.info/?l=user-mode-linux-devel&m=114495242202954&w=2
but the interface implemented there isn't very good.

Jeff

--
Work email - jdike at linux dot intel dot com

2008-06-17 19:08:52

by Renzo Davoli

Subject: Re: [PATCH 0/1] ptrace_vm: let us simplify the code for ptrace and add useful features for VM

On Tue, Jun 17, 2008 at 12:25:11PM -0400, Jeff Dike wrote:
> On the whole, I'm in favor of generalizing ptrace, especially if it
> also simplifies the interface and code. Some notes below...
So, we agree on this.
>
> > I already proposed some time ago a different tag: PTRACE_SYSVM
> > (and I maintain a patch for it) where:
> > ptrace(PTRACE_SYSVM, pid, XXX, 0)
> > 1* is the same as PTRACE_SYSCALL when XXX==0,
> > 2* skips the call (and stops before entering the next syscall) when
> > PTRACE_VM_SKIPCALL | PTRACE_VM_SKIPEXIT
> There's a symmetry implied in the PTRACE_VM_SKIPCALL and
> PTRACE_VM_SKIPEXIT names which doesn't exist in reality. SKIPEXIT (as
> you note later) merely omits the notification on system call return.
> SKIPCALL keeps the notification, but omits the system call execution,
> so the effects are very different from each other.
Maybe we can find out better tag names.
In the patch I submitted PTRACE_VM_SKIPCALL implies PTRACE_VM_SKIPEXIT
as it is useless to have a notification after nothing has been done.
So, there are three behaviors after the first notification:
0 -> do the syscall and notify after it
PTRACE_VM_SKIPEXIT -> do the syscall and do not notify after it
PTRACE_VM_SKIPCALL -> skip everything.
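
The three behaviors can be written down as a small decoding function. This
is a sketch under the assumptions of this thread: the tag values are the
proposed ones, and the struct and function names are hypothetical rather
than taken from the patch.

```c
#include <stdbool.h>

/* Tag values as proposed in this thread. */
#define PTRACE_VM_SKIPEXIT 2UL
#define PTRACE_VM_SKIPCALL 4UL

/* What happens after the first (syscall entry) notification. */
struct vm_plan {
    bool run_syscall;  /* does the kernel execute the call? */
    bool notify_exit;  /* does the tracer get the second stop? */
};

/* Hypothetical decoder for the "addr" tags. */
static struct vm_plan vm_plan_for(unsigned long addr)
{
    struct vm_plan p = { true, true };   /* addr == 0: plain PTRACE_SYSCALL */

    if (addr & PTRACE_VM_SKIPCALL) {
        p.run_syscall = false;
        p.notify_exit = false;           /* SKIPCALL implies SKIPEXIT */
    } else if (addr & PTRACE_VM_SKIPEXIT) {
        p.notify_exit = false;
    }
    return p;
}
```
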
>
> I think this is just a naming issue - we don't want the names to fake
> people into assuming things which aren't true.
Please help me to find better tag names.
>
> > SYSVM can be used also for partial virtual machines (some syscalls get
> > virtualized and some others do not), like our umview.
> BTW, if performance is the issue here (and I don't see any other
> compelling reasons for it), there are other possibilities which
> provide much better performance. Any PTRACE_* variant will have at
> least one notification. While there is a noticeable gain over two
> notifications, that's marginal compared to no notifications at all.
> If you know ahead of time what system calls you want to trace, a
> system call tracing mask lets you avoid those notifications totally.
There is a misunderstanding about what I meant by "some syscalls get
virtualized and some others do not". Obviously it is my fault: it
was poorly explained. Let me briefly describe our partial virtual
machines to explain one possible application for these tags.
(the complete documentation of the project can be found here:
wiki.virtualsquare.org).

umview (and now kmview using a kernel module based on utrace) decides if
a syscall must be virtualized or not depending on the value of its
arguments, not on the syscall number. With "system call" I mean "call of
a system call", a "system call call";-)

For example, *mview {umview,kmview} can virtualize just a subtree of the
file system, thus an "open" system call gets virtualized only if the path
refers to a file in the subtree. Consequently a system call like "read"
becomes virtual if the file descriptor was created by a virtualized
open, otherwise the process executes the standard read provided by the
kernel.
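
A toy model of this argument-based decision is sketched below. It is not
umview code: the struct, the function names and the fixed-size descriptor
table are assumptions, and the subtree root is assumed to be an absolute
path without a trailing slash (and not "/" itself).

```c
#include <stdbool.h>
#include <string.h>

#define MAX_FDS 1024  /* assumed size of the descriptor table */

/* Hypothetical monitor state for one traced process. */
struct vm_state {
    const char *root;            /* virtualized subtree, e.g. "/mnt/virt" */
    bool virtual_fd[MAX_FDS];    /* true if the fd came from a virtual open */
};

/* An open() is virtualized only if the path lies inside the subtree.
 * The match must end on a path component boundary. */
static bool path_is_virtual(const struct vm_state *vm, const char *path)
{
    size_t n = strlen(vm->root);

    if (strncmp(path, vm->root, n) != 0)
        return false;
    return path[n] == '\0' || path[n] == '/';
}

/* Record, at open time, whether the new descriptor is virtual. */
static void note_open(struct vm_state *vm, int fd, const char *path)
{
    if (fd >= 0 && fd < MAX_FDS)
        vm->virtual_fd[fd] = path_is_virtual(vm, path);
}

/* A read() becomes virtual if its descriptor was opened virtually. */
static bool read_is_virtual(const struct vm_state *vm, int fd)
{
    return fd >= 0 && fd < MAX_FDS && vm->virtual_fd[fd];
}
```

The point of the sketch is that the decision depends on the syscall
arguments (path, file descriptor), so it cannot be made from the syscall
number alone.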

In this way users can (virtually) mount file system images just for the
processes running inside a *mview instance, or run user-level network
stacks, virtual devices, define their own perspective on everything
(uid, gid, system name). We have virtualized even the pace of the time
flowing.

We do not "boot" a different kernel, there are just modules that users
can combine to virtualize different entities:
- umfuse for the file system
- umnet for networking
- umdev for devices
- umtime, umbinfmt, umname...

We need all the different behaviors listed above.
PTRACE_VM_SKIPCALL -> for the system calls we virtualize.
PTRACE_VM_SKIPEXIT -> for the non-virtualized system calls.
0 -> sometimes we need the kernel to execute a different system call
or we just need to provide the process with a different output.
In the "open" situation above, we need the kernel to run something to
acquire a real file descriptor, as the process sees a mix of real and
virtual open files.
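
The mapping from the monitor's decision to the "addr" tag could look like
the sketch below. The tag values are the proposed ones; the enum and the
function name are hypothetical and only illustrate the three cases listed
above.

```c
/* Tag values as proposed in this thread. */
#define PTRACE_VM_SKIPEXIT 2UL
#define PTRACE_VM_SKIPCALL 4UL

/* Hypothetical per-call decision taken at the entry notification. */
enum vm_action {
    VM_PASS_THROUGH,  /* not virtualized: let the kernel run it untouched */
    VM_EMULATE,       /* fully virtualized: the monitor provides the result */
    VM_REWRITE        /* kernel runs a (possibly different) call and the
                         monitor fixes up the result at the exit stop */
};

/* The "addr" value to pass to PTRACE_SYSCALL when resuming the tracee. */
static unsigned long addr_for_action(enum vm_action a)
{
    switch (a) {
    case VM_EMULATE:
        return PTRACE_VM_SKIPCALL;
    case VM_PASS_THROUGH:
        return PTRACE_VM_SKIPEXIT;
    case VM_REWRITE:
    default:
        return 0;  /* both notifications, syscall executed */
    }
}
```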

I think that other projects can benefit from this generalization, while
UML can use PTRACE_VM_SKIPCALL as it is currently using PTRACE_SYSEMU,
maybe extending this optimization to other architectures.

renzo

2008-06-18 16:51:27

by Jeff Dike

Subject: Re: [PATCH 0/1] ptrace_vm: let us simplify the code for ptrace and add useful features for VM

On Tue, Jun 17, 2008 at 09:08:31PM +0200, Renzo Davoli wrote:
> 0 -> do the syscall and notify after it

To be more precise -
do the call notification, do the syscall, and do the return notification
> PTRACE_VM_SKIPEXIT -> do the syscall and do not notify after it
don't do the return notification
> PTRACE_VM_SKIPCALL -> skip everything.
don't do the syscall or return notification

Looking at things this way, it seems like you might want three flags,
since the asymmetry is caused by two things being bundled into
SKIPCALL.

If you have
PTRACE_VM_SKIPEXIT - skip the return notification
PTRACE_VM_SKIPCALL - skip the syscall
PTRACE_VM_SKIPSTART - skip the call notification
this makes the meaning make more sense to me.

The downside of this is that you end up with at least one combination that
doesn't make too much sense, like PTRACE_VM_SKIPCALL (do both
notifications even though nothing could have changed in between).

> umview (and now kmview using a kernel module based on utrace) decides if
> a syscall must be virtualized or not depending on the value of its
> arguments, not on the syscall number. With "system call" I mean "call of
> a system call", a "system call call";-)

OK, if you're looking at the arguments in order to decide what to do,
then you can't just mask out the notifications.

Jeff

--
Work email - jdike at linux dot intel dot com

2008-06-22 09:11:31

by Renzo Davoli

Subject: Re: [PATCH 0/1] ptrace_vm: let us simplify the code for ptrace and add useful features for VM

On Wed, Jun 18, 2008 at 12:49:42PM -0400, Jeff Dike wrote:
> Looking at things this way, it seems like you might want three flags,
> since the asymmetry is caused by two things being bundled into
> SKIPCALL.
> If you have
> PTRACE_VM_SKIPEXIT - skip the return notification
> PTRACE_VM_SKIPCALL - skip the syscall
> PTRACE_VM_SKIPSTART - skip the call notification
> this makes the meaning make more sense to me.

Jeff,

There are three events for a syscall:
START - call notification
CALL - run the SYSCALL
EXIT - return notification.

I think that it is nonsense to write code for useless cases.
Let us see all the combinations of doing/skipping each one of the three
phases:

0- DOSTART - DOCALL - DOEXIT - Standard PTRACE_SYSCALL (new option 0)
1- DOSTART - DOCALL - SKIPEXIT - PTRACE_VM_SKIPEXIT of my proposal
2- DOSTART - SKIPCALL - DOEXIT - useless, nothing has changed between
the two notifications
3- DOSTART - SKIPCALL - SKIPEXIT - PTRACE_VM_SKIPCALL in my proposal
4- SKIPSTART - DOCALL - DOEXIT - is this useful? (case 4, see below)
5- SKIPSTART - DOCALL - SKIPEXIT - simply don't use PTRACE_SYSCALL
6- SKIPSTART - SKIPCALL - DOEXIT - this is the old PTRACE_SYSEMU (case 6)
7- SKIPSTART - SKIPCALL - SKIPEXIT - nullify completely the syscalls
(case 7).

case 4: a vm or debugging monitor receives just the return value of a
syscall. In many architectures it is not even possible to read the parameters
of the call (e.g. powerpc where the first argument and the return value
use the same register). This choice must be made a priori, so without
actually knowing which will be the next system call.

case 6: this makes sense just for applications which virtualize *all* the
system calls; current PTRACE_SYSEMU works exactly in this way.
My patch shows that for these applications it does not matter whether the
virtualization takes place before skipping the call or after having just
skipped the call. So PTRACE_VM_SKIPCALL can be used instead.

case 7: skip the next syscall and give no information about it; there is no
way to virtualize or trace what is going on.
Who could ever be interested in an option like this?

It seems that the combinations that really make sense are those skipping
a trailing part of the sequence.

DOSTART - DOCALL - DOEXIT my option 0
DOSTART - DOCALL - SKIPEXIT my option PTRACE_VM_SKIPEXIT
DOSTART - SKIPCALL - SKIPEXIT my option PTRACE_VM_SKIPCALL
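
The selection argued for in this message can be checked mechanically. In the
sketch below the three bits are chosen only so that the combination index
matches the case numbers 0-7 in the list above; they are not the proposed
"addr" values, and the function names are hypothetical.

```c
#include <stdbool.h>

/* Bit layout chosen to reproduce the case numbering 0..7 above. */
#define SKIP_EXIT  1u
#define SKIP_CALL  2u
#define SKIP_START 4u

/* A combination is kept if the entry notification is delivered and the
 * skipped phases form a trailing part of START, CALL, EXIT. */
static bool combo_is_kept(unsigned combo)
{
    if (combo & SKIP_START)
        return false;                       /* cases 4..7 */
    if ((combo & SKIP_CALL) && !(combo & SKIP_EXIT))
        return false;                       /* case 2 */
    return true;
}

/* Bitmap of kept cases: bit n is set if case n is kept. */
static unsigned kept_cases(void)
{
    unsigned map = 0;

    for (unsigned c = 0; c < 8; c++)
        if (combo_is_kept(c))
            map |= 1u << c;
    return map;
}
```

Only cases 0, 1 and 3 survive, which are exactly option 0,
PTRACE_VM_SKIPEXIT and PTRACE_VM_SKIPCALL.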

If you think that it is not clear from the tag name that PTRACE_VM_SKIPCALL
implies PTRACE_VM_SKIPEXIT, let us change the name to:
PTRACE_VM_SKIPCALL_SKIPEXIT
Maybe the name is quite long, but in this way it is clear what it does.

Here are some problems I have in implementing all the combinations of
do/skip.
Case 2 should be implemented by faking the second notification just after
the first (just for the sake of completeness, it is useless!).
For Case 4 to 7, we'd need to keep a global process state bit because
a previous PTRACE_SYSCALL parameter affects the execution of the next
syscall.

Given that an application needs to trace the syscalls executed
by a target process, why would anybody ever want to skip or do something
for the next syscall without even knowing which system call it is? If
somebody wants to trace system calls it is reasonable that he/she wants
to be notified about each syscall first and then decide how to manage
the actual call.

In short, if we can envision applications that can reasonably use other
combinations of DO-SKIP flags, I'll be glad to rewrite a more general
patch, otherwise these options just add some complexity to the code
(against the premises), slow down (a bit) the execution speed and waste
kernel programmers' time.

renzo

2008-06-27 17:52:14

by Jeff Dike

Subject: Re: [PATCH 0/1] ptrace_vm: let us simplify the code for ptrace and add useful features for VM

On Sun, Jun 22, 2008 at 11:11:02AM +0200, Renzo Davoli wrote:
> There are three events for a syscall:
> START - call notification
> CALL - run the SYSCALL
> EXIT - return notification.
>
> I think that it is nonsense to write code for useless cases.
> Let us see all the combinations of doing/skipping each one of the three
> phases:
>
> 0- DOSTART - DOCALL - DOEXIT - Standard PTRACE_SYSCALL (new option 0)
> 1- DOSTART - DOCALL - SKIPEXIT - PTRACE_VM_SKIPEXIT of my proposal
> 2- DOSTART - SKIPCALL - DOEXIT - useless, nothing has changed between
> the two notifications
> 3- DOSTART - SKIPCALL - SKIPEXIT - PTRACE_VM_SKIPCALL in my proposal
> 4- SKIPSTART - DOCALL - DOEXIT - is this useful? (case 4, see below)
> 5- SKIPSTART - DOCALL - SKIPEXIT - simply don't use PTRACE_SYSCALL
> 6- SKIPSTART - SKIPCALL - DOEXIT - this is the old PTRACE_SYSEMU (case 6)
> 7- SKIPSTART - SKIPCALL - SKIPEXIT - nullify completely the syscalls
> (case 7).
>
> case 4: a vm or debugging monitor receives just the return value of a
> syscall. In many architectures it is not even possible to read the parameters
> of the call (e.g. powerpc where the first argument and the return value
> use the same register). This choice must be made a priori, so without
> actually knowing which will be the next system call.

I can see this being useful - this is kind of what strace wants,
except that it wouldn't be able to see that a system call is about to
sleep. This could be implemented by just stashing any trashed
registers off to the side à la x86 orig_eax.
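
The stashing idea can be shown with a toy register file. This is not kernel
code: the struct and function names are invented for the illustration, and a
single register r3 stands in for a register shared by the first argument and
the return value.

```c
/* Toy register file: r3 carries the first argument on entry and the
 * return value on exit; orig_r3 is the stashed copy (hypothetical names). */
struct toy_regs {
    long r3;
    long orig_r3;
};

/* At syscall entry: save the argument before it can be overwritten. */
static void toy_syscall_entry(struct toy_regs *regs)
{
    regs->orig_r3 = regs->r3;
}

/* At syscall return: the result replaces the argument in r3, but a
 * tracer stopped here can still read the argument from orig_r3. */
static void toy_syscall_return(struct toy_regs *regs, long ret)
{
    regs->r3 = ret;
}
```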

> case 6: this makes sense just for applications which virtualize *all* the
> system calls; current PTRACE_SYSEMU works exactly in this way.
> My patch shows that for these applications it does not matter whether the
> virtualization takes place before skipping the call or after having just
> skipped the call. So PTRACE_VM_SKIPCALL can be used instead.

Yup.

> case 7: skip the next syscall and give no information about it; there is no
> way to virtualize or trace what is going on.
> Who could ever be interested in an option like this?

No one.

> It seems that the combinations that really make sense are those skipping
> a trailing part of the sequence.
>
> DOSTART - DOCALL - DOEXIT my option 0
> DOSTART - DOCALL - SKIPEXIT my option PTRACE_VM_SKIPEXIT
> DOSTART - SKIPCALL - SKIPEXIT my option PTRACE_VM_SKIPCALL

Seems reasonable. In this case, they should be numbered 0, 1, 2
rather than having masks or-ed together. This happens to produce the
same numbers, except that 3 is outlawed.
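
A sketch of the numbered variant, assuming the values 0, 1, 2 suggested
here: the enumerator names and the validation helper are hypothetical, and
returning -EINVAL for other values is just one way to "outlaw" 3.

```c
#include <errno.h>

/* Hypothetical enumeration: plain numbers rather than or-able masks. */
enum vm_mode {
    VM_MODE_TRACE_ALL = 0,  /* DOSTART - DOCALL   - DOEXIT   */
    VM_MODE_SKIPEXIT  = 1,  /* DOSTART - DOCALL   - SKIPEXIT */
    VM_MODE_SKIPCALL  = 2   /* DOSTART - SKIPCALL - SKIPEXIT */
};

/* Validate "addr" as a mode number. Returns 0 and stores the mode on
 * success, or -EINVAL for anything outside 0..2 (including 3). */
static int vm_mode_from_addr(unsigned long addr, enum vm_mode *out)
{
    if (addr > 2)
        return -EINVAL;
    *out = (enum vm_mode)addr;
    return 0;
}
```

Note that this numbering would treat a legacy addr value of 1 as a request
to skip the exit notification, unlike the mask values 2 and 4 proposed
earlier in the thread.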

> If you think that it is not clear from the tag name that PTRACE_VM_SKIPCALL
> implies PTRACE_VM_SKIPEXIT, let us change the name to:
> PTRACE_VM_SKIPCALL_SKIPEXIT
> Maybe the name is quite long, but in this way it is clear what it
> does.

Maybe. How about PTRACE_VM_TRACESTART? Makes the naming somewhat
non-orthogonal, but shorter and descriptive.

Jeff

--
Work email - jdike at linux dot intel dot com