2006-05-18 15:54:10

by Renzo Davoli

[permalink] [raw]
Subject: ptrace enhancements for VM support (patch proposals follow in sep.msgs)

I am sending with three separate messages (as replies to this) a set of
proposed patches for a better support of virtual machines through ptrace.

We have developed the patches during the implementation of umview, the
user-mode prototype of the view-os project.
(Those interested in the project can read a short abstract after the signature
or refer to the project site on savannah:
http://savannah.nongnu.org/projects/view-os
).
Although the patches have increased in a significant way the performance
of our partial virtual machine implementation, the patches can be useful
for any project related to virtualization, e.g. User-Mode Linux.

Here is a short summary of the patches:
#1: access_process_vm_user. This is a more efficient implementation of
ptrace_readdata, ptrace_writedata, access_process_vm (and it adds the
ptrace_readstringdata function). -- kernel/ptrace.c
#2: management of the PTRACE_MULTI tag for ptrace. It is possible by this
tag to pack several requests in a single system call reducing the number
of context switches.
#3: management of the PTRACE_SYSVM tag. With this call during the processing
of the pre-syscall ptraced process stop, it is possible to choose three
different behaviors:
- The ptraced process runs the syscall and then stops again(like PTRACE_SYSCALL)
- The ptraced process runs the syscall and does not stop until the next syscall
(useful to run the real syscall when the Virtual Machine does not manage the
call)
- The ptraced process does not run the syscall (and neither stops again).
This latter behavior is useful for syscall completely implemented by the
virtual machine.
PTRACE_SYSVM is an extension of the PTRACE_SYSCALL and includes also
the feature implemented by STRACE_SYSEMU. (We have a prototype User-Mode Linux
patch which uses SYSVM instead of SYSEMU).

Patch #1 and #2 are architecture independent, #3 has been implemented on
i386/ppc/um.
The patches have been designed as incremental. They should be applied
#1,
#1 and #2,
or #1 #2 and #3.
#2 actually depends on #1 while applying #3 although logically independent
could just generate some complaints about the original files (shift of the
hunks or differences in the hunk contexts) if applied alone.
We suggest to patch #3 after #1 and #2.

These pathes are against 2.6.17-rc1, and we are posting them here for a general
discussion. We are updating the set of patch to the latest rc, and we will
post them here if this community feels our development interesting.
I have try to apply the patches to rc4 and it seems that they applies
correctly with some lines of offset.

>From the security point of view, these patch should not introduce new threats.
#1 re-implements what is already supported, #2 merges several system calls
in one call, the security checks formerly executed for each call are already
executed item per item, #3 integrates PTRACE_SYSCALL and PTRACE_SYSEMU
and extends the same features to other architectures.

We hope these proposals will be interesting for the ML and the kernel
development group.

I am sorry but I am not subscribed to the list, thus please
Cc to [email protected] your answers/comments.
Several members of the team, including myself, keep in touch with the ML by
reading the archives.

renzo davoli
team leader and co-main developer of view-os, (and also of vde, lwipv6, virtual
square).
together with Ludovico Gardenghi and Andrea Gasparini, main developers
and the entire staff of the project, all the members are listed on the web site.

--------------------------
Brief abstract of view-os.

What is view-os: it is the idea to give each process its own view of the
executing environment.
The common behavior where each process running on a kernel must have the
same perpective on (say) networking, file system, IPC, devices, etc. is
just a social convention that can be broken.

umview is a prototype that shows the idea and its effectiveness.

umview is a partial virtual machine, when you start the first process
inside umview and you do not preload any umview module, umview is completely
transparent: the processes inside and outside umview see the same view.
In other words a system call run by a process inside umview has the same
effect as it were issued by the same process running outside umview.

umview supports modules (pre-loaded or loaded at run time).
each system call is presented to a "choice function" of the loaded modules.
If a module "chooses" the system call it executes the system call instead
of the real kernel.
This "choice" can be based on the path (e.g. for open), file system type
(e.g. mount), protocol family (socket), or automagically chosen by fd (when
a module choose to manage socket or open or creat, all the following calls
referring to the same fd are diverted to the same module).

The state-of-the-art, up to today is the following.

- umfuse module, it is possible to mount ext2/iso file systems and potentially
all "fuse" based file system implementations can be used with umfuse.
Note that the umfuse mounted file systems are accessible only by the processes
running inside umview.
- lwipv6 module. it is possible to assign a virtual networking support to
the processes running in umview (lwipv6 is another project of ours, it is a
complete user level implementation of a IPv4/IPv6 hybrid stack). The network
interfaces can be connected to tun/tap or the a Virtual Distributed Ethernet
switch (again a project of ours, this is on sourcefourge and already included
in Debian sid and other distributions). In this way it is possible to assign an
IP addresses just to a process or to a group of processes.

There are some other younger modules included in the cvs:
- viewfs. the file system can be restructured as you want. You can
make a patchwork with the directories of your file system and say that
this is the "view" of the process. It is possible to define copy on write
access on a directory or on the entire file system. In this latter case
the processes in umview modify the files in their view but the actual files
have never been changed.
Very useful for application testing, if a buggy application messes up all
the files everything can be rolled back by restarting umview.
viewfs can be used as a security cage to run browsers. In case of browser bug,
personal sensible data has one further layer of protection.
- devfs. It is possible to define virtual devices. All the syscall (ioctl
included) to specific special files or to specific devices can be virtualized).
It is (actually will be) possible to run fdisk, mkfs, and umfuse-mount file
systems from image files. It is useful to prepare or modify images for
other virtual machines.
- umbinfmt. user-mode clone of binfmt_misc in the kernel. It is possible to
define interpreters to run almost every program. The management is the same,
if the umbinfmt virtual partition gets mounted on /proc/sys/fs/binfmt_misc,
the scripts access umbinfmt as it were binfmt_misc.

Some final remarks:
- umview supports the standard linux tools and programs (e.g. to mount a file
system, umview users run /bin/mount)
- umview runs on 2.6.x kernels (it runs *a bit slowly* on vanilla umpatched
kernels, but it runs. umview needs only ptrace). umview runs quite well on patched
kernels, expecially on >2.6.16 by exploiting the new pselect support.
- umview does not use any call or option that needs root access. umview
runs as a user-process with user permissions.
- (young feature) Module nesting is supported. e.g. It is possible to mount a
file system image which is stored on another virtual file system or
accessible by a virtual network. It is also possible to run umview inside
umview. In this way it is possible for some processes to share some parts of
their view while having specific views for other aspects. This "nested" run of
umview does not "ptrace twice" the processes, the underlying umview support
is notified that the virtual environment has forked, and it will manage the
different views independently after the view-fork.


2006-05-18 15:56:43

by Renzo Davoli

[permalink] [raw]
Subject: [PATCH] 1-access_process_vm_user

This patch modifies the functions ptrace_readdata and
ptrace_writedata already included in kernel/ptrace.c. Basically this is a more
efficient implementation of the same features.
A third function ptrace_readstringdata has been added to optimize the data
transfer of null terminated char strings (like pathnames).

A new function, access_process_vm_user(), takes care to get data from userspace
memory of a given process or to write to userspace memory, in a similar way of
the already existent access_process_vm() function.
It uses only an additional parameter: the flag "string" specify if making
checks on data, that should be a well-formed string ( null terminated ):

static int access_process_vm_user(struct task_struct *tsk, unsigned long addr, char __user *ubuf, int len, int write, int string)

We think this is a better implementation of readdata and writedata, because it
retains in cache entire memory pages, while the current implementation uses only
128 bytes buffers.

Signed-off-by: renzo davoli <[email protected]>

diff -Naur linux-2.6.17-rc1-bulk/include/linux/ptrace.h linux-2.6.17-rc1-access/include/linux/ptrace.h
--- linux-2.6.17-rc1-bulk/include/linux/ptrace.h 2006-04-03 05:22:10.000000000 +0200
+++ linux-2.6.17-rc1-access/include/linux/ptrace.h 2006-04-04 10:01:33.000000000 +0200
@@ -81,6 +81,7 @@
extern struct task_struct *ptrace_get_task_struct(pid_t pid);
extern int ptrace_traceme(void);
extern int ptrace_readdata(struct task_struct *tsk, unsigned long src, char __user *dst, int len);
+extern int ptrace_readstringdata(struct task_struct *tsk, unsigned long src, char __user *dst, int len);
extern int ptrace_writedata(struct task_struct *tsk, char __user *src, unsigned long dst, int len);
extern int ptrace_attach(struct task_struct *tsk);
extern int ptrace_detach(struct task_struct *, unsigned int);
diff -Naur linux-2.6.17-rc1-bulk/kernel/ptrace.c linux-2.6.17-rc1-access/kernel/ptrace.c
--- linux-2.6.17-rc1-bulk/kernel/ptrace.c 2006-04-03 05:22:10.000000000 +0200
+++ linux-2.6.17-rc1-access/kernel/ptrace.c 2006-04-04 10:01:33.000000000 +0200
@@ -263,54 +263,90 @@
return buf - old_buf;
}

-int ptrace_readdata(struct task_struct *tsk, unsigned long src, char __user *dst, int len)
+/*
+ * Access another process' address space to/from user space
+ * Do not walk the page table directly, use get_user_pages
+ */
+static int access_process_vm_user(struct task_struct *tsk, unsigned long addr, char __user *ubuf, int len, int write, int string)
{
- int copied = 0;
-
- while (len > 0) {
- char buf[128];
- int this_len, retval;
-
- this_len = (len > sizeof(buf)) ? sizeof(buf) : len;
- retval = access_process_vm(tsk, src, buf, this_len, 0);
- if (!retval) {
- if (copied)
+ struct mm_struct *mm;
+ struct vm_area_struct *vma;
+ struct page *page;
+ char *buf;
+ unsigned long old_addr = addr;
+
+ mm = get_task_mm(tsk);
+ if (!mm)
+ return 0;
+
+ buf=kmalloc(PAGE_SIZE, GFP_KERNEL);
+ down_read(&mm->mmap_sem);
+ /* ignore errors, just check how much was sucessfully transfered */
+ while (len) {
+ int bytes, ret, offset;
+ void *maddr;
+
+ ret = get_user_pages(tsk, mm, addr, 1,
+ write, 1, &page, &vma);
+ if (ret <= 0)
break;
- return -EIO;
+
+ bytes = len;
+ offset = addr & (PAGE_SIZE-1);
+ if (bytes > PAGE_SIZE-offset)
+ bytes = PAGE_SIZE-offset;
+
+ maddr = kmap(page);
+ if (write) {
+ __copy_from_user(buf,ubuf,bytes);
+ copy_to_user_page(vma, page, addr,
+ maddr + offset, buf, bytes);
+ if (!PageCompound(page))
+ set_page_dirty_lock(page);
+ } else {
+ copy_from_user_page(vma, page, addr,
+ buf, maddr + offset, bytes);
+ if (string) {
+ for (offset=0;offset<bytes;offset++)
+ if (buf[offset]==0)
+ break;
+ if (offset < bytes)
+ bytes=len=offset+1;
}
- if (copy_to_user(dst, buf, retval))
- return -EFAULT;
- copied += retval;
- src += retval;
- dst += retval;
- len -= retval;
+ __copy_to_user(ubuf,buf,bytes);
}
- return copied;
+ kunmap(page);
+ page_cache_release(page);
+ len -= bytes;
+ ubuf += bytes;
+ addr += bytes;
}
+ up_read(&mm->mmap_sem);
+ mmput(mm);

-int ptrace_writedata(struct task_struct *tsk, char __user *src, unsigned long dst, int len)
-{
- int copied = 0;
+ kfree(buf);
+ return addr - old_addr;
+}

- while (len > 0) {
- char buf[128];
- int this_len, retval;
-
- this_len = (len > sizeof(buf)) ? sizeof(buf) : len;
- if (copy_from_user(buf, src, this_len))
- return -EFAULT;
- retval = access_process_vm(tsk, dst, buf, this_len, 1);
- if (!retval) {
- if (copied)
- break;
+int ptrace_readdata(struct task_struct *tsk, unsigned long src, char __user *dst, int len)
+{
+ if (!access_ok(VERIFY_WRITE, dst ,len))
return -EIO;
+ return access_process_vm_user(tsk, src, dst, len, 0, 0);
}
- copied += retval;
- src += retval;
- dst += retval;
- len -= retval;
+
+int ptrace_readstringdata(struct task_struct *tsk, unsigned long src, char __user *dst, int len)
+{
+ if (!access_ok(VERIFY_WRITE, dst ,len))
+ return -EIO;
+ return access_process_vm_user(tsk, src, dst, len, 0, 1);
}
- return copied;
+
+int ptrace_writedata(struct task_struct *tsk, char __user *src, unsigned long dst, int len)
+{
+ if (!access_ok(VERIFY_READ, dst ,len))
+ return -EIO;
+ return access_process_vm_user(tsk, dst, src, len, 1, 0);
}

static int ptrace_setoptions(struct task_struct *child, long data)

2006-05-18 15:59:20

by Renzo Davoli

[permalink] [raw]
Subject: [PATCH] 2-ptrace_multi

ptmulti kernel patch inserts a new useful option for ptrace() call,
adding a new request type to ptrace() syscall.

With PTRACE_MULTI option you can send multiple ptrace requests with a
single system call: commonly a process that uses ptrace() needs
several PTRACE_PEEKDATA for getting some useful, even small pieces of data.
It is useful for these programs to run several ptrace
operations while limiting the number of context switches.

Ptrace-multi gets a "struct ptrace_multi" array (together with its
number of elements). It is in this way similar to the management of buffers
for readv or writev.
Each "struct ptrace_multi" specifies a single request of normal type of
ptrace. So you have can join several requests into a requests array that will
be passed through the "void* addr" parameter (the third) of ptrace().
The last argument specify the numbers of request to be accomplished by ptrace.

While normal ptrace requests can get a word at a time, we have added some other
request for simplify the interface between kernel
and applications that use ptrace(); these requests can get from user space more
than one word at a time:
- PTRACE_PEEKCHARDATA and PTRACE_POKECHARDATA is used for transferring general
data, like structure, buffer, and so on...
- PTRACE_PEEKSTRINGDATA get strings from user space (it relies on
access_process_vm_user patch) and it check that data returned is a well formed
string (null-terminated)

Debuggers and virtual machines (like User Mode Linux) and many other
applications that are based on ptrace can get great
performance improvements by PTRACE_MULTI: the number of system
calls (and context switches) decreases significantly.

This patch is architecture independent.

Signed-off-by: renzo davoli <[email protected]>

diff -Naur linux-2.6.17-rc1/include/linux/ptrace.h linux-2.6.17-rc1-ptmulti/include/linux/ptrace.h
--- linux-2.6.17-rc1/include/linux/ptrace.h 2006-04-04 16:18:55.000000000 +0200
+++ linux-2.6.17-rc1-ptmulti/include/linux/ptrace.h 2006-04-04 16:17:30.000000000 +0200
@@ -27,6 +27,18 @@
#define PTRACE_GETSIGINFO 0x4202
#define PTRACE_SETSIGINFO 0x4203

+#define PTRACE_MULTI 0x4300
+#define PTRACE_PEEKCHARDATA 0x4301
+#define PTRACE_POKECHARDATA 0x4302
+#define PTRACE_PEEKSTRINGDATA 0x4303
+
+struct ptrace_multi {
+ long request;
+ long addr;
+ void *localaddr;
+ long length;
+};
+
/* options set using PTRACE_SETOPTIONS */
#define PTRACE_O_TRACESYSGOOD 0x00000001
#define PTRACE_O_TRACEFORK 0x00000002
diff -Naur linux-2.6.17-rc1/kernel/ptrace.c linux-2.6.17-rc1-ptmulti/kernel/ptrace.c
--- linux-2.6.17-rc1/kernel/ptrace.c 2006-04-04 16:18:55.000000000 +0200
+++ linux-2.6.17-rc1-ptmulti/kernel/ptrace.c 2006-04-04 16:17:30.000000000 +0200
@@ -2,6 +2,7 @@
* linux/kernel/ptrace.c
*
* (C) Copyright 1999 Linus Torvalds
+ * PTRACE_MULTI support 2006 Renzo Davoli
*
* Common interfaces for "ptrace()" which we do not want
* to continually duplicate across every architecture.
@@ -503,6 +504,53 @@
return child;
}

+static int multi_ptrace(struct task_struct *child, long request, long addr, long size)
+{
+ int i, ret;
+ long j;
+ ret=0;
+ if (!access_ok(VERIFY_READ, addr,size*sizeof(struct ptrace_multi))) {
+ ret = -EIO;
+ goto out_multi_ptrace;
+ }
+ for (i=0; i<size && ret==0; i++, addr+=sizeof(struct ptrace_multi)) {
+ unsigned long len;
+ struct ptrace_multi __user pm ;
+ __copy_from_user(&pm, (struct ptrace_multi __user *)addr, sizeof(struct ptrace_multi));
+ len = pm.length;
+ switch ( pm.request){
+ case PTRACE_PEEKTEXT:
+ case PTRACE_PEEKDATA:
+ case PTRACE_PEEKUSR:
+ if (len <= 0) len=1;
+ for (j=0; j<len && ret==0; j++)
+ ret=arch_ptrace(child, pm.request, (long) (pm.addr) + j*sizeof(long), (long) (pm.localaddr) + j*sizeof(long));
+ break;
+ case PTRACE_POKETEXT:
+ case PTRACE_POKEDATA:
+ case PTRACE_POKEUSR:
+ if (len <= 0) len=1;
+ for (j=0; j<len && ret==0; j++)
+ ret=arch_ptrace(child, pm.request, (long) (pm.addr) + j*sizeof(long), *(((long *) (pm.localaddr)) + j));
+ break;
+ case PTRACE_PEEKCHARDATA:
+ ret = ptrace_readdata(child, pm.addr, pm.localaddr, len);
+ break;
+ case PTRACE_POKECHARDATA:
+ ret = ptrace_writedata(child, pm.localaddr, pm.addr, len);
+ break;
+ case PTRACE_PEEKSTRINGDATA:
+ ret = ptrace_readstringdata(child, pm.addr, pm.localaddr, len);
+ break;
+ default:
+ ret=arch_ptrace(child, pm.request, (long) (pm.addr), (long) (pm.localaddr));
+ break;
+ }
+ }
+ out_multi_ptrace:
+ return ret;
+}
+
#ifndef __ARCH_SYS_PTRACE
asmlinkage long sys_ptrace(long request, long pid, long addr, long data)
{
@@ -533,6 +581,9 @@
if (ret < 0)
goto out_put_task_struct;

+ if (request == PTRACE_MULTI)
+ ret = multi_ptrace(child, request, addr, data);
+ else
ret = arch_ptrace(child, request, addr, data);
if (ret < 0)
goto out_put_task_struct;

2006-05-18 16:07:54

by Renzo Davoli

[permalink] [raw]
Subject: [PATCH] 3-ptrace_vm

PTRACE_SYSVM patch is a improvement for applications
that use ptrace() for creating virtual machines, like User Mode Linux
or UMview.

The concept is near to the SYSEMU patch introduced by UML: you can specify that
a system call should not be executed by the underlying kernel.

In PTRACE_SYSEMU there's a conceptual hole: you tell that _the_next_ system call
won't be called, but sometimes is important to decide whether the system
call has to be executed or not during the system call management, it is not
possible to decide in advance.
User-Mode linux redefines all the system calls, but UM-ViewOS (for example)
aims to redefine only some of them and under specifical conditions.
(thus implementing Partial Virtual Machines).
In this case a VM monitor need to run the uncatched system calls in the
ptraced process without any further interactions (which would be only a
waste of time).

Our patch is semantically much more general, in order to select between
different behavior: with SYSVM you can decide to skip both the execution of the
current system call and the upcall after the execution or just the
"after syscall" upcall.

We introduced PTRACE_SYSVM request, together with PTRACE_VM_SKIPCALL and
PTRACE_VM_SKIPEXIT options that are specified inside the third parameter
of ptrace().
PTRACE_SYSVM is intended to be used instead of PTRACE_SYSCALL during the
upcall preceding the system call execution.

PTRACE_VM_SKIPCALL skips the system call execution (by the ptraced prrocess).
PTRACE_VM_SKIPEXIT avoids the second upcall after the system call execution.

With the PTRACE_SYSVM support a ptracing process (usually the virtual machine
monitor for the ptraced process) can decide different behaviors:

** ptrace(PTRACE_SYSVM,pid,0,0): it is completely equivalent to
ptrace(PTRACE_SYSCALL,pid,0,0)

** ptrace(PTRACE_SYSVM,pid,PT_VM_SKIPEXIT,0): the syscall is executed and the
results are returned to the process (pid). This is useful when the
ptracing process does not need to virtualize the call.

** ptrace(PTRACE_SYSVM,pid,PT_VM_SKIPCALL | PT_VM_SKIPEXIT,0): the syscall is
*not* executed nor there is a further upcall. This possibility is used when
the virtual machine need to virtualize the call, the return value, exit status
etc. should be loaded in the process address space before this call (maybe
using a PTRACE_MULTI call).

(In theory it is also possible to use a call like
ptrace(PTRACE_SYSVM,pid,VM_SKIPCALL,0): in this case the system call is not
executed but a second upcall takes place anyway. This is useless as
the second upcall takes place after the kernel has done nothing after the first
one.
We decided to use two flags for the sake of clearness and readability,
and given that we needed three different behaviors we needed two bits anyway).

That patch could completely replace PTRACE_SYSEMU: User Mode Linux could easily
use our patch in place of SYSEMU.

The patch work on ppc, i386, and even um (usermodelinux) architecture.
It has given a good speedup to Umview (from some rough tests it seems up to
1.8,1.85 on system call intensive programs like cp).

Signed-off-by: renzo davoli <[email protected]>

diff -Naur linux-2.6.17-rc1-access/arch/i386/kernel/ptrace.c linux-2.6.17-rc1-pmulti-ptvm/arch/i386/kernel/ptrace.c
--- linux-2.6.17-rc1-access/arch/i386/kernel/ptrace.c 2006-04-03 05:22:10.000000000 +0200
+++ linux-2.6.17-rc1-pmulti-ptvm/arch/i386/kernel/ptrace.c 2006-04-03 11:30:53.000000000 +0200
@@ -477,15 +477,26 @@
}
break;

+ case PTRACE_SYSVM:
+ if (addr == PTRACE_VM_TEST) {
+ ret = PTRACE_VM_MASK;
+ break;
+ }
+
case PTRACE_SYSEMU: /* continue and stop at next syscall, which will not be executed */
case PTRACE_SYSCALL: /* continue and stop at next (return from) syscall */
case PTRACE_CONT: /* restart after signal. */
ret = -EIO;
if (!valid_signal(data))
break;
+ child->ptrace &= ~PT_VM_MASK;
if (request == PTRACE_SYSEMU) {
set_tsk_thread_flag(child, TIF_SYSCALL_EMU);
clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
+ } else if (request == PTRACE_SYSVM) {
+ set_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
+ clear_tsk_thread_flag(child, TIF_SYSCALL_EMU);
+ child->ptrace |= (addr & PTRACE_VM_MASK) << 28;
} else if (request == PTRACE_SYSCALL) {
set_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
clear_tsk_thread_flag(child, TIF_SYSCALL_EMU);
@@ -690,6 +701,9 @@
if (!(current->ptrace & PT_PTRACED))
goto out;

+ if (entryexit && (current->ptrace & PT_VM_SKIPEXIT))
+ return 0;
+
/* If a process stops on the 1st tracepoint with SYSCALL_TRACE
* and then is resumed with SYSEMU_SINGLESTEP, it will come in
* here. We have to check this and return */
@@ -717,7 +731,7 @@
send_sig(current->exit_code, current, 1);
current->exit_code = 0;
}
- ret = is_sysemu;
+ ret = (is_sysemu || (!entryexit && (current->ptrace & PT_VM_SKIPCALL)));
out:
if (unlikely(current->audit_context) && !entryexit)
audit_syscall_entry(current, AUDIT_ARCH_I386, regs->orig_eax,
diff -Naur linux-2.6.17-rc1-access/arch/powerpc/kernel/entry_32.S linux-2.6.17-rc1-pmulti-ptvm/arch/powerpc/kernel/entry_32.S
--- linux-2.6.17-rc1-access/arch/powerpc/kernel/entry_32.S 2006-04-03 05:22:10.000000000 +0200
+++ linux-2.6.17-rc1-pmulti-ptvm/arch/powerpc/kernel/entry_32.S 2006-04-03 11:30:53.000000000 +0200
@@ -279,6 +279,7 @@
stw r0,_TRAP(r1)
addi r3,r1,STACK_FRAME_OVERHEAD
bl do_syscall_trace_enter
+ mr r10,r3
lwz r0,GPR0(r1) /* Restore original registers */
lwz r3,GPR3(r1)
lwz r4,GPR4(r1)
@@ -287,6 +288,8 @@
lwz r7,GPR7(r1)
lwz r8,GPR8(r1)
REST_NVGPRS(r1)
+ cmpwi r10,0
+ bne- ret_from_syscall
b syscall_dotrace_cont

syscall_exit_work:
diff -Naur linux-2.6.17-rc1-access/arch/powerpc/kernel/entry_64.S linux-2.6.17-rc1-pmulti-ptvm/arch/powerpc/kernel/entry_64.S
--- linux-2.6.17-rc1-access/arch/powerpc/kernel/entry_64.S 2006-04-03 05:22:10.000000000 +0200
+++ linux-2.6.17-rc1-pmulti-ptvm/arch/powerpc/kernel/entry_64.S 2006-04-03 11:30:53.000000000 +0200
@@ -195,6 +195,7 @@
bl .save_nvgprs
addi r3,r1,STACK_FRAME_OVERHEAD
bl .do_syscall_trace_enter
+ mr r11,r3
ld r0,GPR0(r1) /* Restore original registers */
ld r3,GPR3(r1)
ld r4,GPR4(r1)
@@ -205,6 +206,8 @@
addi r9,r1,STACK_FRAME_OVERHEAD
clrrdi r10,r1,THREAD_SHIFT
ld r10,TI_FLAGS(r10)
+ cmpwi r11,0
+ bne- syscall_exit
b syscall_dotrace_cont

syscall_enosys:
diff -Naur linux-2.6.17-rc1-access/arch/powerpc/kernel/ptrace.c linux-2.6.17-rc1-pmulti-ptvm/arch/powerpc/kernel/ptrace.c
--- linux-2.6.17-rc1-access/arch/powerpc/kernel/ptrace.c 2006-04-03 05:22:10.000000000 +0200
+++ linux-2.6.17-rc1-pmulti-ptvm/arch/powerpc/kernel/ptrace.c 2006-04-03 11:30:53.000000000 +0200
@@ -338,15 +338,28 @@
break;
}

+ case PTRACE_SYSVM:
+ if (addr == PTRACE_VM_TEST) {
+ ret = PTRACE_VM_MASK;
+ break;
+ }
+
case PTRACE_SYSCALL: /* continue and stop at next (return from) syscall */
case PTRACE_CONT: { /* restart after signal. */
ret = -EIO;
if (!valid_signal(data))
break;
- if (request == PTRACE_SYSCALL)
- set_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
- else
+ child->ptrace &= ~PT_VM_MASK;
+ if (request == PTRACE_CONT)
clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
+ else {
+ set_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
+ if (request == PTRACE_SYSVM) {
+ /* set PT_VM_SKIPCALL and PT_VM_SKIPEXIT by
+ * one operation */
+ child->ptrace |= (addr & PTRACE_VM_MASK) << 28;
+ }
+ }
child->exit_code = data;
/* make sure the single step bit is not set. */
clear_single_step(child);
@@ -527,7 +540,7 @@
}
}

-void do_syscall_trace_enter(struct pt_regs *regs)
+int do_syscall_trace_enter(struct pt_regs *regs)
{
#ifdef CONFIG_PPC64
secure_computing(regs->gpr[0]);
@@ -547,6 +560,7 @@
regs->gpr[0],
regs->gpr[3], regs->gpr[4],
regs->gpr[5], regs->gpr[6]);
+ return (current->ptrace & PT_VM_SKIPCALL);
}

void do_syscall_trace_leave(struct pt_regs *regs)
@@ -562,7 +576,8 @@

if ((test_thread_flag(TIF_SYSCALL_TRACE)
|| test_thread_flag(TIF_SINGLESTEP))
- && (current->ptrace & PT_PTRACED))
+ && (current->ptrace & PT_PTRACED) &&
+ ((current->ptrace & PT_VM_SKIPEXIT)==0))
do_syscall_trace();
}

diff -Naur linux-2.6.17-rc1-access/arch/ppc/kernel/entry.S linux-2.6.17-rc1-pmulti-ptvm/arch/ppc/kernel/entry.S
--- linux-2.6.17-rc1-access/arch/ppc/kernel/entry.S 2006-04-03 05:22:10.000000000 +0200
+++ linux-2.6.17-rc1-pmulti-ptvm/arch/ppc/kernel/entry.S 2006-04-03 11:30:53.000000000 +0200
@@ -279,6 +279,7 @@
stw r0,TRAP(r1)
addi r3,r1,STACK_FRAME_OVERHEAD
bl do_syscall_trace_enter
+ mr r10,r3
lwz r0,GPR0(r1) /* Restore original registers */
lwz r3,GPR3(r1)
lwz r4,GPR4(r1)
@@ -287,6 +288,8 @@
lwz r7,GPR7(r1)
lwz r8,GPR8(r1)
REST_NVGPRS(r1)
+ cmpwi r10,0
+ bne- ret_from_syscall
b syscall_dotrace_cont

syscall_exit_work:
diff -Naur linux-2.6.17-rc1-access/arch/um/drivers/ubd_kern.c linux-2.6.17-rc1-pmulti-ptvm/arch/um/drivers/ubd_kern.c
--- linux-2.6.17-rc1-access/arch/um/drivers/ubd_kern.c 2006-04-03 05:22:10.000000000 +0200
+++ linux-2.6.17-rc1-pmulti-ptvm/arch/um/drivers/ubd_kern.c 2006-04-03 11:30:53.000000000 +0200
@@ -1064,6 +1064,7 @@
"errno = %d\n", -n);
}
}
+ return 0;
}

static int ubd_getgeo(struct block_device *bdev, struct hd_geometry *geo)
diff -Naur linux-2.6.17-rc1-access/arch/um/include/kern_util.h linux-2.6.17-rc1-pmulti-ptvm/arch/um/include/kern_util.h
--- linux-2.6.17-rc1-access/arch/um/include/kern_util.h 2006-04-03 05:22:10.000000000 +0200
+++ linux-2.6.17-rc1-pmulti-ptvm/arch/um/include/kern_util.h 2006-04-03 11:30:53.000000000 +0200
@@ -70,7 +70,7 @@
extern void paging_init(void);
extern void init_flush_vm(void);
extern void *syscall_sp(void *t);
-extern void syscall_trace(union uml_pt_regs *regs, int entryexit);
+extern int syscall_trace(union uml_pt_regs *regs, int entryexit);
extern int hz(void);
extern void uml_idle_timer(void);
extern unsigned int do_IRQ(int irq, union uml_pt_regs *regs);
diff -Naur linux-2.6.17-rc1-access/arch/um/kernel/ptrace.c linux-2.6.17-rc1-pmulti-ptvm/arch/um/kernel/ptrace.c
--- linux-2.6.17-rc1-access/arch/um/kernel/ptrace.c 2006-04-03 05:22:10.000000000 +0200
+++ linux-2.6.17-rc1-pmulti-ptvm/arch/um/kernel/ptrace.c 2006-04-03 11:30:53.000000000 +0200
@@ -82,6 +82,12 @@
ret = poke_user(child, addr, data);
break;

+ case PTRACE_SYSVM:
+ if (addr == PTRACE_VM_TEST) {
+ ret = PTRACE_VM_MASK;
+ break;
+ }
+
case PTRACE_SYSCALL: /* continue and stop at next (return from) syscall */
case PTRACE_CONT: { /* restart after signal. */
ret = -EIO;
@@ -89,11 +95,17 @@
break;

set_singlestepping(child, 0);
- if (request == PTRACE_SYSCALL) {
- set_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
+ child->ptrace &= ~PT_VM_MASK;
+ if (request == PTRACE_CONT) {
+ clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
}
else {
- clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
+ set_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
+ if (request == PTRACE_SYSVM) {
+ /* set PT_VM_SKIPCALL and PT_VM_SKIPEXIT by
+ * one operation */
+ child->ptrace |= (addr & PTRACE_VM_MASK) << 28;
+ }
}
child->exit_code = data;
wake_up_process(child);
@@ -268,7 +280,7 @@
/* XXX Check PT_DTRACE vs TIF_SINGLESTEP for singlestepping check and
* PT_PTRACED vs TIF_SYSCALL_TRACE for syscall tracing check
*/
-void syscall_trace(union uml_pt_regs *regs, int entryexit)
+int syscall_trace(union uml_pt_regs *regs, int entryexit)
{
int is_singlestep = (current->ptrace & PT_DTRACE) && entryexit;
int tracesysgood;
@@ -292,10 +304,13 @@
send_sigtrap(current, regs, 0);

if (!test_thread_flag(TIF_SYSCALL_TRACE))
- return;
+ return 0;

if (!(current->ptrace & PT_PTRACED))
- return;
+ return 0;
+
+ if (entryexit && (current->ptrace & PT_VM_SKIPEXIT))
+ return 0;

/* the 0x80 provides a way for the tracing parent to distinguish
between a syscall stop and SIGTRAP delivery */
@@ -313,4 +328,8 @@
send_sig(current->exit_code, current, 1);
current->exit_code = 0;
}
+ if (!entryexit && (current->ptrace & PT_VM_SKIPCALL))
+ return 1;
+ else
+ return 0;
}
diff -Naur linux-2.6.17-rc1-access/arch/um/kernel/skas/syscall.c linux-2.6.17-rc1-pmulti-ptvm/arch/um/kernel/skas/syscall.c
--- linux-2.6.17-rc1-access/arch/um/kernel/skas/syscall.c 2006-04-03 05:22:10.000000000 +0200
+++ linux-2.6.17-rc1-pmulti-ptvm/arch/um/kernel/skas/syscall.c 2006-04-03 11:30:53.000000000 +0200
@@ -18,12 +18,13 @@
struct pt_regs *regs = container_of(r, struct pt_regs, regs);
long result;
int syscall;
+ int skip_call;
#ifdef UML_CONFIG_SYSCALL_DEBUG
int index;

index = record_syscall_start(UPT_SYSCALL_NR(r));
#endif
- syscall_trace(r, 0);
+ skip_call=syscall_trace(r, 0);

current->thread.nsyscalls++;
nsyscalls++;
@@ -36,12 +37,14 @@
* gcc version 4.0.1 20050727 (Red Hat 4.0.1-5)
* in case it's a compiler bug.
*/
+ if (skip_call == 0) {
syscall = UPT_SYSCALL_NR(r);
if((syscall >= NR_syscalls) || (syscall < 0))
result = -ENOSYS;
else result = EXECUTE_SYSCALL(syscall, regs);

REGS_SET_SYSCALL_RETURN(r->skas.regs, result);
+ }

syscall_trace(r, 1);
#ifdef UML_CONFIG_SYSCALL_DEBUG
diff -Naur linux-2.6.17-rc1-access/arch/um/kernel/tt/syscall_kern.c linux-2.6.17-rc1-pmulti-ptvm/arch/um/kernel/tt/syscall_kern.c
--- linux-2.6.17-rc1-access/arch/um/kernel/tt/syscall_kern.c 2006-04-03 05:22:10.000000000 +0200
+++ linux-2.6.17-rc1-pmulti-ptvm/arch/um/kernel/tt/syscall_kern.c 2006-04-03 11:30:53.000000000 +0200
@@ -21,6 +21,7 @@
void *sc;
long result;
int syscall;
+ int skip_call;
#ifdef CONFIG_SYSCALL_DEBUG
int index;
#endif
@@ -33,11 +34,13 @@
index = record_syscall_start(syscall);
#endif

- syscall_trace(&regs->regs, 0);
+ skip_call=syscall_trace(&regs->regs, 0);

current->thread.nsyscalls++;
nsyscalls++;

+ if (skip_call == 0) {
+
if((syscall >= NR_syscalls) || (syscall < 0))
result = -ENOSYS;
else result = EXECUTE_SYSCALL(syscall, regs);
@@ -49,6 +52,7 @@

SC_SET_SYSCALL_RETURN(sc, result);

+ }
syscall_trace(&regs->regs, 1);
#ifdef CONFIG_SYSCALL_DEBUG
record_syscall_end(index, result);
diff -Naur linux-2.6.17-rc1-access/include/linux/ptrace.h linux-2.6.17-rc1-pmulti-ptvm/include/linux/ptrace.h
--- linux-2.6.17-rc1-access/include/linux/ptrace.h 2006-04-05 10:23:15.000000000 +0200
+++ linux-2.6.17-rc1-pmulti-ptvm/include/linux/ptrace.h 2006-04-04 08:59:36.000000000 +0200
@@ -1,6 +1,7 @@
#ifndef _LINUX_PTRACE_H
#define _LINUX_PTRACE_H
/* ptrace.h */
+#include <linux/config.h>
/* structs and defines to help the user use the ptrace system call. */

/* has the defines to get at the registers. */
@@ -20,6 +21,7 @@
#define PTRACE_DETACH 0x11

#define PTRACE_SYSCALL 24
+#define PTRACE_SYSVM 33

/* 0x4200-0x4300 are reserved for architecture-independent additions. */
#define PTRACE_SETOPTIONS 0x4200
@@ -32,6 +34,10 @@
#define PTRACE_POKECHARDATA 0x4302
#define PTRACE_PEEKSTRINGDATA 0x4303

+#ifdef CONFIG_VIEWOS
+#define PTRACE_VIEWOS 0x4000
+#endif
+
struct ptrace_multi {
long request;
long addr;
@@ -58,6 +64,18 @@
#define PTRACE_EVENT_VFORK_DONE 5
#define PTRACE_EVENT_EXIT 6

+/* options for PTRACE_SYSVM */
+#define PTRACE_VM_TEST 0x80000000
+#define PTRACE_VM_SKIPCALL 1
+#define PTRACE_VM_SKIPEXIT 2
+#define PTRACE_VM_MASK 0x00000003
+
+#ifdef CONFIG_VIEWOS
+/* options fpr PTRACE_VIEWOS */
+#define PT_VIEWOS_TEST 0x80000000
+#define PT_VIEWOS_MASK 0x00000000
+#endif
+
#include <asm/ptrace.h>

#ifdef __KERNEL__
@@ -77,6 +95,10 @@
#define PT_TRACE_EXIT 0x00000200
#define PT_ATTACHED 0x00000400 /* parent != real_parent */

+#define PT_VM_SKIPCALL 0x10000000
+#define PT_VM_SKIPEXIT 0x20000000
+#define PT_VM_MASK 0x30000000
+
#define PT_TRACE_MASK 0x000003f4

/* single stepping state bits (used on ARM and PA-RISC) */
diff -Naur linux-2.6.17-rc1-access/init/Kconfig linux-2.6.17-rc1-pmulti-ptvm/init/Kconfig
--- linux-2.6.17-rc1-access/init/Kconfig 2006-04-03 05:22:10.000000000 +0200
+++ linux-2.6.17-rc1-pmulti-ptvm/init/Kconfig 2006-04-03 11:30:53.000000000 +0200
@@ -225,6 +225,13 @@

If unsure, say N.

+config VIEWOS
+ bool "ViewOS Support (EXPERIMENTAL)"
+ help
+ This option will enable kernel support for ViewOS partial virtualization.
+
+ Say N if unsure.
+
source "usr/Kconfig"

config UID16
diff -Naur linux-2.6.17-rc1-access/kernel/ptrace.c linux-2.6.17-rc1-pmulti-ptvm/kernel/ptrace.c
--- linux-2.6.17-rc1-access/kernel/ptrace.c 2006-04-05 10:23:15.000000000 +0200
+++ linux-2.6.17-rc1-pmulti-ptvm/kernel/ptrace.c 2006-04-03 11:30:53.000000000 +0200
@@ -310,7 +310,7 @@
if (string) {
for (offset=0;offset<bytes;offset++)
if (buf[offset]==0)
- break;
+ break;
if (offset < bytes)
bytes=len=offset+1;
}
@@ -595,3 +595,7 @@
return ret;
}
#endif /* __ARCH_SYS_PTRACE */
+
+#ifdef CONFIG_VIEWOS
+#warning VIEWOS support not implemented yet
+#endif

2006-05-18 20:17:16

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

[email protected] (Renzo Davoli) writes:

> ptmulti kernel patch inserts a new useful option for ptrace() call,
> adding a new request type to ptrace() syscall.
>
> With PTRACE_MULTI option you can send multiple ptrace requests with a
> single system call: commonly a process that uses ptrace() needs
> several PTRACE_PEEKDATA for getting some useful, even small pieces of data.
> It is useful for these programs to run several ptrace
> operations while limiting the number of context switches.

What context switches do you mean? System calls? Linux is in general
designed to have very cheap system calls and they shouldn't be more tha
a few hundred cycles.

>
> Debuggers and virtual machines (like User Mode Linux) and many other
> applications that are based on ptrace can get great
> performance improvements by PTRACE_MULTI: the number of system
> calls (and context switches) decreases significantly.

You forgot to add numbers?

-Andi

2006-05-18 21:13:23

by Renzo Davoli

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

On Thu, May 18, 2006 at 10:17:08PM +0200, Andi Kleen wrote:
> > With PTRACE_MULTI option you can send multiple ptrace requests with a
> > single system call: commonly a process that uses ptrace() needs
> > several PTRACE_PEEKDATA for getting some useful, even small pieces of data.
> > It is useful for these programs to run several ptrace
> > operations while limiting the number of context switches.
>
> What context switches do you mean? System calls? Linux is in general
> designed to have very cheap system calls and they shouldn't be more tha
> a few hundred cycles.
Agree, but they are a few hundred cycles wasted.
This can be highly significative when you are working on a Virtual Machine
monitor.
In umview the monitor (umview process) and the process inside the VM
live in separate address spaces.
e.g. To virtualize a write you'd have to call PTRACE_PEEKDATA for each
word of the buffer, very many hundreds cycles lost.
> >
> > Debuggers and virtual machines (like User Mode Linux) and many other
> > applications that are based on ptrace can get great
> > performance improvements by PTRACE_MULTI: the number of system
> > calls (and context switches) decreases significantly.
>
> You forgot to add numbers?
They are obviously highly dependant on applications.

(test computer=tibook G4 1Ghz)
Umview+unreal test module/NO PTRACE_MULTI/NO PTRACE_SYSVM
$ time cp /unreal/usr/src/linux-source-2.6.16.tar.bz2 /tmp
real 0m22.626s
user 0m0.000s
sys 0m0.448s

Umview+unreal test module/NO_SYSVM WITH PTRACE_MULTI
$ time cp /unreal/usr/src/linux-source-2.6.16.tar.bz2 /tmp
real 0m1.850s
user 0m0.008s
sys 0m0.384s

Umview+unreal test module/SYSVM+MULTI
$ time cp /unreal/usr/src/linux-source-2.6.16.tar.bz2 /tmp
real 0m1.261s
user 0m0.012s
sys 0m0.392s

This is a test on the cost of umview virtualization with no modules.
Umview alone/NO SYSVM/NO PTRACE_MULTI
$ time cp /usr/src/linux-source-2.6.16.tar.bz2 /tmp
real 0m1.595s
user 0m0.000s
sys 0m0.548s

Umview alone/NO SYSVM (with ptrace_multi)
$ time cp /usr/src/linux-source-2.6.16.tar.bz2 /tmp
real 0m1.587s
user 0m0.008s
sys 0m0.596s

Umview alone/With SYSVM (and ptrace_multi)
$ time cp /usr/src/linux-source-2.6.16.tar.bz2 /tmp
real 0m1.056s
user 0m0.008s
sys 0m0.496s

The same command on the Linux kernel (NO umview)
$ time cp /usr/src/linux-source-2.6.16.tar.bz2 /tmp
real 0m0.429s
user 0m0.004s
sys 0m0.368s

This is a worst case run of umview as cp is absolutely IO bound (and
system call bound). The more computation a program has the more
efficient is the VM. The user mode VM costs in this case 2.5 times slowdown.

renzo

2006-05-19 02:23:16

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

On 5/18/06, Renzo Davoli <[email protected]> wrote:
> e.g. To virtualize a write you'd have to call PTRACE_PEEKDATA for each
> word of the buffer, very many hundreds cycles lost.

No, this is not how programs should do it. Just open /proc/PID/mem
and use pread() with an offset corresponding to the address. Now,
repeat your timings using this technique.

2006-05-19 09:07:29

by Renzo Davoli

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

On Thu, May 18, 2006 at 07:23:13PM -0700, Ulrich Drepper wrote:
> On 5/18/06, Renzo Davoli <[email protected]> wrote:
> >e.g. To virtualize a write you'd have to call PTRACE_PEEKDATA for each
> >word of the buffer, very many hundreds cycles lost.
>
> No, this is not how programs should do it. Just open /proc/PID/mem
> and use pread() with an offset corresponding to the address. Now,
> repeat your timings using this technique.

That would be faster to access the memory but:
- the manager has to keep one open file per controlled process
- when the Virtual Machine manager has to access the ptraced memory and
access its registers and finally restart the process, needs three system
calls instead of just one. (one pread/pwrite to /proc/PID/mem + at least one
ptrace PEEK/POKEUSR and a ptrace SYSCALL or SYSVM).
For a Virtual Machine manager three syscalls instead of one makes the
difference. The gap will be not so large with respect to the figures
of my previous message but there will be for sure.
- the read/write of several words of memory using ptrace do exist
implemented in an horribly tricky way for sparc/sparc64 (addr2 is taken
out from a register as an extra argument which is not part of ptrace
definition, see arch/sparc/kernel/ptrace.c). Our proposal gives also a
clean and efficient and general interface for the same feature.

renzo

2006-05-19 13:09:57

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

On Fri, May 19, 2006 at 11:07:26AM +0200, Renzo Davoli wrote:
> On Thu, May 18, 2006 at 07:23:13PM -0700, Ulrich Drepper wrote:
> > On 5/18/06, Renzo Davoli <[email protected]> wrote:
> > >e.g. To virtualize a write you'd have to call PTRACE_PEEKDATA for each
> > >word of the buffer, very many hundreds cycles lost.
> >
> > No, this is not how programs should do it. Just open /proc/PID/mem
> > and use pread() with an offset corresponding to the address. Now,
> > repeat your timings using this technique.
>
> That would be faster to access the memory but:
> - the manager has to keep one open file per controlled process

No, it doesn't. It can open it as needed. It can even maintain a
cache of open mem files.

GDB's been opening it as needed for years. It works very well and is
drastically faster than PTRACE_PEEKDATA.

--
Daniel Jacobowitz
CodeSourcery

2006-05-19 17:45:42

by Renzo Davoli

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

On Fri, May 19, 2006 at 09:09:52AM -0400, Daniel Jacobowitz wrote:
> On Fri, May 19, 2006 at 11:07:26AM +0200, Renzo Davoli wrote:
> > On Thu, May 18, 2006 at 07:23:13PM -0700, Ulrich Drepper wrote:
> > > On 5/18/06, Renzo Davoli <[email protected]> wrote:
> > > >e.g. To virtualize a write you'd have to call PTRACE_PEEKDATA for each
> > > >word of the buffer, very many hundreds cycles lost.
> > >
> > > No, this is not how programs should do it. Just open /proc/PID/mem
> > > and use pread() with an offset corresponding to the address. Now,
> > > repeat your timings using this technique.
> >
> > That would be faster to access the memory but:
> > - the manager has to keep one open file per controlled process
>
> No, it doesn't. It can open it as needed. It can even maintain a
> cache of open mem files.
>
> GDB's been opening it as needed for years. It works very well and is
> drastically faster than PTRACE_PEEKDATA.
>
Over all I could speed up just half of the calls because I cannot write
in /proc/<pid>/mem !
You are proposing a solution which speeds up writes but not reads.

(from fs/proc/base.c)
#define mem_write NULL

#ifndef mem_write
/* This is a security hazard */
static ssize_t mem_write(struct file * file, const char * buf,
size_t count, loff_t *ppos)
....
#endif

My proposals should not add any threats which is not already in
PTRACE_POKEDATA. Now, either the threat do currently exist and my
proposed patch makes is exploitable in a faster way, or it did not
exist and it still does not exist.
PTRACE_MULTI just executes several ptrace requests in a single call.

Other projects would benefit from a similar patch:
see: http://www.cs.wisc.edu/condor/doc/parrot-agm2003.pdf
http://www.cse.nd.edu/~dthain/papers/ibox-sc05.pdf
They had the very same problem.

renzo

2006-05-19 19:15:15

by Renzo Davoli

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

for the sake of completeness here are the numbers:

This was the previous result.
> (test computer=tibook G4 1Ghz)
> Umview+unreal test module/NO PTRACE_MULTI/NO PTRACE_SYSVM
> $ time cp /unreal/usr/src/linux-source-2.6.16.tar.bz2 /tmp
> real 0m22.626s
> user 0m0.000s
> sys 0m0.448s

This operation cannot use /proc/<pid>/mem as there is a "read" from
the virtual file system that has to write the buffer value into the
ptraced process ("cp") memory.

Let us try the reverse op.

****OLD WAY
Umview+unreal test module/NO PTRACE_MULTI/NO PTRACE_SYSVM (PTRACE, old
way)
$ time cp /usr/src/linux-source-2.6.16.tar.bz2 /unreal/tmp
real 0m16.039s
user 0m0.000s
sys 0m0.208s

****YOUR PROPOSAL WITHOUT/WITH SYSVM (that patch is independent).
Umview+unreal test module/NO PTRACE_MULTI/NO PTRACE_SYSVM (using
/proc/<pid>/mem)
$ time cp /usr/src/linux-source-2.6.16.tar.bz2 /unreal/tmp
real 0m1.649s
user 0m0.000s
sys 0m0.172s

Umview+unreal test module/NO_PTRACE_MULTI PTRACE_SYSVM (using
/proc/<pid>/mem)
$ time cp /usr/src/linux-source-2.6.16.tar.bz2 /unreal/tmp
real 0m1.185s
user 0m0.004s
sys 0m0.188s

****OUR PROPOSAL (PTRACE_MULTI instead of /proc/<pid>/mem (WO/W SYSVM)
Umview+unreal test module PTRACE_MULTI/NO PTRACE_SYSVM
$ time cp /usr/src/linux-source-2.6.16.tar.bz2 /unreal/tmp
real 0m1.500s
user 0m0.004s
sys 0m0.244s

Umview+unreal test module PTRACE_MULTI/PTRACE_SYSVM
$ time cp /usr/src/linux-source-2.6.16.tar.bz2 /unreal/tmp
real 0m0.983s
user 0m0.008s
sys 0m0.148s

All the experiments have been done three times. This is the best time
(always the third); the results would have had the same significance
taking the first or the second run figures, the difference in time would have
been a bit higher.

Anyway I think I'll put this possibility (to use /proc/<pid>/mem) inside umview
source code.
It is a speedup for umview on unpatched kernel, just a one way speedup,
but it can help. Thank you for the hint.
I think that our patch(es) would be useful anyway as the solution you
propose is not a solution at all.
In the best approximation this is a workaround that covers part of the
problems with almost the same (a bit poorer) performance.

renzo

2006-05-19 20:15:15

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

On Fri, May 19, 2006 at 07:45:34PM +0200, Renzo Davoli wrote:
> #ifndef mem_write
> /* This is a security hazard */

I believe the conclusion, when this was last discussed, was that this
is not true and could be fixed.

--
Daniel Jacobowitz
CodeSourcery

2006-05-19 20:17:48

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

On Friday 19 May 2006 22:15, Daniel Jacobowitz wrote:

> On Fri, May 19, 2006 at 07:45:34PM +0200, Renzo Davoli wrote:
> > #ifndef mem_write
> > /* This is a security hazard */
>
> I believe the conclusion, when this was last discussed, was that this
> is not true and could be fixed.

iirc the main problem was mmap of /proc/*/mem. write can be probably
enabled after some auditing.

Alan hacked on this iirc so he might comment.

-Andi

2006-05-20 06:44:03

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

On 5/19/06, Andi Kleen <[email protected]> wrote:
> Alan hacked on this iirc so he might comment.

Al Viro has a similar patch in the FC kernels to govern read access to
/proc/*/maps based on the ptrace permissions. This is probably the
same problem. If you can use ptrace on the target process there is no
security reason why /proc/*/mem access shouldn't be granted.

2006-05-20 15:17:10

by Alan

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

On Gwe, 2006-05-19 at 22:17 +0200, Andi Kleen wrote:
> > I believe the conclusion, when this was last discussed, was that this
> > is not true and could be fixed.
>
> iirc the main problem was mmap of /proc/*/mem. write can be probably
> enabled after some auditing.
>
> Alan hacked on this iirc so he might comment.

The stuff I hacked on was to solve the problem that "/proc/xxx/mem"
changed meaning while open. That is if you did opens on proc/self/mem
and passed the fd to someone they got *their own* /proc/self/mem.

That can cause mayhem if you do

fd = open /proc/self/mem
dup(fd, 2);
dup(fd, 1);
seek to right spot
exec setuid binary in a way it prints and self patches.

I think the general cases of write and mmap can probably be enabled with
care. Clearly you can do it via ptrace so therefore ptrace equivalent
permissions is a beginning point. Someone needs to audit the mm
implications carefully because the old DOSemu mmap of /proc/self trick
did break stuff and the write case might have similar problems.

Alan

2006-05-20 18:30:22

by Renzo Davoli

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

To summarize the thread we have agreed that
/proc/*/mem should be writable, at least with ptrace permissions.

Even reading from /proc/*/mem does not currently have the same permissions of
ptrace. E.g. when a setuid process is started under ptrace it runs
without the setuid semantics, thus it is possible to get/put data
using PTRACE_{PEEK,POKE}*.
There are no security threats as the process is running in an
unprivileged way, on the contrary this is a feature that allows
virtual machines to run setuid code, e.g. we use this feature to
run /bin/ping on virtual networks.
Instead it is not possible to read the memory through /proc/*/mem
in the same situation.
(In UMview -- see our cvs if you like -- to manage this exception
there is now a read from /proc/*/mem file and if the read fails it
rolls back to the standard PTRACE_PEEKDATA.)

Let me point out that PTRACE_MULTI is not only related to memory access.
We are using PTRACE_MULTI also to store the registers and restart the
execution of the ptraced process with a single syscall.
This is very effective when umview runs on a ppc32 architecture. In
fact, PPC_PTRACE_{G,S}ETREGS do not exist for that architecture
(IMHO there is no evident reason for that). Without PTRACE_MULTI each register
must be read/written individually by a PTRACE_{PEEK,POKE}USER(*)

PTRACE_MULTI can be also used to optimize many other virtualized calls,
e.g. to read/write all the buffers for a readv/writev/recvmsg/sendmsg
call at once.

Therefore I feel that /proc/*/mem access can help but PTRACE_MULTI
gives something more.

renzo

(*) two notes about PPC_PTRACE_{G,S}ETREGS for powerpc.
It is not clear to me why the same calls are okay for ppc64 and forbidden
for ppc32, all the statements inside this ifdef

arch/powerpc/kernel/ptrace.c: 407 #ifdef CONFIG_PPC64
arch/powerpc/kernel/ptrace.c: 408 case PPC_PTRACE_GETREGS: { /* Get GPRs 0 - 31. */
...

are meaningful for ppc32 too. I have not tested it yet, but maybe
deleting the #ifdef is enough to provide PPC_PTRACE_{G,S}ETREGS to
ppc32, too.
There is another detail. IMVHO in ppc64 architecture the security control
that forbids to change the PT_ORIG_R3 register by PTRACE_POKEUSER

arch/powerpc/kernel/ptrace.c: 329 if (index == PT_ORIG_R3)
arch/powerpc/kernel/ptrace.c: 330 break;

is circunvented by PPC_PTRACE_SETREGS that rewrites all the registers
including PT_ORIG_R3. (Maybe I am wrong but I haven't seen any
check about this).

2006-05-20 20:23:26

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

On 5/20/06, Renzo Davoli <[email protected]> wrote:
> Let me point out that PTRACE_MULTI is not only related to memory access.

You've already been told that syscalls are very fast. For any small
number of calls the overhead is neglectable. Only for possibly huge
numbers of calls (like those needed to transfer memory content) is the
overhead significant but that is irrelevant because of /proc/*/mem.

Whatever other problems you have (accessing multiple registers) is an
arch-specific problem. If this is a real problem talk to the arch
maintainer about adding a call to get all registers at once. This is
nothing which should be handled with a construct like hte
PTRACE_MULTI.

2006-05-20 21:42:54

by Jeff Dike

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

On Sat, May 20, 2006 at 08:30:20PM +0200, Renzo Davoli wrote:
> Let me point out that PTRACE_MULTI is not only related to memory access.
> We are using PTRACE_MULTI also to store the registers and restart the
> execution of the ptraced process with a single syscall.
> This is very effective when umview runs on a ppc32 architecture. In
> fact, PPC_PTRACE_{G,S}ETREGS do not exist for that architecture
> (IMHO there is no evident reason for that). Without PTRACE_MULTI each register
> must be read/written individually by a PTRACE_{PEEK,POKE}USER(*)

Wouldn't the obvious fix be to implement [GS]ETREGS for arches that don't
have them?

> PTRACE_MULTI can be also used to optimize many other virtualized calls,
> e.g. to read/write all the buffers for a readv/writev/recvmsg/sendmsg
> call at once.

Here, I bet the data copying cost dominates the system call, and the
syscall overhead is minimal.

Jeff

2006-05-21 08:03:19

by Peter Chubb

[permalink] [raw]
Subject: Re: ptrace enhancements for VM support (patch proposals follow in sep.msgs)

>>>>> "Renzo" == Renzo Davoli <[email protected]> writes:

Renzo> I am sending with three separate messages (as replies to this)
Renzo> a set of proposed patches for a better support of virtual
Renzo> machines through ptrace.

Goody. I'm working on a linux vritualisation project for IA64
(Linux-on-Linux, or LoL for short ... yes the acronym is chosen
deliberately).

I also have some `ptrace improvement' patches. some are the same as
yours (the one-shot syscall stop, for example).

One other that I found very useful for this kind of virtual machine,
is being able to specify an address range for the IP for syscalls to
stop on. So I can specify that the VM should stop only for syscalls
in the virtual machine, not in the signal trampoline or in the
hypervisro code.

You can get my patches from
http://www.ertos.nicta.com.au/software/virtualisation/lol.pml


They're not quite ready for prime time yet, but are getting pretty close.

Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia

2006-05-21 12:38:25

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

Jeff Dike wrote:
>> PTRACE_MULTI can be also used to optimize many other virtualized calls,
>> e.g. to read/write all the buffers for a readv/writev/recvmsg/sendmsg
>> call at once.
>>
>
> Here, I bet the data copying cost dominates the system call, and the
> syscall overhead is minimal.

In addition, the aio API allows you to do that in two calls for an iovec
of any size.

--
error compiling committee.c: too many arguments to function

2006-05-21 15:28:12

by Renzo Davoli

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

On Sat, May 20, 2006 at 05:39:59PM -0400, Jeff Dike wrote:
> Wouldn't the obvious fix be to implement [GS]ETREGS for arches that don't
> have them?
It is not enough. I am fixing the [GS]ETREGS for ppc32 but it happens
that the error number is stored in the register PT_CCR (a.k.a. R38)
so I need an extra call to get that, then I need the program counter which is
in PT_NIP (R31). [GS]ETREGS for ppc load/store just the range R0-R31.
so I need 3 syscalls for each syscall issued by the ptraced process
instead of just one.
Each architecture has its own idiosyncrasies so when somebody is trying to
write and efficient and *portable* support for virtual machines it happens
to have a piece of memory here, a register there etc that are needed
to support something.
I felt that giving a clear and effective way to do more ptrace requests
at once was the general solution.
Readv, writev were created to solve the performance problem to deal with calls
involving several buffers. All the programs based on readv/writev
can be implemented in a less efficient way through standard read/write.
This proposal had the same humble goal applied to ptrace instead of
read/write.

renzo

2006-05-22 13:02:34

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

On Sun, May 21, 2006 at 05:28:10PM +0200, Renzo Davoli wrote:
> It is not enough. I am fixing the [GS]ETREGS for ppc32 but it happens
> that the error number is stored in the register PT_CCR (a.k.a. R38)
> so I need an extra call to get that, then I need the program counter which is
> in PT_NIP (R31). [GS]ETREGS for ppc load/store just the range R0-R31.
> so I need 3 syscalls for each syscall issued by the ptraced process
> instead of just one.

Then have you considered changing the regset returned to be actually
useful? Especially for ppc32 where you say it was not previously
implemented?

--
Daniel Jacobowitz
CodeSourcery

2006-05-22 15:05:47

by Renzo Davoli

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

On Mon, May 22, 2006 at 09:02:22AM -0400, Daniel Jacobowitz wrote:
> On Sun, May 21, 2006 at 05:28:10PM +0200, Renzo Davoli wrote:
> > It is not enough. I am fixing the [GS]ETREGS for ppc32 but it happens
> > that the error number is stored in the register PT_CCR (a.k.a. R38)
> > so I need an extra call to get that, then I need the program counter which is
> > in PT_NIP (R31). [GS]ETREGS for ppc load/store just the range R0-R31.
> > so I need 3 syscalls for each syscall issued by the ptraced process
> > instead of just one.
>
> Then have you considered changing the regset returned to be actually
> useful? Especially for ppc32 where you say it was not previously
> implemented?
Then it would be inconsistent with ppc64 where it does exist, and ppc64
has the very same problem.
So the solution would be to patch also the ppc64 [GS]ETREGS breaking
compatibility with existing applications.

The MULTI proposal was a way to have a fast, simple, safe support.
Fast: one syscall does all
simple: it is a vector of calls with the params of std ptrace calls
safe: if is not a new call, the security checks for ptrace are already
in place
flexible: with [GS]ETREGS you can get only the registers you need
instead of all the registers (this is just an example).
PPC wants to read/write 32 registers, i386 17, x86_64 21 etc,
when maybe just some of the registers are meaningful to your
application. Using [GS]ETREGS, you have to save the entire register set
somewhere to restore them after some changes. This applies also to areas
of memory, and other ptrace commands.
backward compatible: if you did not use it, nothing changes in ptrace
support.

If you do not find this proposal interesting, I'll continue to support
it as a specific patch for umview. I am not here to "sell" any solution.
On the contrary I think it might be useful in many applications.

renzo

2006-05-22 15:26:14

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: [PATCH] 2-ptrace_multi

On Mon, May 22, 2006 at 05:05:44PM +0200, Renzo Davoli wrote:
> Then it would be inconsistent with ppc64 where it does exist, and ppc64
> has the very same problem.
> So the solution would be to patch also the ppc64 [GS]ETREGS breaking
> compatibility with existing applications.

Or use new ptrace operations for the full regsets; that is probably
wiser.

> The MULTI proposal was a way to have a fast, simple, safe support.
> Fast: one syscall does all

You've added copy_from_user to several operations which were previously
entirely register-based calling conventions. In at least some
configurations this will dwarf the cost of the system call trap.

> If you do not find this proposal interesting, I'll continue to support
> it as a specific patch for umview. I am not here to "sell" any solution.
> On the contrary I think it might be useful in many applications.

Well, I'm afraid that I don't find it interesting; and I don't think
GDB would make use of it.

--
Daniel Jacobowitz
CodeSourcery