[email protected] said:
> First, kudos to everyone who worked on user mode linux.
Thanks!
> Anyway, I was reading about the design of UML, and it seems to me that
> its performance could be improved by adding a split privilege concept
> to Linux processes. A "normal" process would be "privileged".
> However, to support things like UML, a new syscall could put the
> process into "unprivileged" mode, which would cause any traps or
> faults (like syscalls or SEGVs) to drop the process into "privileged"
> mode at a controlled entry point.
This is an interesting idea. All signals would have to drop you back into
privileged mode, and syscalls would invoke the SIGTRAP handler (I'm not
that fond of this, but it works and it's more or less the way syscall
interception is done now (the process SIGTRAP handler isn't called, but
the tracer is woken up with the child being sent a SIGTRAP)).
I was planning on adding a new slow syscall path (enabled with
PTRACE_SIGSYSCALL or something) which delivers a SIGTRAP to the process and
turns off PTRACE_SIGSYSCALL for the duration of the handler.
Your idea would result in basically the same code, but with a much more
sensible interface to it. Mine would add yet another wart to ptrace, making
it even more toadlike than it is now. The notion of two process privilege
levels is much cleaner and more general.
> Adding an extra bit to the mmap/
> mprotect protection flags could specify memory mappings only
> accessible from privileged mode.
And this knocks off another problem. This would allow UML to unmap kernel
text and data while in unprivileged mode without the huge performance penalty
it has now with mprotecting it by hand.
Though, since processes are normally in privileged mode, I would turn that
flag around and say that in unprivileged mode, only specially marked mappings
are available.
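To make the flipped flag concrete, here is a minimal user-space model of the visibility rule. It is only a sketch: MAP_UNPRIV_FLAG, enum priv_mode, and struct mapping are names made up for illustration, not existing mmap/mprotect bits or kernel structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag: not a real mmap/mprotect protection bit. */
#define MAP_UNPRIV_FLAG 0x1

enum priv_mode { MODE_PRIVILEGED, MODE_UNPRIVILEGED };

struct mapping {
    unsigned long start;   /* inclusive */
    unsigned long end;     /* exclusive */
    unsigned int flags;
};

/* Privileged mode sees every mapping; unprivileged mode sees only
 * mappings that were explicitly marked with MAP_UNPRIV_FLAG. */
static bool mapping_visible(const struct mapping *m, enum priv_mode mode)
{
    if (mode == MODE_PRIVILEGED)
        return true;
    return (m->flags & MAP_UNPRIV_FLAG) != 0;
}

/* Returns true if addr lies inside a mapping visible in the given mode. */
static bool addr_accessible(const struct mapping *maps, size_t n,
                            unsigned long addr, enum priv_mode mode)
{
    for (size_t i = 0; i < n; i++) {
        assert(maps[i].start < maps[i].end);
        if (addr >= maps[i].start && addr < maps[i].end)
            return mapping_visible(&maps[i], mode);
    }
    return false;   /* unmapped address */
}
```

With this default, a kernel mapping created without the flag simply disappears when the process drops to unprivileged mode, and nothing has to be mprotected by hand.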
This possibly ties in well with something else I have planned. By adding
an interface to create, manipulate, and destroy address spaces, it will
be possible for one thread to have a pool of address spaces available to it
which it can switch between as needed. This will allow UML to have one
host thread per virtual processor (instead of one per UML thread, currently)
and one address space per UML thread, and switch the one host process from
address space to address space on each context switch. This would solve
a bunch of UML problems in one shot.
Another thing I was trying to figure out how to do cleanly once this is
working is putting the UML kernel in its own address space. This would
give UML processes the full 3G address space they expect and make UML
completely invisible to them. Of course, the problem is how do you switch
address spaces on every signal and system call.
Let's say there's a new system call, unprivilege(), and it optionally takes
an address space handle (which would be a file descriptor). Then, any
switch back to privileged mode would first switch back to that address space.
This seems clean to me.
One thing that's unclear to me is how you enter unprivileged mode in the
first place. I guess you'd specify a procedure in the unprivilege() and
it would be called in unprivileged mode.
This looks like a very good idea to me and it seems like it would cleanly
solve a bunch of problems for UML.
Jeff
It sounds like there are a couple of good ideas here. Let me add my
refinements.
new_addr(); /* to get a secondary address space */
struct sandbox_params {
        int return_reason;
        int return_data;
        int eax;
        int ebx;
};
run_sandbox(int address_space, struct sandbox_params *params); /* to start a sandbox */
int fmmap(int address_space, void *start, size_t length, int prot,
          int flags, int fd, off_t offset);
int fmunmap(int address_space, void *start, size_t length);
With the secondary address spaces being set up completely by UML.
And run_sandbox being the entry/exit point. The nice thing here is
that because they would share the same kernel stack/process most
registers can be left in registers. With run_sandbox putting as much
as possible on a fast path.
And then new_addr, fmmap, fmunmap would be all that you would really
need to manipulate those address spaces.
Usually processors only support a kernel/user space differentiation in
their page tables, and they sometimes support caching multiple address
spaces simultaneously in their TLBs. So I have designed this
interface to take advantage of the common processor features, and
additionally to look as much like normal process execution as possible.
Any other implementation would need someone to manually modify the page
tables, either the kernel or UML calling mprotect.
Any trap taken in the sandboxed address space should fill the
appropriate fields in struct sandbox_params and switch address spaces
back to the master process.
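As a rough sketch of what the master side might do when run_sandbox returns, the dispatcher below inspects the filled-in params and decides how to proceed. The return_reason codes, the exit_action enum, and the function names are assumptions made up for illustration; the proposal does not define them. The only borrowed facts are the i386 conventions (system call number and result in eax) and getpid being system call 20 there.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Layout copied from the proposal. */
struct sandbox_params {
    int return_reason;
    int return_data;
    int eax;
    int ebx;
};

/* Hypothetical values: the proposal does not define return_reason codes. */
enum { SANDBOX_SYSCALL = 1, SANDBOX_SIGNAL = 2 };

/* i386 system call number for getpid. */
#define GUEST_NR_GETPID 20

enum exit_action { ACT_RESUME, ACT_DELIVER_SIGNAL, ACT_STOP };

/* Emulate the trapped system call on behalf of the sandboxed code.
 * i386 convention: number in eax on entry, result in eax on return. */
static void emulate_syscall(struct sandbox_params *p, int guest_pid)
{
    switch (p->eax) {
    case GUEST_NR_GETPID:
        p->eax = guest_pid;
        break;
    default:
        p->eax = -ENOSYS;   /* anything this toy does not know about */
        break;
    }
}

/* Decide what to do after run_sandbox() has returned to the master. */
static enum exit_action handle_sandbox_exit(struct sandbox_params *p,
                                            int guest_pid)
{
    assert(p != NULL);
    switch (p->return_reason) {
    case SANDBOX_SYSCALL:
        emulate_syscall(p, guest_pid);
        return ACT_RESUME;
    case SANDBOX_SIGNAL:
        /* Assumption: return_data would carry the signal number. */
        return ACT_DELIVER_SIGNAL;
    default:
        return ACT_STOP;
    }
}
```

The master would call handle_sandbox_exit() after each return from run_sandbox and re-enter the sandbox whenever it gets ACT_RESUME.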
This interface is as cheap as I can imagine making it. And with
a little care it can be really optimized on the kernel side if UML
becomes a common case.
Eric
I wrote up my thoughts on secondary address spaces on the uml-devel list
(http://www.geocrawler.com/lists/3/SourceForge/709/75/7527174/). I'll try
to be somewhat briefer here.
[email protected] said:
> new_addr(); /* to get a secondary address space */
I asked Linus about this at the kernel summit last year, and he said he
wanted a filesystem interface, not a set of new system calls.
So, my proposed interface (modulo the names, which I welcome improvements to):
Open of:
/proc/mm - returns a file descriptor referring to a new, empty mm_struct
/proc/self/mm - returns a file descriptor referring to current->mm
/proc/<pid>/mm - returns a file descriptor referring to the mm of process <pid>
> int fmmap(int address_space, void *start, size_t length, int prot,
> int flags, int fd, off_t offset);
> int fmunmap(int address_space, void *start, size_t length);
My proposal for this is to extend mmap:
void *new_mmap(void *start, size_t length, int prot, int flags,
int src_fd, off_t offset, int dest_fd);
The new thing is the addition of dest_fd, which refers to the object within
which the new mapping is to be made. dest_fd == -1 refers to the current
address space. I intend for dest_fd to be an address space descriptor, but
it seems to make sense for it to be anything that supports mmap.
munmap and mprotect would be similarly extended.
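A user-space sketch of the extended calls, to show the intended shape. Only the dest_fd == -1 case can actually be implemented today, by forwarding to the ordinary mmap and munmap; any other descriptor fails with ENOSYS, since address space descriptors do not exist. The name new_munmap and the ENOSYS behaviour are assumptions for illustration.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Proposed signature: like mmap, plus dest_fd naming the object that
 * receives the mapping. dest_fd == -1 means the current address space. */
static void *new_mmap(void *start, size_t length, int prot, int flags,
                      int src_fd, off_t offset, int dest_fd)
{
    if (dest_fd == -1)
        return mmap(start, length, prot, flags, src_fd, offset);

    /* Mapping into another address space needs kernel support. */
    errno = ENOSYS;
    return MAP_FAILED;
}

/* munmap extended the same way (hypothetical name). */
static int new_munmap(void *start, size_t length, int dest_fd)
{
    if (dest_fd == -1)
        return munmap(start, length);

    errno = ENOSYS;
    return -1;
}
```

Existing callers would keep their behaviour by passing -1, e.g. new_mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0, -1).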
> run_sandbox(int address_space, struct sandbox_params *params); /* to
> start a sandbox */
> And run_sandbox being the entry/exit point.
Are you saying that you'd call run_sandbox to switch address spaces and
enter unprivileged mode, and when you re-enter privileged mode, the run_sandbox
call returns in the original address space with a bunch of information in
params?
If so, then
> The nice thing here is
> that because they would share the same kernel stack/process most
> registers can be left in registers.
is wrong. You need to preserve two kernel contexts, so you need two kernel
stacks. The run_sandbox context is obviously one. The unprivileged code
would also need to enter the kernel to fill in the sandbox params and force a
return from run_sandbox. Also, depending on the arch, CPU traps run on the
current kernel stack.
Otherwise, I like this idea.
But, I don't like mixing the process privilege and address space ideas
together like this. UML, like i386, grabs the top of the address space
for itself (it grabs 0xa0000000 - 0xc0000000), so user/kernel space transitions
don't require an address space switch. To require that a
privileged/unprivileged transition also switch address spaces will put a
speed limit on UML system calls.
That's why I like the idea of maps that are only available in privileged mode.
Turning on some mappings seems a lot cheaper than a full address space switch.
Having said that, I like the address space switch to be available as an
option. There are some UML applications (like honeypots) where having the
UML kernel in a totally different address space would be very useful.
Somewhat unrelated, but another thing I've been thinking about is whether
the process privilege idea could be used to implement strace. One difficulty
is that strace wants to record system calls, not nullify them as UML does.
There doesn't seem to be any room for allowing the system call or signal
handler to proceed in unprivileged mode.
Jeff
Jeff Dike <[email protected]> writes:
> I wrote up my thoughts on secondary address spaces on the uml-devel list
> (http://www.geocrawler.com/lists/3/SourceForge/709/75/7527174/). I'll try
> to be somewhat briefer here.
O.k. the summary pretty much matches what I was thinking. If I get
deep into it I'll have to read that article.
> [email protected] said:
> > new_addr(); /* to get a secondary address space */
>
> I asked Linus about this at the kernel summit last year, and he said he
> wanted a filesystem interface, not a set of new system calls.
I guess I can see that.
> So, my proposed interface (modulo the names, which I welcome improvements to):
>
> Open of:
> /proc/mm - returns a file descriptor referring to a new, empty mm_struct
> /proc/self/mm - returns a file descriptor referring to current->mm
> /proc/<pid>/mm - returns a file descriptor referring to the mm of process <pid>
>
> > int fmmap(int address_space, void *start, size_t length, int prot,
> > int flags, int fd, off_t offset);
> > int fmunmap(int address_space, void *start, size_t length);
>
> My proposal for this is to extend mmap:
>
> void *new_mmap(void *start, size_t length, int prot, int flags,
> int src_fd, off_t offset, int dest_fd);
>
> The new thing is the addition of dest_fd, which refers to the object within
> which the new mapping is to be made. dest_fd == -1 refers to the current
> address space. I intend for dest_fd to be an address space descriptor, but
> it seems to make sense for it to be anything that supports mmap.
Currently that is only address spaces.
> munmap and mprotect would be similarly extended.
Right.
> > run_sandbox(int address_space, struct sandbox_params *params); /* to
> > start a sandbox */
> > And run_sandbox being the entry/exit point.
>
> Are you saying that you'd call run_sandbox to switch address spaces and
> enter unprivileged mode, and when you re-enter privileged mode, the run_sandbox
> call returns in the original address space with a bunch of information in
> params?
That is what I was thinking.
> If so, then
>
> > The nice thing here is
> > that because they would share the same kernel stack/process most
> > registers can be left in registers.
>
> is wrong. You need to preserve two kernel contexts, so you need two kernel
> stacks.
???
The run_sandbox idea is essentially what the current vm86 does except
a little more optimized... I admit I left out a value for the
instruction pointer in the sandbox_params (which would necessarily be
address space dependent).
For whatever state run_sandbox would need to return to the original
address space, it could simply be stored on the kernel stack. I really
don't see why you would need two kernel contexts.
> The run_sandbox context is obviously one. The unprivileged code
> would also need to enter the kernel to fill in the sandbox params and force a
> return from run_sandbox. Also, depending on the arch, CPU traps run on the
> current kernel stack.
A trap or whatever is part of the return for the run_sandbox context.
The current vm86 system call already does something similar to this.
The 8 general purpose registers are saved and restored but the
floating point registers are passed through.
The tricky addition would be having multiple address spaces per
process.
> Otherwise, I like this idea.
I still don't see why it takes two kernel stacks to pull this off.
> But, I don't like mixing the process privilege and address space ideas
> together like this. UML, like i386, grabs the top of the address space
> for itself (it grabs 0xa0000000 - 0xc0000000), so user/kernel space transitions
> don't require an address space switch. To require that a
> privileged/unprivileged transition also switch address spaces will put a
> speed limit on UML system calls.
I will agree that it is sane for the run_sandbox command to work on
the current address space as well. But I don't like the idea of an
implied mprotect. As currently there isn't any hardware to implement
it and I don't see why anyone would make such hardware I think that
part should stay as two calls.
The only reason I can see for an implied mprotect is if executing the
mprotect keeps you from executing the run_sandbox command...
> That's why I like the idea of maps that are only available in privileged mode.
> Turning on some mappings seems a lot cheaper than a full address space switch.
Maybe I'm confused. The only cost of an address space switch is the tlb flush
and reload cost. (granted that is significant for short code stretches). But
more modern architectures are implementing address space numbers or their
kin so they can keep multiple address spaces in the tlb at once. With
address space numbers an address space switch is practically free.
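A toy model of the difference, for illustration only. It is not how any particular CPU implements its TLB; the entry count, the round-robin replacement, and all names are assumptions. The point it shows is that with tagged entries a switch only changes the current address space number, while without tags every switch throws the cached translations away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TLB_ENTRIES 8

struct tlb_entry {
    bool valid;
    unsigned asid;        /* address space number that owns the entry */
    unsigned long vpn;    /* virtual page number */
    unsigned long pfn;    /* physical frame number */
};

struct tlb {
    struct tlb_entry e[TLB_ENTRIES];
    unsigned cur_asid;
    bool tagged;          /* true: entries survive address space switches */
    size_t next;          /* round-robin replacement cursor */
};

static void tlb_init(struct tlb *t, bool tagged)
{
    for (size_t i = 0; i < TLB_ENTRIES; i++)
        t->e[i].valid = false;
    t->cur_asid = 0;
    t->tagged = tagged;
    t->next = 0;
}

/* Install a translation for the current address space. */
static void tlb_fill(struct tlb *t, unsigned long vpn, unsigned long pfn)
{
    struct tlb_entry *e = &t->e[t->next];

    t->next = (t->next + 1) % TLB_ENTRIES;
    e->valid = true;
    e->asid = t->cur_asid;
    e->vpn = vpn;
    e->pfn = pfn;
}

/* Hit only if the entry belongs to the current address space. */
static bool tlb_lookup(const struct tlb *t, unsigned long vpn,
                       unsigned long *pfn)
{
    for (size_t i = 0; i < TLB_ENTRIES; i++) {
        const struct tlb_entry *e = &t->e[i];

        if (e->valid && e->asid == t->cur_asid && e->vpn == vpn) {
            *pfn = e->pfn;
            return true;
        }
    }
    return false;
}

/* Untagged TLBs must flush on every switch; tagged ones just retag. */
static void tlb_switch(struct tlb *t, unsigned asid)
{
    if (!t->tagged)
        for (size_t i = 0; i < TLB_ENTRIES; i++)
            t->e[i].valid = false;
    t->cur_asid = asid;
}
```

In the tagged case, switching away and back leaves the original translations usable, which is the "practically free" switch described above. In the untagged case the same round trip costs a full reload.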
The only performance hit I see is with the copy_to_user,
copy_from_user routines.
> Having said that, I like the address space switch to be available as an
> option. There are some UML applications (like honeypots) where having the
> UML kernel in a totally different address space would be very useful.
>
> Somewhat unrelated, but another thing I've been thinking about is whether
> the process privilege idea could be used to implement strace. One difficulty
> is that strace wants to record system calls, not nullify them as UML does.
> There doesn't seem to be any room for allowing the system call or signal
> handler to proceed in unprivileged mode.
Except by totally emulating it, which is fairly invasive, but it is
good for a real sandbox case where we want to make decisions.
I suspect there is some happy compromise case.
Eric
[email protected] said:
> For whatever state run_sandbox would need to return to the original
> address space, it could simply be stored on the kernel stack. I really
> don't see why you would need two kernel contexts.
Yeah, it didn't occur to me until later that the run_sandbox state could
be stored without occupying a kernel stack.
Unless I'm missing something, the implementation of run_sandbox would be
that it stores the privileged context away somewhere, restores the
unprivileged context (which would then have to be a full register set),
and returns to userspace. So, the privileged context couldn't be on the
kernel stack. It would have to be in (or hanging off) the task structure
or something.
> The tricky addition would be having multiple address spaces per
> process.
It's not multiple address spaces per process so much as it is treating
address spaces as objects completely separate from processes, and processes
can switch between them as they see fit. So, by opening up some other
process's /proc/<pid>/mm and switching to it (assuming permissions were OK),
you could invade that address space. You would immediately segfault or
something because your registers would be all wrong, but the switch itself
would work fine.
> But I don't like the idea of an implied mprotect. As currently there
> isn't any hardware to implement it and I don't see why anyone would
> make such hardware I think that part should stay as two calls.
Hardware isn't the issue. Atomicity is. I want the UML kernel to disappear
when it switches to userspace, just as the native kernel does. In order to
do this, the unprivilege-ing and unmapping of the UML kernel have to happen
in the same system call. Similarly, the remapping has to happen at the same
time as the return from unprivilege(). That's why I like the idea of having
(say) MAP_UNPRIV mappings be the only ones available in unprivileged mode.
> The only cost of an address space switch is the tlb flush and reload
> cost. (granted that is significant for short code stretches). But
> more modern architectures are implementing address space numbers or
> their kin so they can keep multiple address spaces in the tlb at once.
> With address space numbers an address space switch is practically
> free.
You may be right. I'm not an expert on this.
I'd like to keep these two ideas separate since they seem separately useful,
but still have them interact where it makes sense (i.e. creating an unprivileged
context in a different address space).
> The only performance hit I see is with the copy_to_user,
> copy_from_user routines.
Yeah, but that wouldn't be much of an issue for UML since, if necessary, it
can do a virt_to_phys and grab data from physical memory, which will be in
its address space.
> Except by totally emulating it, which is fairly invasive, but it is
> good for a real sandbox case where we want to make decisions.
Yeah, it's perfect for UML. The reason I'm interested in strace is that if
this can be demonstrated to cleanly replace pieces of ptrace, I think it
will have an automatic fan club.
I don't see why you can't allow an unprivileged context to just continue,
so the system call will proceed or the signal will be delivered, which would
be fine for strace (at least starting the process from scratch, not sure what
to do about attaching to a running process).
To me, this is suggesting an fcntl/ptrace-like interface for performing
various operations on unprivileged contexts:
status = unprivilege(UNPRIV_CREATE, sp, proc, arg, context) - would create a
        new unprivileged context running proc(arg) on the stack pointed to by
        sp, very similar to clone. context is a buffer large enough to hold
        the userspace state of the context. This will return immediately with
        the context filled in.
We have forward compatibility issues with the size of that context buffer
potentially needing to grow as registers are added, and old binaries overflowing
their static small buffers. So, we have a call to ask how big the buffer
should be:
size = unprivilege(UNPRIV_BUFSIZE)
status = unprivilege(UNPRIV_RUN, context) - runs the unprivileged context
        contained in the context buffer. Returns some kind of status when
        the context makes a system call or receives a signal. The context
        buffer also contains information about the event that prompted the
        return. Maybe add some flags to indicate that we're only interested
        in some types of events.
status = unprivilege(UNPRIV_CANCEL, context) - cancels the pending event that
        caused the return to privileged context. The system call doesn't
        happen or the signal isn't delivered. Returns as UNPRIV_RUN does.
UML would do something like this:
size = unprivilege(UNPRIV_BUFSIZE);
context = kmalloc(size, GFP_KERNEL);
/* proc is a little stub that branches into userspace */
status = unprivilege(UNPRIV_CREATE, sp, proc, arg, context);
/* Start it going */
unprivilege(UNPRIV_RUN, context);
while(1){
        /* read the system call or signal out of context and either run the
         * system call in UML or handle the signal
         */
        /* cancel the syscall or signal */
        unprivilege(UNPRIV_CANCEL, context);
}
strace would do pretty much the same thing, except it would call UNPRIV_RUN
instead of UNPRIV_CANCEL in the loop.
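Since unprivilege() doesn't exist, the difference between the two loops can only be shown against a mock. The sketch below fakes the unprivileged context with a scripted list of system call numbers and counts how many events were allowed to proceed and how many were cancelled. Everything here (mock_unprivilege, struct mock_context, the UNPRIV_EVENT/UNPRIV_EXITED status values, and the loop terminating when the script runs out) is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum unpriv_op { UNPRIV_RUN, UNPRIV_CANCEL };

/* Hypothetical status values; the proposal only says "some kind of status". */
enum unpriv_status { UNPRIV_EVENT, UNPRIV_EXITED };

struct mock_context {
    const int *events;   /* scripted syscall numbers the context attempts */
    size_t n_events;
    size_t next;         /* index of the next event to report */
    int pending;         /* nonzero while an event awaits RUN or CANCEL */
    size_t proceeded;    /* events that were allowed to happen */
    size_t cancelled;    /* events that were nullified */
};

/* Resolve the pending event according to op, then "run" the context
 * until its next scripted event, or report that it has finished. */
static enum unpriv_status mock_unprivilege(enum unpriv_op op,
                                           struct mock_context *c)
{
    if (c->pending) {
        if (op == UNPRIV_CANCEL)
            c->cancelled++;
        else
            c->proceeded++;
        c->pending = 0;
    }
    if (c->next == c->n_events)
        return UNPRIV_EXITED;
    c->next++;
    c->pending = 1;
    return UNPRIV_EVENT;
}

/* The supervisor loop: reply is UNPRIV_CANCEL for a UML-style
 * supervisor and UNPRIV_RUN for an strace-style one. */
static void supervise(struct mock_context *c, enum unpriv_op reply)
{
    enum unpriv_status st = mock_unprivilege(UNPRIV_RUN, c); /* start it */

    while (st == UNPRIV_EVENT) {
        /* A real supervisor would examine c->events[c->next - 1] here:
         * UML would emulate it, strace would log it. */
        st = mock_unprivilege(reply, c);
    }
}
```

Calling supervise() with UNPRIV_CANCEL leaves every scripted event in the cancelled count, and with UNPRIV_RUN leaves every one in the proceeded count. That is the whole difference between the UML and strace loops.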
What do you think about this?
Jeff