2002-11-26 05:05:48

by Jeff Dike

Subject: uml-patch-2.5.49-1

This patch merges the rework that has been in my 2.4 pool for the last month
or so. I'm going to describe what happened in some detail since it hasn't
been discussed on lkml at all, and there are some generic kernel changes
involved which may be of wider interest.

The design of UML has had as its main points:
    every UML process has a corresponding host process
    the UML kernel is mapped into the top .5G of each process' address space
    there is a special thread, the tracing thread, which ptraces all the
      other threads, managing their transitions between the kernel and userspace

This is insecure, because protecting UML kernel data from its processes is
hard to do right, and impossible to do quickly. UML does have a 'jail' mode
which implements this protection, but it is many times slower than non-jail mode.

It's also slow, because every entry to the kernel involves a signal delivery to
the process entering the kernel and a signal return when leaving the kernel.

To fix these problems, I followed up an observation by Ingo a few months ago
that a full process context switch is several times faster than an in-process
signal delivery.

I implemented a new mode which puts the UML kernel into a completely separate
address space from its processes. The new mode is called skas ("separate
kernel address space"), and the traditional mode is now called tt ("tracing
thread"). skas mode has these main points:
    the kernel is in a separate process and address space from its processes
    UML processes share a single host process
    each UML process has its own host address space
    thus, the "userspace" process hops between address spaces on each UML
      context switch

skas mode has a number of advantages over the traditional tt mode:

better security - the kernel is in a different address space, so processes
can't even see, let alone modify, kernel data, because they can't form a kernel
address.

better performance - it is significantly faster. Kernel builds in skas mode
approach twice the speed of tt mode, and are ~30% slower than on the host
(compared to ~100% slower with tt mode).

better debuggability - it is now possible to 'gdb linux' and have it do what
you expect. Tools like gprof, gcov, and ddd should now just work, without
needing special support inside UML (although gprof currently needs the removal
of some of that support in order to work). It is possible to build UML as
a normal dynamically linked binary, which will make it possible to valgrind
the kernel (although valgrind is currently bothered by UML's use of clone).

cleaner code - process creation, switching, and destruction are far simpler,
cleaner, and faster in the skas code than in the tt code.

miscellaneous - UML process address spaces are now identical to those on the
host - this is advantageous for applications such as honeypots, as well as
possibly for applications which use the full 3G address space. The kernel
now has a full 3G of virtual address space.

There is one major disadvantage to skas mode - it can't be implemented given
the support currently in the stock kernel.

I've added some stuff into the generic and i386 code to make skas mode
possible. This support includes:
    /proc/mm, which allows address spaces to be created independently of
      processes
    a number of ptrace extensions

/proc/mm has the following semantics -
    open creates a new, empty address space - the file descriptor returned
      is used as a handle to that address space
    write modifies the address space according to the structure that is
      passed in as the buffer argument. The possible operations are map, unmap,
      protect, and copy segments. The first three are identical to mmap, munmap,
      and mprotect. The last is used to copy the arch-specific data associated
      with the mm_struct as part of cloning the address space.
    close drops the reference count of the address space, which normally
      frees it since UML will not have any processes running in it
The ptrace extensions are:
    PTRACE_FAULTINFO - returns the information associated with the child's
      most recent segfault
    PTRACE_SIGPENDING - returns the child's pending signal mask
    PTRACE_LDT - performs a modify_ldt on the child - this is really an
      address space operation and will be moved to /proc/mm at some point
    PTRACE_SWITCH_MM - switches the child from its current address space
      to the one associated with the file descriptor passed in with this call
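
To make the address space switch concrete, here is a hedged sketch of how a caller might issue the PTRACE_SWITCH_MM request from C. The request number and the choice of which ptrace argument carries the file descriptor are assumptions for illustration only; neither is stated in the post, and the function name is hypothetical.

```c
#include <errno.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Placeholder value, assumed for illustration; the post gives no number. */
#define SKETCH_PTRACE_SWITCH_MM 55

/*
 * Ask the host to move the traced child into the address space behind
 * mm_fd. Passing the descriptor in ptrace's data argument is an assumption.
 * Returns 0 on success, -1 with errno set on failure.
 */
int switch_child_mm(pid_t child, int mm_fd)
{
    if (mm_fd < 0) {
        errno = EBADF;  /* not a usable /proc/mm handle */
        return -1;
    }
    /* A host kernel without the patch does not know this request and
     * the call fails here. */
    if (ptrace(SKETCH_PTRACE_SWITCH_MM, child, (void *) 0,
               (void *) (long) mm_fd) < 0)
        return -1;
    return 0;
}
```

In skas mode a UML context switch would then amount to one such call naming the next process's address space, after which the single host "userspace" process resumes in it.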

The host support patch is available with all the other UML downloads at
http://user-mode-linux.sf.net/dl-sf.html

I welcome any comments on it. The /proc/mm write semantics are less than
ideal - I especially would like suggestions for improvements.

The 2.5.49 UML patch is available at
http://uml-pub.ists.dartmouth.edu/uml/uml-patch-2.5.49-1.bz2

For the other UML mirrors and other downloads, see
http://user-mode-linux.sourceforge.net/dl-sf.html

Other links of interest:

The UML project home page : http://user-mode-linux.sourceforge.net
The UML Community site : http://usermodelinux.org

Jeff


2002-11-26 05:50:25

by Andi Kleen

Subject: Re: uml-patch-2.5.49-1

Jeff Dike writes:
> main points:
> the kernel is in a separate process and address space from its processes
> UML processes share a single host process

Can you quickly describe why you didn't use one host process per uml
process ?

That would have avoided the need for a /proc/mm extension too I guess.

-Andi

2002-11-26 05:57:38

by Patrick Finnegan

Subject: Re: uml-patch-2.5.49-1

On 26 Nov 2002, Andi Kleen wrote:

> Jeff Dike <[email protected]> writes:
> > main points:
> > the kernel is in a separate process and address space from its processes
> > UML processes share a single host process
>
> Can you quickly describe why you didn't use one host process per uml
> process ?
>
> That would have avoided the need for a /proc/mm extension too I guess.

One reason I can think of is that it prevents 'stupid things' happening
under a copy of UML from killing the OS UML is running under... Eg. if a
process is running under UML because it's not trusted and then turns into
a forkbomb, you don't want that taking down the host OS.

Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu


2002-11-26 06:03:06

by Andi Kleen

Subject: Re: uml-patch-2.5.49-1

> One reason I can think of is that it prevents 'stupid things' happening
> under a copy of UML from killing the OS UML is running under... Eg. if a
> process is running under UML because it's not trusted and then turns into
> a forkbomb, you don't want that taking down the host OS.

You could limit that with an appropriate ulimit.

Also a 'mm-bomb' could be similarly deadly without appropriate host limits.

-Andi

2002-11-26 06:52:51

by Patrick Finnegan

Subject: Re: uml-patch-2.5.49-1

On Tue, 26 Nov 2002, Andi Kleen wrote:

> > One reason I can think of is that it prevents 'stupid things' happening
> > under a copy of UML from killing the OS UML is running under... Eg. if a
> > process is running under UML because it's not trusted and then turns into
> > a forkbomb, you don't want that taking down the host OS.
>
> You could limit that with an appropiate ulimit.
>
> Also a 'mm-bomb' could be similarly deadly without appropiate host limits.

That's just one example... the idea is that you want maximal separation
between the guest OS's apps and the host OS. Sort of like "VM" on IBM's
series of mainframe architectures. Of course, that's virtualization done
in hardware not in software, but the principles are the same; you want a
maximal amount of separation between the layers.

Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu


2002-11-26 07:00:57

by Andi Kleen

Subject: Re: uml-patch-2.5.49-1

Patrick Finnegan writes:

> That's just one example... the idea is that you want maximal separation
> between the guest OS's apps and the host OS. Sort of like "VM" on IBM's
> series of mainframe architecures. Of course, that's virtualization done
> in hardware not in software, but the principles are the same; you want a
> maximal amount of separation between the layers.

As an "idea" it doesn't make much sense to me. An mm does tie up
considerable amounts of unswappable host memory (page tables, mm_struct),
which could be used for a DoS without too many problems. The separation
you are asking for just isn't there with UML. The same applies to other
resources used by UML.

-Andi

2002-11-26 14:06:54

by jlnance

Subject: Re: uml-patch-2.5.49-1

Hi Jeff,
Sounds like you are doing some good things with UML. I particularly
like the fact that gdb will be easier to use.

On Tue, Nov 26, 2002 at 12:17:07AM -0500, Jeff Dike wrote:
> I welcome any comments on it. The /proc/mm write semantics are less than
> ideal - I especially would like suggestions for improvements.

I think /proc/mm would be better implemented as /dev/mm. It seems to
have a lot more functionality associated with it than most /proc files.

Thanks,

Jim

2002-11-26 16:38:16

by Rik van Riel

Subject: Re: uml-patch-2.5.49-1

On Tue, 26 Nov 2002, Patrick Finnegan wrote:

> That's just one example... the idea is that you want maximal separation
> between the guest OS's apps and the host OS. Sort of like "VM" on IBM's
> series of mainframe architecures. [snip]

That's a nice idea, but in practice you also want efficient
execution of processes in the virtual machines and a virtual
host implementation that's flexible and easy to maintain.

As usual, you can't have everything so you'll have to make
choices here and there. The end result will be a useful
compromise between all the different ideas...

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://guru.conectiva.com/
Current spamtrap: [email protected]

2002-11-26 18:21:10

by Jeff Dike

Subject: Re: uml-patch-2.5.49-1

Andi Kleen said:
> Can you quickly describe why you didn't use one host process per uml
> process ?
> That would have avoided the need for a /proc/mm extension too I guess.

Yes, it would have. And it was done that way during an intermediate stage
of the skas work.

There were a few reasons for /proc/mm -
    reduce host resource consumption - we still have the mm_struct and
      page tables, but we do eliminate the kernel stack, task struct, and
      associated data
    design cleanliness - a UP UML is inherently single-threaded, so it's
      pointless to have many host processes when only one of them can be
      running. A host process maps much more cleanly onto a UML processor than
      a UML process, and a UML process maps much more cleanly onto a host
      address space than a host process.
    code cleanliness - before /proc/mm, UML manipulated address spaces
      through ptrace (PTRACE_M{MAP,UNMAP,PROTECT}). This meant that a process
      needed to be used as a handle for the address space. Since there's no way
      of knowing what process in an address space is still going to exist when
      you want to swap it out, the UML mm_context was a list of task_structs.
      This made for some nasty-looking code to keep that list up-to-date. It
      also made for some nasty-looking locking against races between swapout
      and a thread exiting. Now, the UML mm_context is an int which holds a
      /proc/mm file descriptor. No lists, no races, it doesn't get any simpler.
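
The difference between the two mm_context shapes can be sketched roughly as follows. This is a simplified illustration, not code from the patch: the type names, field names, and the use of a pid and an "alive" flag in place of real task_structs are all hypothetical.

```c
#include <stddef.h>

/* Old scheme (simplified): every host process in the address space is a
 * potential handle, so the context has to track all of them. */
struct host_task {
    int pid;
    int alive;                 /* cleared when the thread exits */
    struct host_task *next;
};

struct mm_context_ptrace {
    struct host_task *tasks;
};

/* New scheme: the /proc/mm file descriptor is the handle. */
struct mm_context_skas {
    int mm_fd;
};

/* Old: search for a process that still exists; -1 if none is left. */
int ptrace_context_handle(const struct mm_context_ptrace *ctx)
{
    const struct host_task *t;

    for (t = ctx->tasks; t != NULL; t = t->next)
        if (t->alive)
            return t->pid;
    return -1;
}

/* New: nothing to search and nothing to keep up to date. */
int skas_context_handle(const struct mm_context_skas *ctx)
{
    return ctx->mm_fd;
}
```

The old lookup is also where the locking against a thread exiting would have to go, since a task could disappear between the search and its use; the descriptor-based version has no equivalent window.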

Some smaller reasons -
With the address space manipulations in /proc/mm rather than ptrace,
it is possible that a cleaner interface can be found for it. The current
/proc/mm write semantics are morally equivalent to the previous ptrace
interface, but there is hope that something better can be found. With ptrace,
there is no hope.

scheduling fairness - when you have a single-threaded app (in the sense
that there is only one active thread at any given time) that's spreading its
work over many threads, the thread that wants to run will compete unfairly
in the scheduler with another single-threaded app that is honestly doing all
of its work in one thread. The thread in the first app will have accumulated
much less time than the second app's thread, and so will have higher priority,
even though the two apps may have used the same amount of CPU. With /proc/mm,
UML gets much closer to one host process per processor, but doesn't quite
make it. There are two host processes, one running the kernel, one running
userspace. I'm trying to think of a way of merging them.

It's much nicer to look at ps or top on the host and see a few
(currently four) processes per UML rather than, say, 100.

An unexpected benefit - UML is noticeably faster with /proc/mm. That knocked
~10% off its kernel build time. With it doing a build about 40% slower than
the host, the 10% reduction in overall run time represents ~25% reduction in
UML's virtualization overhead.
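
The overhead arithmetic can be checked with a few lines of C. This sketch assumes the ~40% figure was measured after the /proc/mm change and takes the host build time as 1.0; the function name is made up for the example.

```c
/*
 * Fraction of virtualization overhead removed, given the slowdown relative
 * to the host after the change and the fractional cut in total run time.
 * The host's run time is normalized to 1.0.
 */
double overhead_reduction(double slowdown_after, double runtime_cut)
{
    double after = 1.0 + slowdown_after;          /* e.g. 1.40 */
    double before = after / (1.0 - runtime_cut);  /* run time before the cut */

    return (before - after) / (before - 1.0);
}
```

With overhead_reduction(0.40, 0.10) the run time before the change works out to about 1.56 times the host, so the overhead drops from about 0.56 to 0.40, a reduction of roughly 28%. If the 40% is instead read as the figure before the change, the same 10% cut removes 0.14 of a 0.40 overhead, or 35%. Either way the estimate of ~25% is in the right neighbourhood.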

Jeff

2002-11-26 18:22:34

by Jeff Dike

Subject: Re: uml-patch-2.5.49-1

jlnance said:
> I think /proc/mm would be better implemented as /dev/mm.

What major and minor numbers should I assign to it? And what would be
the point of giving it a major and minor, anyway?

Jeff

2002-11-26 19:34:48

by Andreas Dilger

Subject: Re: uml-patch-2.5.49-1

On Nov 26, 2002 13:29 -0500, Jeff Dike wrote:
> design cleanliness - a UP UML is inherently single-threaded, so it's
> pointless to have many host processes when only one of them can be running.
> A host process maps much more cleanly onto a UML processor than a UML process,
> and a UML process maps much more cleanly onto a host address space than a
> host process.

How does GDB now distinguish between UML processes? Previously, with
GDB and UML one would "det; att <host pid>" to trace another process.
Will there be equivalent functionality in the new setup?

I was just thinking about hacking the UML PID allocation code so that
the UML process PID == host process PID, so that it is easier to debug
multiple kernel threads (which are all called "kernel thread" and are
hard to align with a specific UML kernel thread).

Will SMP UML "just" be a matter of forking the host process and sharing
the /proc/mm file descriptors, along with a UML SMP scheduler and some
IPC to decide which host process is running each UML process?

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2002-11-26 20:57:37

by Jeff Dike

Subject: Re: uml-patch-2.5.49-1

Andreas Dilger said:
> How does GDB now distinguish between UML processes? Previously, with
> GDB and UML one would "det; att <host pid>" to trace another process.
> Will there be equivalent functionality in the new setup?

It now doesn't. What I'm considering is some function you can call from
gdb which would longjmp to the stack that you want to look at and execute
a breakpoint (or maybe just hit a breakpoint that was put there earlier).

That should give you equivalent functionality to the current det/att.

> Will SMP UML "just" be a matter of forking the host process and
> sharing the /proc/mm file descriptors, along with a UML SMP scheduler
> and some IPC to decide which host process is running each UML process?

Pretty much. It's basically the same as SMP in tt mode, except that starting
the idle threads will be slightly different.

Jeff

2002-11-27 02:22:09

by H. Peter Anvin

Subject: Re: uml-patch-2.5.49-1

Followup to: <[email protected]>
By author: Jeff Dike <[email protected]>
In newsgroup: linux.dev.kernel
>
> [email protected] said:
> > I think /proc/mm would be better implemented as /dev/mm.
>
> What major and minor numbers should I assign to it? And what would be
> the point of giving it a major and minor, anyway?
>

Access control, ability to work in a chroot, ...

For major/minor, this is presumably a misc device (major 10) or, if
you don't need module support, a kernel core device (major 1), and
write to [email protected] to have a minor number assigned.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2002-11-27 03:56:03

by Jeff Dike

Subject: Re: uml-patch-2.5.49-1

H. Peter Anvin said:
> Access control, ability to work in a chroot, ...

Point.

> For major/minor, this is presumably a misc device (major 10) or, if
> you don't need module support, a kernel core device (major 1), and
> write to [email protected] to have a minor number assigned.

If you think that this would be better as a misc device than a proc entry,
then I can certainly go along with that.

Jeff

2002-11-27 04:23:37

by H. Peter Anvin

Subject: Re: uml-patch-2.5.49-1

Jeff Dike wrote:
> [email protected] said:
>
>>Access control, ability to work in a chroot, ...
>
>
> Point.
>
>
>>For major/minor, this is presumably a misc device (major 10) or, if
>>you don't need module support, a kernel core device (major 1), and
>>write to [email protected] to have a minor number assigned.
>
>
> If you think that this would be better as a misc device than a proc entry,
> then I can certainly go along with that.
>

Absolutely. I think /proc is heavily overused as a really bad devfs.

-hpa



2002-11-27 04:57:51

by Jeff Dike

Subject: Re: uml-patch-2.5.49-1

H. Peter Anvin said:
> Absolutely. I think /proc is heavily overused as a really bad devfs.

OK, I'll send in the request, and make the switch in the next patch.

Jeff