2006-03-24 17:20:21

by Kirill Korotaev

Subject: [RFC] Virtualization steps

Eric, Herbert,

I think it is quite clear, that without some agreement on all these
virtualization issues, we won't be able to commit anything good to
mainstream. My idea is to gather our efforts to get consensus on most
clean parts of code first and commit them one by one.

The proposal is quite simple. We have 4 parties in this conversation
(maybe more?): IBM guys, OpenVZ, VServer and Eric Biederman. We discuss
the areas which should be considered step by step. Send patches for each
area, discuss, come to some agreement and all 4 parties Sign-Off the
patch. After that it goes to Andrew/Linus. Worth trying?

So far, (correct me if I'm wrong) we concluded that some people don't
want containers as a whole, but want some subsystem namespaces. I
suppose for people who care about containers only it doesn't matter, so
we can proceed with namespaces, yeah?

So the easiest namespaces to discuss, as I see it, are:
- utsname
- sys IPC
- network virtualization
- netfilter virtualization

All of these have already been discussed to some extent, and it looks
like there are no fundamental differences in our approaches (at least
between OpenVZ and Eric, for sure).

Right now, I suggest concentrating on the first 2 namespaces - utsname and
sysvipc. They are small enough and easy. Let's consider them without the
sysctl/proc issues, as those can be resolved later. I sent the patches
for these 2 namespaces to all of you. I really hope for some _good_
criticism, so we can work it out quickly.
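To give a rough idea of the shape such a patch takes (an illustrative
sketch only, with hypothetical names -- not the posted patches
themselves), a utsname namespace boils down to something like:

#include <linux/kref.h>
#include <linux/utsname.h>

/* Sketch only: one refcounted copy of the uts data per namespace. */
struct uts_namespace {
        struct kref kref;               /* shared on clone(), dropped on exit */
        struct new_utsname name;        /* sysname, nodename, domainname, ... */
};

/* sys_sethostname(), sys_newuname() and friends would then read and
 * write the caller's copy instead of the single global system_utsname. */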

Thanks,
Kirill


2006-03-24 18:54:43

by Nick Piggin

Subject: Re: [RFC] Virtualization steps

Kirill Korotaev wrote:
> Eric, Herbert,
>
> I think it is quite clear, that without some agreement on all these
> virtualization issues, we won't be able to commit anything good to
> mainstream. My idea is to gather our efforts to get consensus on most
> clean parts of code first and commit them one by one.
>
> The proposal is quite simple. We have 4 parties in this conversation
> (maybe more?): IBM guys, OpenVZ, VServer and Eric Biederman. We discuss
> the areas which should be considered step by step. Send patches for each
> area, discuss, come to some agreement and all 4 parties Sign-Off the
> patch. After that it goes to Andrew/Linus. Worth trying?

Oh, after you come to an agreement and start posting patches, can you
also outline why we want this in the kernel (what it does that low
level virtualization doesn't, etc, etc), and how and why you've agreed
to implement it. Basically, some background and a summary of your
discussions for those who can't follow everything. Or is that a faq
item?

Thanks,
Nick

--
SUSE Labs, Novell Inc.

2006-03-24 19:27:04

by Eric W. Biederman

Subject: Re: [RFC] Virtualization steps

Kirill Korotaev <[email protected]> writes:

> Eric, Herbert,
>
> I think it is quite clear, that without some agreement on all these
> virtualization issues, we won't be able to commit anything good to
> mainstream. My idea is to gather our efforts to get consensus on most clean
> parts of code first and commit them one by one.
>
> The proposal is quite simple. We have 4 parties in this conversation (maybe
> more?): IBM guys, OpenVZ, VServer and Eric Biederman. We discuss the areas which
> should be considered step by step. Send patches for each area, discuss, come to
> some agreement and all 4 parties Sign-Off the patch. After that it goes to
> Andrew/Linus. Worth trying?

Yes, this sounds like a path forward that has a reasonable chance of
making progress.

> So far, (correct me if I'm wrong) we concluded that some people don't want
> containers as a whole, but want some subsystem namespaces. I suppose for people
> who care about containers only it doesn't matter, so we can proceed with
> namespaces, yeah?

Yes, I think at one point I have seen all of the major parties receptive
to the concept.

> So the most easy namespaces to discuss I see:
> - utsname
> - sys IPC
> - network virtualization
> - netfilter virtualization

The networking is hard simply because there is so very much of it, and it
is being actively developed :)

> all these were discussed already somehow and looks like there is no fundamental
> differencies in our approaches (at least OpenVZ and Eric, for sure).

Yes. I think we agree on what the semantics should be for these parts,
which should avoid the problem we have with the pid namespace.

> Right now, I suggest to concentrate on first 2 namespaces - utsname and
> sysvipc. They are small enough and easy. Lets consider them without sysctl/proc
> issues, as those can be resolved later. I sent the patches for these 2
> namespaces to all of you. I really hope for some _good_ critics, so we could
> work it out quickly.

Sounds like a plan.

Eric

2006-03-24 19:27:34

by Dave Hansen

Subject: Re: [RFC] Virtualization steps

On Sat, 2006-03-25 at 04:33 +1100, Nick Piggin wrote:
> Oh, after you come to an agreement and start posting patches, can you
> also outline why we want this in the kernel (what it does that low
> level virtualization doesn't, etc, etc)

Can you wait for an OLS paper? ;)

I'll summarize it this way: low-level virtualization uses resources
inefficiently.

With this higher-level stuff, you get to share all of the Linux caching,
and can do things like sharing libraries pretty naturally.

They are also much lighter-weight to create and destroy than full
virtual machines. We were planning on doing some performance
comparisons versus some hypervisors like Xen and the ppc64 one to show
scaling with the number of virtualized instances. Creating 100 of these
Linux containers is as easy as a couple of shell scripts, but we still
can't find anybody crazy enough to go create 100 Xen VMs.

Anyway, those are the things that came to my mind first. I'm sure the
others involved have their own motivations.

-- Dave

2006-03-24 19:56:15

by Eric W. Biederman

Subject: Re: [RFC] Virtualization steps

Dave Hansen <[email protected]> writes:

> On Sat, 2006-03-25 at 04:33 +1100, Nick Piggin wrote:
>> Oh, after you come to an agreement and start posting patches, can you
>> also outline why we want this in the kernel (what it does that low
>> level virtualization doesn't, etc, etc)
>
> Can you wait for an OLS paper? ;)
>
> I'll summarize it this way: low-level virtualization uses resource
> inefficiently.
>
> With this higher-level stuff, you get to share all of the Linux caching,
> and can do things like sharing libraries pretty naturally.

Also, it is a major enabler for things such as process migration
between kernels.

> They are also much lighter-weight to create and destroy than full
> virtual machines. We were planning on doing some performance
> comparisons versus some hypervisors like Xen and the ppc64 one to show
> scaling with the number of virtualized instances. Creating 100 of these
> Linux containers is as easy as a couple of shell scripts, but we still
> can't find anybody crazy enough to go create 100 Xen VMs.

One of my favorite test cases is to kill about 100 of them
simultaneously :)

I think on a reasonably beefy dual processor machine I should be able
to get about 1000 of them running all at once.

> Anyway, those are the things that came to my mind first. I'm sure the
> others involved have their own motivations.

The practical aspect is that several groups have found the arguments
compelling enough that they have already done complete
implementations. At which point getting us all to agree on a common
implementation is important. :)

Eric

2006-03-24 21:19:20

by Herbert Poetzl

Subject: Re: [RFC] Virtualization steps

On Fri, Mar 24, 2006 at 08:19:59PM +0300, Kirill Korotaev wrote:
> Eric, Herbert,
>
> I think it is quite clear, that without some agreement on all these
> virtualization issues, we won't be able to commit anything good to
> mainstream. My idea is to gather our efforts to get consensus on most
> clean parts of code first and commit them one by one.
>
> The proposal is quite simple. We have 4 parties in this conversation
> (maybe more?): IBM guys, OpenVZ, VServer and Eric Biederman. We
> discuss the areas which should be considered step by step. Send
> patches for each area, discuss, come to some agreement and all 4
> parties Sign-Off the patch. After that it goes to Andrew/Linus.
> Worth trying?

sounds good to me, as long as we do not consider
the patches 'final' atm .. because I think we should
try to test them with _all_ currently existing solutions
first ... we do not need to bother Andrew with stuff
which doesn't work for the existing and future 'users'.

so IMHO, we should make a kernel branch (Eric or Sam
are probably willing to maintain that), which we keep
in-sync with mainline (not necessarily git, but at
least snapshot wise), where we put all the patches
we agree on, and each party should then adjust the
existing solution to this kernel, so we get some deep
testing in the process, and everybody can see if it
'works' for him or not ...

things where we agree that it 'just works' for everyone
can always be handed upstream, and would probably make
perfect patches for Andrew ...

> So far, (correct me if I'm wrong) we concluded that some people don't
> want containers as a whole, but want some subsystem namespaces. I
> suppose for people who care about containers only it doesn't matter, so
> we can proceed with namespaces, yeah?

yes, the emphasis here should be on lightweight and
modular, so that those folks interested in full featured
containers can just 'assemble' the pieces, while those
desiring service/space isolation pick their subsystems
one by one ...

> So the most easy namespaces to discuss I see:
> - utsname

yes, that's definitely one we can start with, as it seems
that we already have _very_ similar implementations

> - sys IPC

this is something which is also related to limits and
should get special attention with resource sharing,
isolation and control in mind

> - network virtualization

here I see many issues, as for example Linux-VServer
does not necessarily aim for full virtualization, when
simple and performant isolation is sufficient.

don't get me wrong, we are _not_ against network
virtualization per se, but isolation is just so
much simpler to administrate and often much more
performant, so it is very interesting for service
separation as well as security applications

just consider the 'typical' service isolation aspect
where you want to have two apaches, separated on two
IPs, but communicating with a single sql database

> - netfilter virtualization

same as for network virtualization, but not really
an issue if it can be 'disabled'

of course, the ideal solution would be some kind
of hybrid, where you can have virtual interfaces as
well as isolated IPs, side-by-side ...

> all these were discussed already somehow and looks like there is no
> fundamental differencies in our approaches (at least OpenVZ and Eric,
> for sure).
>
> Right now, I suggest to concentrate on first 2 namespaces - utsname
> and sysvipc. They are small enough and easy. Lets consider them
> without sysctl/proc issues, as those can be resolved later. I sent the
> patches for these 2 namespaces to all of you. I really hope for some
> _good_ critics, so we could work it out quickly.

will look into them soon ...

best,
Herbert

> Thanks,
> Kirill

2006-03-27 18:48:49

by Eric W. Biederman

Subject: Re: [RFC] Virtualization steps

Herbert Poetzl <[email protected]> writes:

> On Fri, Mar 24, 2006 at 08:19:59PM +0300, Kirill Korotaev wrote:
>> Eric, Herbert,
>>
>> I think it is quite clear, that without some agreement on all these
>> virtualization issues, we won't be able to commit anything good to
>> mainstream. My idea is to gather our efforts to get consensus on most
>> clean parts of code first and commit them one by one.
>>
>> The proposal is quite simple. We have 4 parties in this conversation
>> (maybe more?): IBM guys, OpenVZ, VServer and Eric Biederman. We
>> discuss the areas which should be considered step by step. Send
>> patches for each area, discuss, come to some agreement and all 4
>> parties Sign-Off the patch. After that it goes to Andrew/Linus.
>> Worth trying?
>
> sounds good to me, as long as we do not consider
> the patches 'final' atm .. because I think we should
> try to test them with _all_ currently existing solutions
> first ... we do not need to bother Andrew with stuff
> which doesn't work for the existing and future 'users'.
>
> so IMHO, we should make a kernel branch (Eric or Sam
> are probably willing to maintain that), which we keep
> in-sync with mainline (not necessarily git, but at
> least snapshot wise), where we put all the patches
> we agree on, and each party should then adjust the
> existing solution to this kernel, so we get some deep
> testing in the process, and everybody can see if it
> 'works' for him or not ...

ACK. A collection of patches that we can all agree
on sounds like something worth aiming for.

It looks like Kirill's last round of patches can form
a nucleus for that. So far I have seen plenty of technical
objections, but none to the general direction.

So agreement appears possible.

Eric

2006-03-28 04:28:46

by Bill Davidsen

Subject: Re: [RFC] Virtualization steps

Dave Hansen wrote:
> On Sat, 2006-03-25 at 04:33 +1100, Nick Piggin wrote:
>> Oh, after you come to an agreement and start posting patches, can you
>> also outline why we want this in the kernel (what it does that low
>> level virtualization doesn't, etc, etc)
>
> Can you wait for an OLS paper? ;)
>
> I'll summarize it this way: low-level virtualization uses resource
> inefficiently.
>
> With this higher-level stuff, you get to share all of the Linux caching,
> and can do things like sharing libraries pretty naturally.
>
> They are also much lighter-weight to create and destroy than full
> virtual machines. We were planning on doing some performance
> comparisons versus some hypervisors like Xen and the ppc64 one to show
> scaling with the number of virtualized instances. Creating 100 of these
> Linux containers is as easy as a couple of shell scripts, but we still
> can't find anybody crazy enough to go create 100 Xen VMs.

But these require a modified O/S, do they not? Or do I read that
incorrectly? Is this going to be real virtualization able to run any O/S?

Frankly I don't see running 100 VMs as a realistic goal, being able to
run Linux, Windows, Solaris and BEOS unmodified in 4-5 VMs would be far
more useful.
>
> Anyway, those are the things that came to my mind first. I'm sure the
> others involved have their own motivations.
>
> -- Dave
>

2006-03-28 05:30:47

by Sam Vilain

Subject: Re: [RFC] Virtualization steps

On Mon, 2006-03-27 at 23:28 -0500, Bill Davidsen wrote:
> Frankly I don't see running 100 VMs as a realistic goal, being able to
> run Linux, Windows, Solaris and BEOS unmodified in 4-5 VMs would be far
> more useful.

You misunderstand this approach. It is not about VMs at all. Any VM
approach is the "big hammer" of virtualisation; we are more interested
in a big bag of very precise tools to virtualise one subsystem at a
time.

Sam.

2006-03-28 06:45:40

by Kir Kolyshkin

Subject: Re: [Devel] Re: [RFC] Virtualization steps

Bill Davidsen wrote:

> Dave Hansen wrote:
>
>> On Sat, 2006-03-25 at 04:33 +1100, Nick Piggin wrote:
>>
>>> Oh, after you come to an agreement and start posting patches, can you
>>> also outline why we want this in the kernel (what it does that low
>>> level virtualization doesn't, etc, etc)
>>
>>
>> Can you wait for an OLS paper? ;)
>>
>> I'll summarize it this way: low-level virtualization uses resource
>> inefficiently.
>>
>> With this higher-level stuff, you get to share all of the Linux caching,
>> and can do things like sharing libraries pretty naturally.
>>
>> They are also much lighter-weight to create and destroy than full
>> virtual machines. We were planning on doing some performance
>> comparisons versus some hypervisors like Xen and the ppc64 one to show
>> scaling with the number of virtualized instances. Creating 100 of these
>> Linux containers is as easy as a couple of shell scripts, but we still
>> can't find anybody crazy enough to go create 100 Xen VMs.
>
>
> But these require a modified O/S, do they not? Or do I read that
> incorrectly? Is this going to be real virtualization able to run any O/S?

This type is called OS-level virtualization, or kernel-level
virtualization, or partitioning. Basically, it allows one to create
compartments (in OpenVZ we call them VEs -- Virtual Environments) in
which you can run a full *unmodified* Linux system (except for the kernel
itself -- there is one single kernel common to all compartments). That
means that with this approach you cannot run OSs other than Linux, but
different Linux distributions work just fine.

> Frankly I don't see running 100 VMs as a realistic goal

It is actually not a future goal, but rather a reality. Since os-level
virtualization overhead is very low (1-2 per cent or so), one can run
hundreds of VEs.

Say, on a box with 1GB of RAM, OpenVZ [http://openvz.org/] is able to run
about 150 VEs, each one having init, apache (serving static content),
sendmail, sshd, cron etc. running. Actually you can run more, but with
aggressive swapping, so performance drops considerably. So it mostly
depends on RAM, and I'd say that 500+ VEs on a 4GB box should run
just fine. Of course it all depends on what you run inside those VEs.

> , being able to run Linux, Windows, Solaris and BEOS unmodified in 4-5
> VMs would be far more useful.

This is a different story. If you want to run different OSs on the same
box -- use emulation or paravirtualization.

If you are happy to stick to Linux on this box -- use OS-level
virtualization. Aside from the best possible scalability and
performance, the other benefit of this approach is dynamic resource
management -- since there is a single kernel managing all the resources
such as RAM, you can easily tune all those resources at runtime. What is
more, you can let one VE use more RAM while nobody else is using it,
leading to much better resource usage. And since there is one single
kernel that manages everything, you can do nice tricks like VE
checkpointing, live migration, etc.

Some more info on this topic is available from
http://openvz.org/documentation/tech/

Kir.

>>
>> Anyway, those are the things that came to my mind first. I'm sure the
>> others involved have their own motivations.
>>
>> -- Dave
>>
>


2006-03-28 08:52:09

by Herbert Poetzl

Subject: Re: [RFC] Virtualization steps

On Mon, Mar 27, 2006 at 11:28:12PM -0500, Bill Davidsen wrote:
> Dave Hansen wrote:
> >On Sat, 2006-03-25 at 04:33 +1100, Nick Piggin wrote:
> >>Oh, after you come to an agreement and start posting patches, can you
> >>also outline why we want this in the kernel (what it does that low
> >>level virtualization doesn't, etc, etc)
> >
> >Can you wait for an OLS paper? ;)
> >
> >I'll summarize it this way: low-level virtualization uses resource
> >inefficiently.
> >
> >With this higher-level stuff, you get to share all of the Linux caching,
> >and can do things like sharing libraries pretty naturally.
> >
> >They are also much lighter-weight to create and destroy than full
> >virtual machines. We were planning on doing some performance
> >comparisons versus some hypervisors like Xen and the ppc64 one to show
> >scaling with the number of virtualized instances. Creating 100 of these
> >Linux containers is as easy as a couple of shell scripts, but we still
> >can't find anybody crazy enough to go create 100 Xen VMs.
>
> But these require a modified O/S, do they not? Or do I read that
> incorrectly? Is this going to be real virtualization able to run any
> O/S?

Xen requires slightly modified kernels, while e.g.
Linux-VServer only uses a _single_ kernel for all
virtualized guests ...

> Frankly I don't see running 100 VMs as a realistic goal, being able to
> run Linux, Windows, Solaris and BEOS unmodified in 4-5 VMs would be
> far more useful.

well, that largely depends on the 'use' ...

I don't think that vps providers like lycos would be
very happy if they had to multiply the amount of
machines they require by 10 or 20 :)

and yes, running 100 and more Linux-VServers on a
single machine _is_ realistic ...

best,
Herbert

> >Anyway, those are the things that came to my mind first. I'm sure the
> >others involved have their own motivations.
> >
> >-- Dave
> >

2006-03-28 08:51:37

by Kirill Korotaev

Subject: Re: [RFC] Virtualization steps

>> so IMHO, we should make a kernel branch (Eric or Sam
>> are probably willing to maintain that), which we keep
>> in-sync with mainline (not necessarily git, but at
>> least snapshot wise), where we put all the patches
>> we agree on, and each party should then adjust the
>> existing solution to this kernel, so we get some deep
>> testing in the process, and everybody can see if it
>> 'works' for him or not ...
>
> ACK. A collection of patches that we can all agree
> on sounds like something worth aiming for.
>
> It looks like Kirill last round of patches can form
> a nucleus for that. So far I have seem plenty of technical
> objects but no objections to the general direction.
yup, I will fix everything and come back with a set of patches for IPC,
so we can select which way is better to do it :)

> So agreement appears possible.
Nice to hear this!

Eric, we have a GIT repo on openvz.org already:
http://git.openvz.org

we will create a separate branch also called -acked, where patches
agreed upon will go.

Thanks,
Kirill

2006-03-28 09:00:37

by Kirill Korotaev

Subject: Re: [RFC] Virtualization steps

> Frankly I don't see running 100 VMs as a realistic goal, being able to
> run Linux, Windows, Solaris and BEOS unmodified in 4-5 VMs would be far
> more useful.
It is more than realistic. Hosting companies run more than 100 VPSs in
reality. There are also other useful scenarios. For example, I know of
universities which run a VPS for every faculty web site, for every
department, the mail server and so on. Why do you think they want to run
only 5 VMs on one machine? Much more!

Thanks,
Kirill

2006-03-28 09:02:21

by Kirill Korotaev

Subject: Re: [RFC] Virtualization steps

> Oh, after you come to an agreement and start posting patches, can you
> also outline why we want this in the kernel (what it does that low
> level virtualization doesn't, etc, etc), and how and why you've agreed
> to implement it. Basically, some background and a summary of your
> discussions for those who can't follow everything. Or is that a faq
> item?
Nick, I will be glad to shed some light on it.

First of all, what it does that low-level virtualization can't:
- it allows running 100 containers on 1GB of RAM
  (they are called containers, VEs - Virtual Environments,
  or VPSs - Virtual Private Servers).
- it has very little overhead (<1-2%), unlike hardware virtualization,
  where overhead is unavoidable. For example, Xen has >20% overhead on
  disk I/O.
- it allows creating/deploying a VE in less than a minute; VE start/stop
  takes ~1-2 seconds.
- it allows dynamically changing all resource limits/configurations.
  In OpenVZ it is even possible to add/remove virtual CPUs to/from a VE.
  It is possible to increase/decrease memory limits on the fly etc.
- it has much more efficient memory usage, with a single template file
  in the cache if a COW-like filesystem is used for VE templates.
- it allows you to access VE files from the host easily if needed.
  This makes management much more flexible, e.g. you can
  upgrade/repair/fix all your VEs from the host, i.e. easy mass management.


OS kernel virtualization
~~~~~~~~~~~~~~~~~~~~~~~~
OS virtualization is a kernel solution which replaces the usage
of many global variables with context-dependent counterparts. This
allows isolated, private resources in different contexts.

So a VE is essentially a context plus a set of its variables/settings,
which include, but are not limited to, its own process tree, files, IPC
resources, IP routing, network devices and such.
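As a rough illustration (the structure and field names below are
hypothetical, not the actual OpenVZ or Linux-VServer code), such a
context amounts to something like:

#include <asm/atomic.h>

struct uts_namespace;   /* private utsname data */
struct ipc_namespace;   /* private SysV IPC objects */
struct net_context;     /* private devices/routes, or isolation rules */

/* Sketch only: a per-VE context aggregating the private resources. */
struct ve_struct {
        int                     veid;       /* VE identifier */
        atomic_t                refcount;
        struct uts_namespace    *uts_ns;
        struct ipc_namespace    *ipc_ns;
        struct net_context      *net;
        /* ... resource limits, proc/sysctl views, etc. */
};

/* Code that used to touch a global, e.g. system_utsname.nodename, would
 * instead go through the current task's context, e.g.
 * current->ve->uts_ns->name.nodename (again, hypothetical field names). */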

A full virtualization solution consists of:
- virtualization of resources, i.e. private contexts
- resource controls, for limiting contexts
- management tools

This kind of virtualization solution is implemented by the OpenVZ
(http://openvz.org) and Linux-VServer (http://linux-vserver.org) projects.

Summary of previous discussions on LKML
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- we agreed upon doing virtualization of each kernel subsystem
separately, not as a single virtual environment.
- we almost agreed upon calling the virtualization of subsystems
"namespaces".
- we were discussing whether we should have a global namespace context,
like 'current', or pass the context as an argument to all functions
which require it (see the sketch below).
- we didn't agree on whether we need a config option and the ability to
compile the kernel w/o virtual namespaces.
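The third point is easiest to see in code. A tiny sketch of the two
plumbing styles (the uts field on the task is hypothetical, purely for
illustration):

#include <linux/sched.h>
#include <linux/string.h>
#include <linux/utsname.h>

/* Style A: implicit context -- the code digs the per-context data out
 * of 'current' (current->uts is a hypothetical field for this sketch),
 * so existing call chains stay untouched. */
static void get_nodename_implicit(char *buf, size_t len)
{
        strlcpy(buf, current->uts->nodename, len);
}

/* Style B: explicit context -- every caller passes the data down, which
 * makes the dependency visible but touches many function prototypes. */
static void get_nodename_explicit(struct new_utsname *uts,
                                  char *buf, size_t len)
{
        strlcpy(buf, uts->nodename, len);
}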

Thanks,
Kirill


2006-03-28 11:16:58

by Nick Piggin

Subject: Re: [RFC] Virtualization steps

Herbert Poetzl wrote:

> well, that largely depends on the 'use' ...
>
> I don't think that vps providers like lycos would be
> very happy if they had to multiply the ammount of
> machines they require by 10 or 20 :)
>
> and yes, running 100 and more Linux-VServers on a
> single machine _is_ realistic ...
>

Yep.

And if it is intrusive to the core kernel, then as always we have
to try to evaluate the question "is it worth it"? How many people
want it and what alternatives do they have (eg. maintaining
separate patches, using another approach), what are the costs and
complexities to other users and developers, etc.

--
SUSE Labs, Novell Inc.

2006-03-28 11:37:35

by Nick Piggin

Subject: Re: [RFC] Virtualization steps

Kirill Korotaev wrote:
>
> Nick, will be glad to shed some light on it.
>

Thanks very much Kirill.

I don't think I'm qualified to make any decisions about this,
so I don't want to detract from the real discussions, but I
just had a couple more questions:

> First of all, what it does which low level virtualization can't:
> - it allows to run 100 containers on 1GB RAM
> (it is called containers, VE - Virtual Environments,
> VPS - Virtual Private Servers).
> - it has no much overhead (<1-2%), which is unavoidable with hardware
> virtualization. For example, Xen has >20% overhead on disk I/O.

Are any future hardware solutions likely to improve these problems?

>
> OS kernel virtualization
> ~~~~~~~~~~~~~~~~~~~~~~~~

Is this considered secure enough that multiple untrusted VEs are run
on production systems?

What kind of users want this, who can't use alternatives like real
VMs?

> Summary of previous discussions on LKML
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Have their been any discussions between the groups pushing this
virtualization, and important kernel developers who are not part of
a virtualization effort? Ie. is there any consensus about the
future of these patches?

Thanks,
Nick

--
SUSE Labs, Novell Inc.

2006-03-28 12:53:54

by Serge E. Hallyn

Subject: Re: [RFC] Virtualization steps

Quoting Kirill Korotaev ([email protected]):
> >>so IMHO, we should make a kernel branch (Eric or Sam
> >>are probably willing to maintain that), which we keep
> >>in-sync with mainline (not necessarily git, but at
> >>least snapshot wise), where we put all the patches
> >>we agree on, and each party should then adjust the
> >>existing solution to this kernel, so we get some deep
> >>testing in the process, and everybody can see if it
> >>'works' for him or not ...
> >
> >ACK. A collection of patches that we can all agree
> >on sounds like something worth aiming for.
> >
> >It looks like Kirill last round of patches can form
> >a nucleus for that. So far I have seem plenty of technical
> >objects but no objections to the general direction.
> yup, I will fix everything and will come with a set of patches for IPC,
> so we could select which way is better to do it :)
>
> >So agreement appears possible.
> Nice to hear this!
>
> Eric, we have a GIT repo on openvz.org already:
> http://git.openvz.org
>
> we will create a separate branch also called -acked, where patches
> agreed upon will go.

That's ok by me. If a more neutral name/site were preferred, we could
use the sf.net site we had finally gotten around to setting up -
http://www.sf.net/projects/lxc (LinuX Containers). Unfortunately that would
likely be just a quilt patch repository.

A wiki + git repository would be ideal.

-serge

2006-03-28 14:26:41

by Herbert Poetzl

Subject: Re: [RFC] Virtualization steps

On Tue, Mar 28, 2006 at 07:00:25PM +1000, Nick Piggin wrote:
> Herbert Poetzl wrote:
>
> >well, that largely depends on the 'use' ...
> >
> >I don't think that vps providers like lycos would be
> >very happy if they had to multiply the ammount of
> >machines they require by 10 or 20 :)
> >
> >and yes, running 100 and more Linux-VServers on a
> >single machine _is_ realistic ...
> >
>
> Yep.
>
> And if it is intrusive to the core kernel, then as always we have to
> try to evaluate the question "is it worth it"? How many people want it
> and what alternatives do they have (eg. maintaining seperate patches,
> using another approach), what are the costs, complexities, to other
> users and developers etc.

my words, but let me ask, what do you consider 'intrusive'?

best,
Herbert

> --
> SUSE Labs, Novell Inc.

2006-03-28 14:39:09

by Bill Davidsen

Subject: Re: [RFC] Virtualization steps

Kirill Korotaev wrote:

>> Frankly I don't see running 100 VMs as a realistic goal, being able
>> to run Linux, Windows, Solaris and BEOS unmodified in 4-5 VMs would
>> be far more useful.
>
> It is more than realistic. Hosting companies run more than 100 VPSs in
> reality. There are also other usefull scenarios. For example, I know
> the universities which run VPS for every faculty web site, for every
> department, mail server and so on. Why do you think they want to run
> only 5VMs on one machine? Much more!

I made no comment on what "they" might want; I want to make the rack of
underutilized Windows, BSD and Solaris servers go away. An approach
which doesn't support unmodified guest installs doesn't solve any of my
current problems. I didn't say it was in any way not useful, just not of
interest to me. What needs I have for Linux environments are met by
jails and/or UML.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2006-03-28 15:06:34

by Eric W. Biederman

Subject: Re: [RFC] Virtualization steps

Bill Davidsen <[email protected]> writes:

> Kirill Korotaev wrote:
>
>>> Frankly I don't see running 100 VMs as a realistic goal, being able to run
>>> Linux, Windows, Solaris and BEOS unmodified in 4-5 VMs would be far more
>>> useful.
>>
>> It is more than realistic. Hosting companies run more than 100 VPSs in
>> reality. There are also other usefull scenarios. For example, I know the
>> universities which run VPS for every faculty web site, for every department,
>> mail server and so on. Why do you think they want to run only 5VMs on one
>> machine? Much more!
>
> I made no commont on what "they" might want, I want to make the rack of
> underutilized Windows, BSD and Solaris servers go away. An approach which
> doesn't support unmodified guest installs doesn't solve any of my current
> problems. I didn't say it was in any way not useful, just not of interest to
> me. What needs I have for Linux environments are answered by jails and/or UML.

So from one perspective that is what we are building: a full-featured
jail capable of running an unmodified Linux distro. The cost is
simply making a way to use the same names twice for the global
namespaces. UML may use these features to accelerate its own processes.

Virtualization is really the wrong word to describe what we are building,
as it allows for all kinds of heavyweight implementations and has an
association with much heavier things.

At the extreme end, where you only have one process in each logical instance
of the kernel, a better name would be a heavyweight process, where each
such process sees an environment as if it owned the entire machine.
Eric

2006-03-28 15:36:01

by Herbert Poetzl

Subject: Re: [RFC] Virtualization steps

On Tue, Mar 28, 2006 at 07:15:17PM +1000, Nick Piggin wrote:
> Kirill Korotaev wrote:
> >
> >Nick, will be glad to shed some light on it.
> >
>
> Thanks very much Kirill.
>
> I don't think I'm qualified to make any decisions about this,
> so I don't want to detract from the real discussions, but I
> just had a couple more questions:
>
> >First of all, what it does which low level virtualization can't:
> >- it allows to run 100 containers on 1GB RAM
> > (it is called containers, VE - Virtual Environments,
> > VPS - Virtual Private Servers).
> >- it has no much overhead (<1-2%), which is unavoidable with hardware
> > virtualization. For example, Xen has >20% overhead on disk I/O.
>
> Are any future hardware solutions likely to improve these problems?

not really, but as you know, "640K ought to be enough
for anybody", so maybe future hardware developments will
make shared resources possible (with different kernels)

> >OS kernel virtualization
> >~~~~~~~~~~~~~~~~~~~~~~~~
>
> Is this considered secure enough that multiple untrusted VEs are run
> on production systems?

definitely! there are many, many hosting providers
using exactly this technology to provide Virtual Private
Servers for their customers, in production of course

> What kind of users want this, who can't use alternatives like real
> VMs?

well, the same users who do not want to use Bochs for
emulating a PC on a PC, when they can use UML for example,
because it's much faster and easier to use ...

aside from that, Linux-VServer for example, is not only
designed to create complete virtual servers, it also
works for service separation and increasing security for
many applications, like for example:

- test environments (one guest per distro)
- service separation (one service per 'container')
- resource management and accounting

> >Summary of previous discussions on LKML
> >~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Have their been any discussions between the groups pushing this
> virtualization, and ...

yes, the discussions are ongoing ... maybe to clarify the
situation for the folks not involved (projects in
alphabetical order):

FreeVPS (Free Virtual Private Server Solution):
===============================================
[http://www.freevps.com/]
not pushing for inclusion, early Linux-VServer
spinoff, partially maintained but they seem to have
other interests lately

Alex Lyashkov (FreeVPS kernel maintainer)
[Positive Software Corporation http://www.freevps.com/]

BSD Jail LSM (Linux-Jails security module):
===========================================
[http://kerneltrap.org/node/3823]

Serge E. Hallyn (Patch/Module maintainer) [IBM]
interested in some kind of mainline solution

Dave Hansen (IBM Linux Technology Center)
interested in virtualization for context/container
migration

Linux-VServer (community project, maintained):
==============================================
[http://linux-vserver.org/]

Jacques Gelinas (previous VServer maintainer)
not pushing for inclusion

Herbert Poetzl (Linux-VServer kernel maintainer)
not pushing for inclusion, but I want to make damn
sure that no bloat gets into the kernel and that
the mainline efforts will be usable for
Linux-VServer and similar ...

Sam Vilain (Refactoring Linux-VServer patches)
[Catalyst http://catalyst.net.nz/]
trying hard to provide a simple/minimalistic version
of Linux-VServer for mainline

many others, not really pushing anything here :)

OpenVZ (open project, maintained, subset of Virtuozzo(tm)):
===========================================================
[http://openvz.org/]

Kir Kolyshkin (OpenVZ maintainer):
[SWsoft http://www.swsoft.com I guess?]
maybe pushing for inclusion ...

Kirill Korotaev (OpenVZ/Virtuozzo kernel developer?)
[SWsoft http://www.swsoft.com]
heavily pushing for inclusion ...

Alexey Kuznetsov (Chief Software Engineer)
[SWsoft http://www.swsoft.com]
not pushing but supporting company interests

PID Virtualization (kernel branch for inclusion):
=================================================

Eric W. Biederman (branch developer/maintainer)
[XMission http://xmission.com/]

Virtuozzo(tm) (Commercial solution from SWsoft):
================================================
[http://www.virtuozzo.com/]

not involved yet, except via OpenVZ

Stanislav Protassov (Director of Engineering)
[SWsoft http://www.swsoft.com]


A ton of IBM and VZ folks are not listed here, but I
guess you can figure who is who from the email addresses

there are also a bunch of folks from Columbia and
Princeton university interested and/or involved in
kernel level virtualization and context migration.

please extend this list where appropriate, I'm pretty
sure I forgot at least five important/involved persons

> important kernel developers who are not part of a virtualization
> effort?

no idea, probably none for now ...

> Ie. is there any consensus about the future of these patches?

what patches? what future?

HTC,
Herbert

> Thanks,
> Nick
>
> --
> SUSE Labs, Novell Inc.

2006-03-28 15:49:30

by Matt Ayres

Subject: Re: [Devel] Re: [RFC] Virtualization steps



Kirill Korotaev wrote:
>> Oh, after you come to an agreement and start posting patches, can you
>> also outline why we want this in the kernel (what it does that low
>> level virtualization doesn't, etc, etc), and how and why you've agreed
>> to implement it. Basically, some background and a summary of your
>> discussions for those who can't follow everything. Or is that a faq
>> item?
> Nick, will be glad to shed some light on it.
>
> First of all, what it does which low level virtualization can't:
> - it allows to run 100 containers on 1GB RAM
> (it is called containers, VE - Virtual Environments,
> VPS - Virtual Private Servers).
> - it has no much overhead (<1-2%), which is unavoidable with hardware
> virtualization. For example, Xen has >20% overhead on disk I/O.

I think the Xen guys would disagree with you on this. Xen claims <3%
overhead on the XenSource site.

Where did you get these figures from? What Xen version did you test?
What was your configuration? Did you have kernel debugging enabled? You
can't just post numbers without the data to back them up, especially when
they conflict greatly with the Xen developers' statements. AFAIK Xen is
well on its way to inclusion into the mainstream kernel.

Thank you,
Matt Ayres

2006-03-28 16:19:01

by Eric W. Biederman

Subject: Re: [RFC] Virtualization steps

Nick Piggin <[email protected]> writes:

> Kirill Korotaev wrote:
>> Nick, will be glad to shed some light on it.
>>
>
> Thanks very much Kirill.
>
> I don't think I'm qualified to make any decisions about this,
> so I don't want to detract from the real discussions, but I
> just had a couple more questions:
>
>> First of all, what it does which low level virtualization can't:
>> - it allows to run 100 containers on 1GB RAM
>> (it is called containers, VE - Virtual Environments,
>> VPS - Virtual Private Servers).
>> - it has no much overhead (<1-2%), which is unavoidable with hardware
>> virtualization. For example, Xen has >20% overhead on disk I/O.
>
> Are any future hardware solutions likely to improve these problems?

This isn't a direct competition; the two solutions can coexist nicely.

The major efficiency differences are fundamental to the approaches and
can only be solved in software, not hardware. The fundamental efficiency
limits of low-level virtualization are that resources are not shared well
between instances (think how hard memory hotplug is to solve), that
running a kernel takes at least 1MB just for the kernel, and that no
matter how good your hypervisor is there will be some hardware interface
it doesn't virtualize.

Whereas what we are aiming at are just enough modifications to the kernel
to allow multiple instances of user space. We aren't virtualizing anything
that isn't already virtualized in the kernel.

>> OS kernel virtualization
>> ~~~~~~~~~~~~~~~~~~~~~~~~
>
> Is this considered secure enough that multiple untrusted VEs are run
> on production systems?

Kirill or Herbert can give a better answer, but that is one of the major
points of BSD jails and their kin, is it not?

> What kind of users want this, who can't use alternatives like real
> VMs?

Well, that question assumes a lot. The answer that assumes a lot
in the other direction is that adding additional unnecessary layers
just complicates the problem and slows things down for no reason,
while making it so you can't assume the solution is always present.
In addition, it is done in a non-portable way, so it is only available
on a few platforms.

I can't even think of a straight answer to the users question.

My users are in the high performance computing realm, and for that
subset it is easy. Xen and its kin don't virtualize the high-bandwidth,
low-latency communication hardware that is used, and that
may not even be possible. Using a hypervisor in a situation like that
certainly isn't general or easily maintainable. (Think about
what a challenge it has been to get usable InfiniBand drivers merged.)

>> Summary of previous discussions on LKML
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Have their been any discussions between the groups pushing this
> virtualization, and important kernel developers who are not part of
> a virtualization effort? Ie. is there any consensus about the
> future of these patches?

Yes, but just enough to give us hope :)

Unless you count the mount namespace as part of this in which case
pieces are already merged.

The challenge is that writing kernel code that does this is
easy. Writing kernel code that is mergeable and that the different
groups all agree meets their requirements is much harder. It has
taken us until now to have a basic approach that we all agree on.
Now we get to beat each other up over the technical details :)

Eric

2006-03-28 16:33:54

by Eric W. Biederman

Subject: Re: [RFC] Virtualization steps

Herbert Poetzl <[email protected]> writes:

> PID Virtualization (kernel branch for inclusion):
> =================================================
>
> Eric W. Biederman (branch developer/maintainer)
> [XMission http://xmission.com/]

Actually I work for Linux Networx http://www.lnxi.com
XMission is just my ISP. I find it easier to work from
home. :)

Eric

2006-03-28 16:44:50

by Eric W. Biederman

Subject: Re: [Devel] Re: [RFC] Virtualization steps

Matt Ayres <[email protected]> writes:

> I think the Xen guys would disagree with you on this. Xen claims <3% overhead
> on the XenSource site.
>
> Where did you get these figures from? What Xen version did you test? What was
> your configuration? Did you have kernel debugging enabled? You can't just post
> numbers without the data to back it up, especially when it conflicts greatly
> with the Xen developers statements. AFAIK Xen is well on it's way to inclusion
> into the mainstream kernel.

It doesn't matter. The proof that Xen has more overhead is trivial:
Xen does more, and Xen clients don't share resources well.

Nor is this about Xen vs. what we are doing. These are different,
non-conflicting approaches that operate in completely different
ways and solve different sets of problems.

Xen is about multiple kernels.

The alternative is a souped-up chroot.

Eric

2006-03-28 17:05:28

by Matt Ayres

Subject: Re: [Devel] Re: [RFC] Virtualization steps



Eric W. Biederman wrote:
> Matt Ayres <[email protected]> writes:
>
>> I think the Xen guys would disagree with you on this. Xen claims <3% overhead
>> on the XenSource site.
>>
>> Where did you get these figures from? What Xen version did you test? What was
>> your configuration? Did you have kernel debugging enabled? You can't just post
>> numbers without the data to back it up, especially when it conflicts greatly
>> with the Xen developers statements. AFAIK Xen is well on it's way to inclusion
>> into the mainstream kernel.
>
> It doesn't matter. The proof that Xen has more overhead is trivial
> Xen does more, and Xen clients don't share resources well.
>

I understand the difference. It was more about Kirill grabbing numbers
out of the air. I actually think the containers and Xen complement each
other very well. As Xen is now based on 2.6.16 (as are both VServer and
OVZ) it makes sense to run a few Xen domains that then in turn run
containers in some scenarios. As far as the last part, Xen doesn't
share resources at all :)

Thank you,
Matt Ayres

2006-03-28 17:48:13

by Jeff Dike

Subject: Re: [RFC] Virtualization steps

On Tue, Mar 28, 2006 at 08:03:34AM -0700, Eric W. Biederman wrote:
> UML may use these features to accelerate it's own processes.

And I'm planning on doing exactly that.

Jeff

2006-03-28 20:22:14

by Nick Piggin

Subject: Re: [RFC] Virtualization steps

Herbert Poetzl wrote:
> On Tue, Mar 28, 2006 at 07:00:25PM +1000, Nick Piggin wrote:

>>And if it is intrusive to the core kernel, then as always we have to
>>try to evaluate the question "is it worth it"? How many people want it
>>and what alternatives do they have (eg. maintaining seperate patches,
>>using another approach), what are the costs, complexities, to other
>>users and developers etc.
>
>
> my words, but let me ask, what do you consider 'intrusive'?
>

I don't think I could give a complete answer...
I guess it could be stated as the increase in the complexity of
the rest of the code for someone who doesn't know anything about
the virtualization implementation.

Completely non-intrusive is something like 2 extra function calls
to/from generic code, data structure changes that are transparent
(or have simple wrappers), and no shared locking or data
with the rest of the kernel. And it goes up from there.

Anyway I'm far from qualified... I just hope that with all the
work you guys are putting in that you'll be able to justify it ;)

--
SUSE Labs, Novell Inc.

2006-03-28 20:26:10

by Jun OKAJIMA

Subject: Re: [Devel] Re: [RFC] Virtualization steps

>
>I'll summarize it this way: low-level virtualization uses resource
>inefficiently.
>
>With this higher-level stuff, you get to share all of the Linux caching,
>and can do things like sharing libraries pretty naturally.
>
>They are also much lighter-weight to create and destroy than full
>virtual machines. We were planning on doing some performance
>comparisons versus some hypervisors like Xen and the ppc64 one to show
>scaling with the number of virtualized instances. Creating 100 of these
>Linux containers is as easy as a couple of shell scripts, but we still
>can't find anybody crazy enough to go create 100 Xen VMs.
>
>Anyway, those are the things that came to my mind first. I'm sure the
>others involved have their own motivations.
>

Some questions.

1. Your point is right in some ways, and I agree with you.
Yes, I currently think a jail is more practical than Xen.
Xen sounds cool, but is it really practical? I have some doubts,
though that may be a narrow view.
How do you estimate the future improvement of memory sharing
in VMs (e.g. Xen/VMware)?
I have seen that there are many papers about this issue.
If memory sharing ever becomes much more efficient, Xen possibly wins.

2. Folks, what do you think about the other good points of Xen,
like live migration, running Solaris, suspend/resume, and so on?
No Linux jail has such features for now, although I don't think
they are impossible with a jail.


My current suggestion is:

1. Don't use Xen for running multiple VMs.
2. Use Xen for better admin/operation/deploy... tools.
3. If you need multiple VMs, use jail on Xen.

--- Okajima, Jun. Tokyo, Japan.
http://www.digitalinfra.co.jp/
http://www.colinux.org/
http://www.machboot.com/

2006-03-28 20:50:30

by Kir Kolyshkin

Subject: Re: [Devel] Re: [RFC] Virtualization steps

Jun OKAJIMA wrote:

>>I'll summarize it this way: low-level virtualization uses resource
>>inefficiently.
>>
>>With this higher-level stuff, you get to share all of the Linux caching,
>>and can do things like sharing libraries pretty naturally.
>>
>>They are also much lighter-weight to create and destroy than full
>>virtual machines. We were planning on doing some performance
>>comparisons versus some hypervisors like Xen and the ppc64 one to show
>>scaling with the number of virtualized instances. Creating 100 of these
>>Linux containers is as easy as a couple of shell scripts, but we still
>>can't find anybody crazy enough to go create 100 Xen VMs.
>>
>>Anyway, those are the things that came to my mind first. I'm sure the
>>others involved have their own motivations.
>>
>>
>>
>
>Some questions.
>
>1. Your point is rignt in some ways, and I agree with you.
> Yes, I currently guess Jail is quite practical than Xen.
> Xen sounds cool, but really practical? I doubt a bit.
> But it would be a narrow thought, maybe.
> How you estimate feature improvement of memory shareing
> on VM ( e.g. Xen/VMware)?
> I have seen there are many papers about this issue.
> If once memory sharing gets much efficient, Xen possibly wins.
>
>
This is not just about memory sharing. Dynamic resource management is
hardly possible in a model where you have multiple kernels running; all
of those kernels were designed to run on dedicated hardware. As was
pointed out, adding/removing memory from a Xen guest at runtime is
tricky.

Finally, the multiple-kernels-on-top-of-a-hypervisor architecture is just
more complex and has more overhead than one-kernel-with-many-namespaces.

>2. Folks, how you think about other good points of Xen,
> like live migration, or runs solaris, or has suspend/resume or...
>
>
OpenVZ will have live zero downtime migration and suspend/resume some
time next month.

> No Linux jails have such feature for now, although I dont think
> it is impossible with jail.
>
>
>My current suggestion is,
>
>1. Dont use Xen for running multiple VMs.
>2. Use Xen for better admin/operation/deploy... tools.
>
>
This point is controversial. Tools are tools -- they can be made to
support Xen, Linux VServer, UML, OpenVZ, VMware -- or even all of them!

But anyway, speaking of tools and better admin operations, what does it
take to create a Xen domain (I mean create all those files needed to run
a new Xen domain), and how much time does it take? Say, in OpenVZ,
creating a VE (Virtual Environment) is a matter of unpacking a ~100MB
tarball and copying a 1K config file -- which essentially means one can
create a VE in a minute. Linux-VServer should be pretty much the same.

Another concern is, yes, manageability. In the OpenVZ model the host system
can easily access all the VPSs' files, making, say, a mass software
update a reality. You can easily change some settings in 100+ VEs very
easily. In systems based on Xen and, say, VMware, one has to log in to
each system, one by one, to administer them, which is not unlike the
'separate physical server' model.

>3. If you need multiple VMs, use jail on Xen.
>
>
Indeed, a mixed approach is very interesting. You can run OpenVZ or
Linux-VServer in a Xen domain, that makes a lot of sense.

2006-03-28 21:07:11

by Nick Piggin

Subject: Re: [RFC] Virtualization steps

Herbert Poetzl wrote:
> On Tue, Mar 28, 2006 at 07:15:17PM +1000, Nick Piggin wrote:

[...]

Thanks for the clarifications, Herbert.

>>Ie. is there any consensus about the future of these patches?
>
>
> what patches?

The ones being thrown around on lkml, and future ones being talked about.
Patches ~= changes to the kernel.

> what future?

I presume everyone's goal is to get something into the kernel?

--
SUSE Labs, Novell Inc.

2006-03-28 21:35:09

by Jun OKAJIMA

Subject: Re: [Devel] Re: [RFC] Virtualization steps

>
>>2. Folks, how you think about other good points of Xen,
>> like live migration, or runs solaris, or has suspend/resume or...
>>
>>
>OpenVZ will have live zero downtime migration and suspend/resume some
>time next month.
>

COOL!!!!

>>
>>1. Dont use Xen for running multiple VMs.
>>2. Use Xen for better admin/operation/deploy... tools.
>>
>>
>This point is controversial. Tools are tools -- they can be made to
>support Xen, Linux VServer, UML, OpenVZ, VMware -- or even all of them!
>
>But anyway, speaking of tools and better admin operations, what it takes
>to create a Xen domain (I mean create all those files needed to run a
>new Xen domain), and how much time it takes? Say, in OpenVZ creation of
>a VE (Virtual Environment) is a matter of unpacking a ~100MB tarball and
>copying 1K config file -- which essentially means one can create a VE in
>a minute. Linux-VServer should be pretty much the same.
>
>Another concern is, yes, manageability. In OpenVZ model the host system
>can easily access all the VPSs' files, making, say, a mass software
>update a reality. You can easily change some settings in 100+ VEs very
>easy. In systems based on Xen and, say, VMware one should log in into
>each system, one by one, to administer them, which is not unlike the
>'separate physical server' model.
>
>>3. If you need multiple VMs, use jail on Xen.
>>
>>
>Indeed, a mixed approach is very interesting. You can run OpenVZ or
>Linux-VServer in a Xen domain, that makes a lot of sense.
>
>

Sorry for causing a misunderstanding.
What I wanted to say with "2" (use Xen as a tool) is probably the same as
what you are guessing now.
I mean, you make a server like this:
1. Install a jailed Linux (OpenVZ/VServer/...) on Xen.
2. Make only one domU, and many VMs on this domU with jails.
3. Run many (more than 100 or so) VMs with jails, not with Xen.
4. But if, for example, you want to migrate to another PC,
use Xen live migration.
The fourth point would make administration tasks easier. This is what I
meant by "better tool".
There is another use of Xen as an admin tool. For example, if you need a
2.6 device driver (e.g. a new iSCSI H/W driver or gigabit ethernet or...),
but don't need any other 2.6 functionality, keep the guest OS (domU) at
2.4, and make dom0 a 2.6 Xen kernel. This also helps with admin tasks.
Probably the biggest problem for now is that the Xen patch conflicts with
the VServer/OpenVZ patches.


--- Okajima, Jun. Tokyo, Japan.

2006-03-28 21:54:56

by Eric W. Biederman

Subject: Re: [Devel] Re: [RFC] Virtualization steps

Jun OKAJIMA <[email protected]> writes:

> Probably, the biggest problem for now is, Xen patch conflicts with
> Vserver/OpenVZ patch.

The implementations are sufficiently different that I don't
see Xen and any jail patch really conflicting. There might be some
trivial conflicts in /proc, but even that seems unlikely.

Eric

2006-03-28 21:59:27

by Sam Vilain

Subject: Re: [Devel] Re: [RFC] Virtualization steps

On Tue, 2006-03-28 at 10:45 +0400, Kir Kolyshkin wrote:
> It is actually not a future goal, but rather a reality. Since os-level
> virtualization overhead is very low (1-2 per cent or so), one can run
> hundreds of VEs.

Huh? You managed to measure it!? Or do you just mean "negligible" by
"1-2 per cent" ? :-)

Sam.

2006-03-28 22:06:56

by Eric W. Biederman

Subject: Re: [RFC] Virtualization steps

Herbert Poetzl <[email protected]> writes:

>> - network virtualization
>
> here I see many issues, as for example Linux-VServer
> does not necessarily aim for full virtualization, when
> simple and performant isolation is sufficient.

The current technique employed by vserver is implementable
in a security module today. We are implementing each of
these pieces as a separate namespace. So actually using
any one of them is optional. So implementing your current
method of network isolation in a security module should be
straightforward.
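
A minimal sketch of that idea (not Linux-VServer's actual code;
ctx_allows_addr() is a hypothetical helper that knows which IPv4
addresses the caller's context owns):

#include <linux/security.h>
#include <linux/sched.h>
#include <linux/net.h>
#include <linux/in.h>

/* hypothetical: does the caller's context own this IPv4 address? */
extern int ctx_allows_addr(struct task_struct *tsk, u32 addr);

static int jail_socket_bind(struct socket *sock,
                            struct sockaddr *address, int addrlen)
{
        struct sockaddr_in *sin = (struct sockaddr_in *)address;

        if (address->sa_family != AF_INET)
                return 0;
        if (!ctx_allows_addr(current, sin->sin_addr.s_addr))
                return -EPERM;  /* address not assigned to this context */
        return 0;
}

static struct security_operations jail_ops = {
        .socket_bind = jail_socket_bind,
};

static int __init jail_init(void)
{
        return register_security(&jail_ops);
}
security_initcall(jail_init);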

Eric

2006-03-28 22:24:08

by Kir Kolyshkin

[permalink] [raw]
Subject: Re: [Devel] Re: [RFC] Virtualization steps

Sam Vilain wrote:

>On Tue, 2006-03-28 at 10:45 +0400, Kir Kolyshkin wrote:
>
>
>>It is actually not a future goal, but rather a reality. Since os-level
>>virtualization overhead is very low (1-2 per cent or so), one can run
>>hundreds of VEs.
>>
>>
>
>Huh? You managed to measure it!? Or do you just mean "negligible" by
>"1-2 per cent" ? :-)
>
>
We run different tests to measure OpenVZ/Virtuozzo overhead, as we care
a lot about that stuff. I do not remember all the gory details at the
moment, but I gave the correct numbers: "1-2 per cent or so".

There are things such as networking (OpenVZ's venet device) overhead,
fair CPU scheduler overhead, and so on.

Why do you think it can not be measured? It either can be, or it is too
low to be measured reliably (a fraction of a per cent or so).

Regards,
Kir.

2006-03-28 22:50:53

by Sam Vilain

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

On Tue, 2006-03-28 at 12:51 +0400, Kirill Korotaev wrote:
> we will create a separate branch also called -acked, where patches
> agreed upon will go.

No need. Just use Acked-By: comments.

Also, can I give some more feedback on the way you publish your patches:

1. git's replication uses the notion of a forward-only commit list.
So, if you change patches or rebase them then you have to rewind
the base point - which in pure git terms means create a new head.
So, you should use the convention of putting some identifier - a
date, or a version number - in each head.

2. Why do you have a separate repository for your normal openvz and the
    -ms trees? You can just use different heads.

3. Apache was doing something weird to the HEAD symlink in your
repository. (mind you, if you adopt notion 1., this becomes
irrelevant :-))

Otherwise, it's a great thing to see your patches published via git!

I can't recommend Stacked Git more highly for performing the 'winding'
of the patch stack necessary for revising patches. Google for "stgit".

Sam.

2006-03-28 23:04:09

by Sam Vilain

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

On Tue, 2006-03-28 at 19:15 +1000, Nick Piggin wrote:
> Kirill Korotaev wrote:
> > First of all, what it does which low level virtualization can't:
> > - it allows to run 100 containers on 1GB RAM
> > (it is called containers, VE - Virtual Environments,
> > VPS - Virtual Private Servers).
> > - it has no much overhead (<1-2%), which is unavoidable with hardware
> > virtualization. For example, Xen has >20% overhead on disk I/O.
> Are any future hardware solutions likely to improve these problems?

No, not all of them.

> > OS kernel virtualization
> > ~~~~~~~~~~~~~~~~~~~~~~~~
> Is this considered secure enough that multiple untrusted VEs are run
> on production systems?

Yes, hosting providers have been deploying this technology for years.

> What kind of users want this, who can't use alternatives like real
> VMs?

People who want low overhead and the administrative benefits of only
running a single kernel and not umpteen. For instance visibility from
the host into the guests' filesystems is a huge advantage, even if the
performance benefits can be magically overcome somehow.

> > Summary of previous discussions on LKML
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Have their been any discussions between the groups pushing this
> virtualization, and important kernel developers who are not part of
> a virtualization effort? Ie. is there any consensus about the
> future of these patches?

Plenty recently. Check for threads involving (the people on the CC list
to the head of this thread) this year.

Comparing Xen/VMI with Vserver/OpenVZ is comparing apples with orchards.
May I refer you to some slides for a talk I gave at Linux.conf.au about
Vserver: http://utsl.gen.nz/talks/vserver/slide17a.html

Sam.




2006-03-28 23:07:29

by Sam Vilain

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

On Tue, 2006-03-28 at 09:41 -0500, Bill Davidsen wrote:
> > It is more than realistic. Hosting companies run more than 100 VPSs in
> > reality. There are also other usefull scenarios. For example, I know
> > the universities which run VPS for every faculty web site, for every
> > department, mail server and so on. Why do you think they want to run
> > only 5VMs on one machine? Much more!
>
> I made no commont on what "they" might want, I want to make the rack of
> underutilized Windows, BSD and Solaris servers go away. An approach
> which doesn't support unmodified guest installs doesn't solve any of my
> current problems. I didn't say it was in any way not useful, just not of
> interest to me. What needs I have for Linux environments are answered by
> jails and/or UML.

We are talking about adding jail technology, also known as containers on
Solaris and vserver/openvz on Linux, to the mainline kernel.

So, you are obviously interested!

Because of course, you can take an unmodified filesystem of the guest
and, assuming the kernels are compatible, run it without changes. I
find this consolidation approach indispensable.

Sam.

2006-03-28 23:17:58

by Sam Vilain

[permalink] [raw]
Subject: Re: [Devel] Re: [RFC] Virtualization steps

On Tue, 2006-03-28 at 14:51 -0700, Eric W. Biederman wrote:
> Jun OKAJIMA <[email protected]> writes:
>
> > Probably, the biggest problem for now is, Xen patch conflicts with
> > Vserver/OpenVZ patch.
>
> The implementations are significantly different enough that I don't
> see Xen and any jail patch really conflicting. There might be some
> trivial conflicts in /proc but even that seems unlikely.

This has been done before,

http://list.linux-vserver.org/archive/vserver/msg10235.html

Sam.

2006-03-28 23:28:11

by Sam Vilain

[permalink] [raw]
Subject: Re: [Devel] Re: [RFC] Virtualization steps

On Wed, 2006-03-29 at 02:24 +0400, Kir Kolyshkin wrote:
> >Huh? You managed to measure it!? Or do you just mean "negligible" by
> >"1-2 per cent" ? :-)
> We run different tests to measure OpenVZ/Virtuozzo overhead, as we do
> care much for that stuff. I do not remember all the gory details at the
> moment, but I gave the correct numbers: "1-2 per cent or so".
>
> There are things such as networking (OpenVZ's venet device) overhead, a
> fair cpu scheduler overhead, something else.
>
> Why do you think it can not be measured? It either can be, or it is too
> low to be measured reliably (a fraction of a per cent or so).

Well, for instance the fair CPU scheduling overhead is so tiny it may as
well not be there in the VServer patch. It's just a per-vserver TBF
that feeds back into the priority (and hence timeslice length) of the
process. ie, you get "CPU tokens" which deplete as processes in your
vserver run and you either get a boost or a penalty depending on the
level of the tokens in the bucket. This doesn't provide guarantees, but
works well for many typical workloads. And once Herbert fixed the SMP
cacheline problems in my code ;) it was pretty much full speed. That
is, until you want it to sacrifice overall performance for enforcing
limits.

How does your fair scheduler work? Do you just keep a runqueue for each
vps?

To be honest, I've never needed to determine whether its overhead is 1%
or 0.01%, it would just be a meaningless benchmark anyway :-). I know
it's "good enough for me".

Sam.

2006-03-29 00:55:14

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [Devel] Re: [RFC] Virtualization steps

> Kirill Korotaev wrote:
>>> Oh, after you come to an agreement and start posting patches, can you
>>> also outline why we want this in the kernel (what it does that low
>>> level virtualization doesn't, etc, etc), and how and why you've agreed
>>> to implement it. Basically, some background and a summary of your
>>> discussions for those who can't follow everything. Or is that a faq
>>> item?
>> Nick, will be glad to shed some light on it.
>>
>> First of all, what it does which low level virtualization can't:
>> - it allows to run 100 containers on 1GB RAM
>> (it is called containers, VE - Virtual Environments,
>> VPS - Virtual Private Servers).
>> - it has no much overhead (<1-2%), which is unavoidable with hardware
>> virtualization. For example, Xen has >20% overhead on disk I/O.
>
> I think the Xen guys would disagree with you on this. Xen claims <3%
> overhead on the XenSource site.
>
> Where did you get these figures from? What Xen version did you test?
> What was your configuration? Did you have kernel debugging enabled? You
> can't just post numbers without the data to back it up, especially when
> it conflicts greatly with the Xen developers statements. AFAIK Xen is
> well on it's way to inclusion into the mainstream kernel.
I have no exact numbers at hand as I'm in another country right now.
But! We tested Xen not long ago with the iozone test suite and it gave
~20-30% disk I/O overhead. Recently we were testing the CPU scheduler,
and the EDF scheduler gave me 33% overhead on some very simple loads
with little more than busy loops inside the VMs. To my surprise, it also
did not provide any good fairness on a 2-CPU SMP system. You can object
to me, but it is better to simply retest it yourself if you are
interested. There were other tests as well, which reported very
different overheads on Xen 3. I suppose the Xen guys do such
measurements themselves, no?
And I'm sure they are constantly improving it; they are doing good work
on it.

Thanks,
Kirill

2006-03-29 01:39:14

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Nick,

>> First of all, what it does which low level virtualization can't:
>> - it allows to run 100 containers on 1GB RAM
>> (it is called containers, VE - Virtual Environments,
>> VPS - Virtual Private Servers).
>> - it has no much overhead (<1-2%), which is unavoidable with hardware
>> virtualization. For example, Xen has >20% overhead on disk I/O.
>
> Are any future hardware solutions likely to improve these problems?
Probably you are aware of the VT-i/VT-x technologies and the virtualized
MMU and I/O MMU planned by Intel and AMD.
These features should improve performance somewhat, but there is still a
limit to how far the overhead can be reduced, since at least disk,
network, video and similar devices still have to be emulated.

>> OS kernel virtualization
>> ~~~~~~~~~~~~~~~~~~~~~~~~
>
> Is this considered secure enough that multiple untrusted VEs are run
> on production systems?
It is secure enough. What makes it secure? In general:
- virtualization, which makes resources private
- resource control, which limits a VE in its resource usage
In more technical detail, virtualization projects make user access (and
capability) checks stricter. Moreover, OpenVZ uses a "denied by default"
approach to make sure it is secure and VE users are not allowed anything
else.

Also, about 2-3 months ago we had a security review of the OpenVZ
project done by Solar Designer. So, in general, such a virtualization
approach should be no less secure than a VM-like one. VM core code is
bigger and there are plenty of chances for bugs there.

> What kind of users want this, who can't use alternatives like real
> VMs?
Many companies; I just can't share their names. But in general,
enterprise and hosting companies have no need to run different OSes on
the same machine. For them it is quite natural to use N machines for
Linux and M for Windows. And since VEs are much more lightweight and
easier to work with, they like them very much.

Just as an example, the OpenVZ core is running more than 300,000 VEs
worldwide.

Thanks,
Kirill

2006-03-29 06:06:36

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Nick Piggin <[email protected]> writes:

> I don't think I could give a complete answer...
> I guess it could be stated as the increase in the complexity of
> the rest of the code for someone who doesn't know anything about
> the virtualization implementation.
>
> Completely non intrusive is something like 2 extra function calls
> to/from generic code, changes to data structures are transparent
> (or have simple wrappers), and there is no shared locking or data
> with the rest of the kernel. And it goes up from there.
>
> Anyway I'm far from qualified... I just hope that with all the
> work you guys are putting in that you'll be able to justify it ;)

As I have been able to survey the work, the most common case
is replacing a global variable with a variable we lookup via
current.
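
A minimal sketch of that pattern, using utsname as the example (the
names here are illustrative, not the final API):

/* Before: one global for the whole machine. */
extern struct new_utsname system_utsname;

/* After (sketch): each task points at the namespace it belongs to. */
struct uts_namespace {
        struct new_utsname name;
        atomic_t count;
};

/* hypothetical accessor; a real patch would hang uts_ns off task_struct */
static inline struct new_utsname *utsname(void)
{
        return &current->uts_ns->name;
}

/* so a caller changes from  system_utsname.nodename
 * to                        utsname()->nodename      */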

That, plus using the security module infrastructure, lets you
implement the semantics in a pretty straightforward manner.

The only really intrusive part is that because we tickle the
code differently we see a different set of problems. Such
as the mess that is the proc and sysctl code, and the lack of
good resource limits.

But none of that is inherent to the problem it is just when
you use the kernel harder and have more untrusted users you
see a different set of problems.

Eric

2006-03-29 06:19:34

by Sam Vilain

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Eric W. Biederman wrote:

>That plus using the security module infrastructure you can
>implement the semantics pretty in a straight forward manner.
>
>

Yes, this is the essence of it all. Globals are bad, mmm'kay?

This raises a very interesting question. All those LSM globals,
shouldn't those be virtualisable, too? After all, isn't it natural to
want to apply a different security policy to different sets of processes?

I don't think anyone's done any work on this yet...

Man, fork() is going to get really expensive if we don't put in the
"process family" abstraction... but like you say, it comes later,
getting the semantics right comes first.

>The only really intrusive part is that because we tickle the
>code differently we see a different set of problems. Such
>as the mess that is the proc and sysctl code, and the lack of
>good resource limits.
>
>But none of that is inherent to the problem it is just when
>you use the kernel harder and have more untrusted users you
>see a different set of problems.
>
>

Indeed. Lots of old turds to clean up...

Sam.

2006-03-29 09:13:25

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [Devel] Re: [RFC] Virtualization steps

Sam,

>> Why do you think it can not be measured? It either can be, or it is too
>> low to be measured reliably (a fraction of a per cent or so).
>
> Well, for instance the fair CPU scheduling overhead is so tiny it may as
> well not be there in the VServer patch. It's just a per-vserver TBF
> that feeds back into the priority (and hence timeslice length) of the
> process. ie, you get "CPU tokens" which deplete as processes in your
> vserver run and you either get a boost or a penalty depending on the
> level of the tokens in the bucket. This doesn't provide guarantees, but
> works well for many typical workloads.
I wonder what is the value of it if it doesn't do guarantees or QoS?
In our experiments with it we failed to observe any fairness. So I
suppose the only goal of this is to make sure that a malicious user
won't consume all the CPU power, right?

> How does your fair scheduler work? Do you just keep a runqueue for each
> vps?
We keep num_online_cpus runqueues per VPS.
The fair scheduler is a kind of SFQ-like algorithm which selects the VPS
to be scheduled, then the standard Linux scheduler selects a process
from that VPS's runqueues to run.
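
Roughly, the two levels fit together like this (illustrative pseudo-C
only, not the actual OpenVZ code; sfq_pick_vps() and std_pick_task()
are hypothetical helpers):

struct vps {
        struct runqueue *rq;    /* num_online_cpus() runqueues, one per cpu */
        unsigned int weight;    /* this VPS's share of the CPU */
};

static struct task_struct *pick_next(int cpu)
{
        /* level 1: SFQ-like fair pick of which VPS runs on this cpu */
        struct vps *vps = sfq_pick_vps(cpu);

        if (!vps)
                return idle_task(cpu);

        /* level 2: the standard scheduler picks a task inside that VPS */
        return std_pick_task(&vps->rq[cpu]);
}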

> To be honest, I've never needed to determine whether its overhead is 1%
> or 0.01%, it would just be a meaningless benchmark anyway :-). I know
> it's "good enough for me".
Sure! We feel the same, but people like numbers :)

Thanks,
Kirill

2006-03-29 11:08:45

by Sam Vilain

[permalink] [raw]
Subject: Re: [Devel] Re: [RFC] Virtualization steps

On Wed, 2006-03-29 at 13:13 +0400, Kirill Korotaev wrote:
> > Well, for instance the fair CPU scheduling overhead is so tiny it may as
> > well not be there in the VServer patch. It's just a per-vserver TBF
> > that feeds back into the priority (and hence timeslice length) of the
> > process. ie, you get "CPU tokens" which deplete as processes in your
> > vserver run and you either get a boost or a penalty depending on the
> > level of the tokens in the bucket. This doesn't provide guarantees, but
> > works well for many typical workloads.
> I wonder what is the value of it if it doesn't do guarantees or QoS?

It still does "QoS". The TBF has a "fill rate", which is basically N
tokens per M jiffies. Then you just set the size of the "bucket", and
the prio bonus given is between -5 (when bucket is full) and +15 (when
bucket is empty). The normal -10 to +10 'interactive' prio bonus is
reduced to -5 to +5 to compensate.

In other words, it's like a global 'nice' across all of the processes in
the vserver.
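
As a very rough sketch of that feedback (illustrative only, not the
actual Linux-VServer code):

struct ctx_tbf {
        int tokens;             /* current fill level */
        int tokens_max;         /* bucket size */
        int fill_rate;          /* tokens added ...            */
        int interval;           /* ... every this many jiffies */
};

/* called every 'interval' jiffies for the context */
static void tbf_refill(struct ctx_tbf *tbf)
{
        tbf->tokens = min(tbf->tokens + tbf->fill_rate, tbf->tokens_max);
}

/* full bucket => -5 (a boost), empty bucket => +15 (a penalty) */
static int tbf_prio_bonus(const struct ctx_tbf *tbf)
{
        return 15 - (20 * tbf->tokens) / tbf->tokens_max;
}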

So, these characteristics do provide some level of guarantees, but not
all that people expect. eg, people want to say "cap usage at 5%", but
as designed the scheduler does not ever prevent runnable processes from
running if the CPUs have nothing better to do, so they think the
scheduler is broken. It is also possible with a fork bomb (assuming the
absence of appropriate ulimits) that you start enough processes that you
don't care that they are all effectively nice +19.

Herbert later made it add some of these guarantees, but I believe there
is a performance impact of some kind.

> In our experiments with it we failed to observe any fairness.

Well, it does not aim to be 'fair', it aims to be useful for allocating
CPU to vservers. ie, if you allocate X% of the CPU in the system to a
vserver, and it uses more, then try to make it use less via priority
penalties - and give performance bonuses to others that are shortchanged
or not using the CPU very much. That's all.

So, if you under- or over-book CPU allocation, it doesn't work. The
idea was that monitoring it could be shipped out to userland. I just
wanted something flexible enough to allow virtually any policy to be put
into place without wasting too many cycles.

> > How does your fair scheduler work? Do you just keep a runqueue for each
> > vps?
> we keep num_online_cpus runqueues per VPS.

Right. I considered that approach but just couldn't be bothered
implementing it, so went with the TBF because it worked and was
lightweight.

> Fairs scheduler is some kind of SFQ like algorithm which selects VPS to
> be scheduled, than standart linux scheduler selects a process in a VPS
> runqueues to run.

Right.

> > To be honest, I've never needed to determine whether its overhead is 1%
> > or 0.01%, it would just be a meaningless benchmark anyway :-). I know
> > it's "good enough for me".
> Sure! We feel the same, but people like numbers :)

Sometimes the answer has to be "mu".

Sam.

2006-03-29 13:48:00

by Herbert Poetzl

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

On Wed, Mar 29, 2006 at 05:39:00AM +0400, Kirill Korotaev wrote:
> Nick,
>
> >>First of all, what it does which low level virtualization can't:
> >>- it allows to run 100 containers on 1GB RAM
> >> (it is called containers, VE - Virtual Environments,
> >> VPS - Virtual Private Servers).
> >>- it has no much overhead (<1-2%), which is unavoidable with hardware
> >> virtualization. For example, Xen has >20% overhead on disk I/O.
> >
> >Are any future hardware solutions likely to improve these problems?
> Probably you are aware of VT-i/VT-x technologies and planned virtualized
> MMU and I/O MMU from Intel and AMD.
> These features should improve the performance somehow, but there is
> still a limit for decreasing the overhead, since at least disk, network,
> video and such devices should be emulated.
>
> >>OS kernel virtualization
> >>~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >Is this considered secure enough that multiple untrusted VEs are run
> >on production systems?
> it is secure enough. What makes it secure? In general:
> - virtualization, which makes resources private
> - resource control, which makes VE to be limited with its usages
> In more technical details virtualization projects make user access (and
> capabilities) checks stricter. Moreover, OpenVZ is using "denied by
> default" approach to make sure it is secure and VE users are not allowed
> something else.
>
> Also, about 2-3 month ago we had a security review of OpenVZ project
> made by Solar Designer. So, in general such virtualization approach
> should be not less secure than VM-like one. VM core code is bigger and
> there is enough chances for bugs there.
>
> >What kind of users want this, who can't use alternatives like real
> >VMs?
> Many companies, just can't share their names. But in general no
> enterprise and hosting companies need to run different OSes on the same
> machine. For them it is quite natural to use N machines for Linux and M
> for Windows. And since VEs are much more lightweight and easier to work
> with, they like it very much.
>
> Just for example, OpenVZ core is running more than 300,000 VEs worldwide.

not bad, how did you get to those numbers?
and, more important, how many of those are actually OpenVZ?
(compared to Virtuozzo(tm))

best,
Herbert

> Thanks,
> Kirill

2006-03-29 13:45:26

by Herbert Poetzl

[permalink] [raw]
Subject: Re: [Devel] Re: [RFC] Virtualization steps

On Wed, Mar 29, 2006 at 01:13:14PM +0400, Kirill Korotaev wrote:
> Sam,
>
> >>Why do you think it can not be measured? It either can be, or it is too
> >>low to be measured reliably (a fraction of a per cent or so).
> >
> >Well, for instance the fair CPU scheduling overhead is so tiny it may as
> >well not be there in the VServer patch. It's just a per-vserver TBF
> >that feeds back into the priority (and hence timeslice length) of the
> >process. ie, you get "CPU tokens" which deplete as processes in your
> >vserver run and you either get a boost or a penalty depending on the
> >level of the tokens in the bucket. This doesn't provide guarantees, but
> >works well for many typical workloads.

> I wonder what is the value of it if it doesn't do guarantees or QoS?
> In our experiments with it we failed to observe any fairness.

probably a misconfiguration on your side ...

> So I suppose the only goal of this is too make sure that maliscuios
> user want consume all the CPU power, right?

the currently used scheduler extensions do much
more than that; basically all kinds of scenarios
can be satisfied with them, at almost no overhead

> >How does your fair scheduler work?
> >Do you just keep a runqueue for each vps?
> we keep num_online_cpus runqueues per VPS.

> Fairs scheduler is some kind of SFQ like algorithm which selects VPS
> to be scheduled, than standart linux scheduler selects a process in a
> VPS runqueues to run.
>
> >To be honest, I've never needed to determine whether its overhead is 1%
> >or 0.01%, it would just be a meaningless benchmark anyway :-). I know
> >it's "good enough for me".

> Sure! We feel the same, but people like numbers :)

well, do you have numbers?

best,
Herbert

> Thanks,
> Kirill

2006-03-29 14:48:09

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [Devel] Re: [RFC] Virtualization steps

>> I wonder what is the value of it if it doesn't do guarantees or QoS?
>> In our experiments with it we failed to observe any fairness.
>
> probably a misconfiguration on your side ...
maybe you can provide some instructions on which kernel version to use
and how to set up the following scenario:
a 2-CPU box, 3 VPSs which should run with a 1:2:3 ratio of CPU usage.

> well, do you have numbers?
just run the above scenario with one busy loop inside each VPS. I was
not able to observe a 1:2:3 CPU distribution. Other scenarios also
didn't show me any fairness. The results varied: sometimes 1:1:2,
sometimes others.

Thanks,
Kirill

2006-03-29 17:29:04

by Herbert Poetzl

[permalink] [raw]
Subject: Re: [Devel] Re: [RFC] Virtualization steps

On Wed, Mar 29, 2006 at 06:47:58PM +0400, Kirill Korotaev wrote:
> >>I wonder what is the value of it if it doesn't do guarantees or QoS?
> >>In our experiments with it we failed to observe any fairness.
> >
> >probably a misconfiguration on your side ...
> maybe you can provide some instructions on which kernel version to use
> and how to setup the following scenario: 2CPU box. 3 VPSs which should
> run with 1:2:3 ratio of CPU usage.

that is quite simple: you enable the Hard CPU Scheduler
and select the Idle Time Skip, then you set the following
token bucket values depending on what you mean by
'should run with 1:2:3 ratio of CPU usage':

a) a guaranteed maximum of 16.7%, 33.3% and 50.0%

b) a fair sharing according to 1:2:3

c) a guaranteed minimum of 16.7%, 33.3% and 50.0%
with a fair sharing of 1:2:3 for the rest ...


for all cases you would set:
(adjust according to your reserve/boost likings)

VPS1,2,3: tokens_min = 50, tokens_max = 500
interval = interval2 = 6

a) VPS1: rate = 1, hard, noidleskip
VPS2: rate = 2, hard, noidleskip
VPS3: rate = 3, hard, noidleskip

b) VPS1: rate2 = 1, hard, idleskip
VPS2: rate2 = 2, hard, idleskip
VPS3: rate2 = 3, hard, idleskip

c) VPS1: rate = rate2 = 1, hard, idleskip
VPS2: rate = rate2 = 2, hard, idleskip
VPS3: rate = rate2 = 3, hard, idleskip

of course, adjusting rate/interval while keeping
the ratio might help, depending on the guest load
(i.e. more batch-type load or more interactive stuff)

of course, you can do those adjustments per CPU, so if
you for example want to assign one CPU to the third
guest, you can do that easily too ...

> >well, do you have numbers?
> just run the above scenario with one busy loop inside each VPS. I was
> not able to observe 1:2:3 cpu distribution. Other scenarios also didn't
> showed my any fairness. The results were different. Sometimes 1:1:2,
> sometimes others.

what was your setup?

best,
Herbert

> Thanks,
> Kirill

2006-03-29 18:19:18

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

* Sam Vilain ([email protected]) wrote:
> This raises a very interesting question. All those LSM globals,
> shouldn't those be virtualisable, too? After all, isn't it natural to
> want to apply a different security policy to different sets of processes?

Which globals? Policy could be informed by relevant containers.

thanks,
-chris

2006-03-29 20:30:51

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

On Tue, 2006-03-28 at 12:51 +0400, Kirill Korotaev wrote:
> Eric, we have a GIT repo on openvz.org already:
> http://git.openvz.org

Git is great for getting patches and lots of updates out, but I'm not
sure it is ideal for what we're trying to do. We'll need things reviewed
at each step, especially because we're going to be touching so much
common code.

I'd guess a set of quilt (or patch-utils) patches is probably best,
especially if we're trying to get stuff into -mm first.

-- Dave

2006-03-29 20:49:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Dave Hansen <[email protected]> writes:

> On Tue, 2006-03-28 at 12:51 +0400, Kirill Korotaev wrote:
>> Eric, we have a GIT repo on openvz.org already:
>> http://git.openvz.org
>
> Git is great for getting patches and lots of updates out, but I'm not
> sure it is idea for what we're trying to do. We'll need things reviewed
> at each step, especially because we're going to be touching so much
> common code.
>
> I'd guess set of quilt (or patch-utils) patches is probably best,
> especially if we're trying to get stuff into -mm first.

Git is as good at holding patches as quilt. It isn't quite as
good at working with them as quilt but in the long term that is
fixable.

The important point is that we get a collection of patches that
we can all agree to, and that we publish it.

At this point it sounds like each group will happily publish the
patches, and that might not be a bad double check, on agreement.

Then we have someone send them to Andrew. Or we have a quilt or
a git tree that Andrew knows he can pull from.

But we do need lots of review so distribution to Andrew and the other
kernel developers as plain patches appears to be the healthy choice.
I'm going to go bury my head in the sand and finish my OLS paper now.


Eric

2006-03-29 21:39:13

by Sam Vilain

[permalink] [raw]
Subject: Re: [Devel] Re: [RFC] Virtualization steps

On Wed, 2006-03-29 at 18:47 +0400, Kirill Korotaev wrote:
> >> I wonder what is the value of it if it doesn't do guarantees or QoS?
> >> In our experiments with it we failed to observe any fairness.
> >
> > probably a misconfiguration on your side ...
> maybe you can provide some instructions on which kernel version to use
> and how to setup the following scenario:
> 2CPU box. 3 VPSs which should run with 1:2:3 ratio of CPU usage.

Ok, I'll call those three VPSes fast, faster and fastest.

"fast" : fill rate 1, interval 3
"faster" : fill rate 2, interval 3
"fastest" : fill rate 3, interval 3

That all adds up to a fill rate of 6 with an interval of 3, but that is
right because with two processors you have 2 tokens to allocate per
jiffie. Also set the bucket size to something of the order of HZ.

You can watch the priority of the processes within each vserver jump up
and down with `vtop' during testing. Also you should be able to watch
the vserver's bucket fill and empty in /proc/virtual/XXX/sched (IIRC).

> > well, do you have numbers?
> just run the above scenario with one busy loop inside each VPS. I was
> not able to observe 1:2:3 cpu distribution. Other scenarios also didn't
> showed my any fairness. The results were different. Sometimes 1:1:2,
> sometimes others.

I mentioned this earlier, but for the sake of the archives I'll repeat -
if you are running with any of the buckets on empty, the scheduler is
imbalanced and therefore not going to provide the exact distribution you
asked for.

However with a single busy loop in each vserver I'd expect the above to
yield roughly 100% for fastest, 66% for faster and 33% for fast, within
5 seconds or so of starting those processes (assuming you set a bucket
size of HZ).

Sam.

2006-03-29 22:37:05

by Sam Vilain

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Chris Wright wrote:

>* Sam Vilain ([email protected]) wrote:
>
>
>>This raises a very interesting question. All those LSM globals,
>>shouldn't those be virtualisable, too? After all, isn't it natural to
>>want to apply a different security policy to different sets of processes?
>>
>>
>
>Which globals? Policy could be informed by relevant containers.
>
>

extern struct security_operations *security_ops; in
include/linux/security.h is the global I refer to.

There is likely to be some contention there between the security folk
who probably won't like the idea that your security module can be
different for different processes, and the people who want to provide
access to security modules on the systems they want to host or consolidate.

Sam.

2006-03-29 22:44:58

by Sam Vilain

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Dave Hansen wrote:

>On Tue, 2006-03-28 at 12:51 +0400, Kirill Korotaev wrote:
>
>
>>Eric, we have a GIT repo on openvz.org already:
>>http://git.openvz.org
>>
>>
>
>Git is great for getting patches and lots of updates out, but I'm not
>sure it is idea for what we're trying to do. We'll need things reviewed
>at each step, especially because we're going to be touching so much
>common code.
>
>I'd guess set of quilt (or patch-utils) patches is probably best,
>especially if we're trying to get stuff into -mm first.
>
>

The apparent problem is that the git commit history on a branch cannot
be unwound. However, that is fine - just make another branch and put
your new sequence of commits there.

Tools exist that allow you to wind and unwind the commit history
arbitrarily to revise patches before they are published on a branch that
you don't want to just delete. For instance:

stacked git

http://www.procode.org/stgit/

or patchy git

http://www.spearce.org/2006/02/pg-version-0111-released.html

are examples of such tools.

I recommend starting with stacked git, it really is nice.

Sam.

2006-03-29 22:51:30

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

* Sam Vilain ([email protected]) wrote:
> extern struct security_operations *security_ops; in
> include/linux/security.h is the global I refer to.

OK, I figured that's what you meant. The top-level ops are similar in
nature to inode_ops in that there's not a real compelling reason to make
them per process. The process context is (usually) available, and more
importantly, the object whose access is being mediated is readily
available with its security label.

> There is likely to be some contention there between the security folk
> who probably won't like the idea that your security module can be
> different for different processes, and the people who want to provide
> access to security modules on the systems they want to host or consolidate.

I think the current setup would work fine. It's less likely that we'd
want a separate security module for each container than simply policy
that is container aware.

thanks,
-chris

2006-03-29 23:01:47

by Sam Vilain

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Chris Wright wrote:

>* Sam Vilain ([email protected]) wrote:
>
>
>>extern struct security_operations *security_ops; in
>>include/linux/security.h is the global I refer to.
>>
>>
>
>OK, I figured that's what you meant. The top-level ops are similar in
>nature to inode_ops in that there's not a real compelling reason to make
>them per process. The process context is (usually) available, and more
>importantly, the object whose access is being mediated is readily
>available with its security label.
>
>

AIUI inode_ops are not globals, they are per FS.

>>There is likely to be some contention there between the security folk
>>who probably won't like the idea that your security module can be
>>different for different processes, and the people who want to provide
>>access to security modules on the systems they want to host or consolidate.
>>
>>
>
>I think the current setup would work fine. It's less likely that we'd
>want a separate security module for each container than simply policy
>that is container aware.
>
>

That to me reads as:

"To avoid having to consider making security_ops non-global we will
force security modules to be container aware".

It also means you could not mix security modules that affect the same
operation in different containers on a system. Personally I don't care,
I don't use them. But perhaps this inflexibility will bring problems
later for some.

I think it's a design decision that is not completely closed, but the
inertia is certainly in favour of your position.

Sam.

2006-03-29 23:12:00

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

* Sam Vilain ([email protected]) wrote:
> AIUI inode_ops are not globals, they are per FS.

Heh, yes really bad example.

> That to me reads as:
>
> "To avoid having to consider making security_ops non-global we will
> force security modules to be container aware".

Not my intention. Rather, I think from a security standpoint there's
sanity in controlling things with a single policy. I'm thinking of
containers as a simple and logical extension of roles. Point being,
the per-object security label can easily include a notion of container.

> It also means you could not mix security modules that affect the same
> operation different containers on a system. Personally I don't care, I
> don't use them. But perhaps this inflexibility will bring problems later
> for some.

No issue with addressing these issues as they come.

thanks,
-chris

2006-03-29 23:18:13

by Sam Vilain

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Chris Wright wrote:

>Not my intention. Rather, I think from a security standpoint there's
>sanity in controlling things with a single policy.
>

Yes, certainly. Providing the features to the users in a different way
is a pragmatic alternative to trying to make sure the contained system
gets to use all the same kernel API calls it could without the
virtualisation. The only people who won't like that are the people
consolidating, so they still have to use Xen.

>I'm thinking of
>containers as a simple and logical extension of roles. Point being,
>the per-object security label can easily include notion of container.
>
>

If it fits the model well, sounds good.

Sam.

2006-03-29 23:27:13

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

* Sam Vilain ([email protected]) wrote:
> Yes, certainly. Providing the features to the users in a different way
> is a pragmatic alternative to trying to make sure the contained system
> gets to use all the same kernel API calls it could without the
> virtualisation. The only people who won't like that is are people
> consolidating, so they still have to use Xen.

Works for me ;-)
-chris

2006-03-30 01:04:49

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Chris Wright <[email protected]> writes:

> * Sam Vilain ([email protected]) wrote:
>> extern struct security_operations *security_ops; in
>> include/linux/security.h is the global I refer to.
>
> OK, I figured that's what you meant. The top-level ops are similar in
> nature to inode_ops in that there's not a real compelling reason to make
> them per process. The process context is (usually) available, and more
> importantly, the object whose access is being mediated is readily
> available with its security label.
>
>> There is likely to be some contention there between the security folk
>> who probably won't like the idea that your security module can be
>> different for different processes, and the people who want to provide
>> access to security modules on the systems they want to host or consolidate.
>
> I think the current setup would work fine. It's less likely that we'd
> want a separate security module for each container than simply policy
> that is container aware.

I think what we really want are stacked security modules.

I have not yet fully digested all of the requirements for multiple servers
on the same machine but increasingly the security aspects look
like a job for a security module.

Enforcing policies like container A cannot send signals to processes
in container B or something like that.

Then inside of each container we could have the code that implements
a container's internal security policy.

At least one implementation, Linux Jails by Serge E. Hallyn, was done
completely with security modules, and the code was pretty minimal.


Eric

2006-03-30 01:35:22

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

* Eric W. Biederman ([email protected]) wrote:
> Chris Wright <[email protected]> writes:
>
> > * Sam Vilain ([email protected]) wrote:
> >> extern struct security_operations *security_ops; in
> >> include/linux/security.h is the global I refer to.
> >
> > OK, I figured that's what you meant. The top-level ops are similar in
> > nature to inode_ops in that there's not a real compelling reason to make
> > them per process. The process context is (usually) available, and more
> > importantly, the object whose access is being mediated is readily
> > available with its security label.
> >
> >> There is likely to be some contention there between the security folk
> >> who probably won't like the idea that your security module can be
> >> different for different processes, and the people who want to provide
> >> access to security modules on the systems they want to host or consolidate.
> >
> > I think the current setup would work fine. It's less likely that we'd
> > want a separate security module for each container than simply policy
> > that is container aware.
>
> I think what we really want are stacked security modules.

I'm not convinced we need a new module for each container. The module
is a policy enforcement engine, so give it a container aware policy and
you shouldn't need another module.

> I have not yet fully digested all of the requirements for multiple servers
> on the same machine but increasingly the security aspects look
> like a job for a security module.

There are two primary security areas here. One is container level
isolation, which is the job of the container itself. Security modules
can effectively introduce containers, but w/out any notion of a virtual
environment (easy example, uts). With namespaces you get isolation w/out
any formal access control check, you simply can't find objects that aren't
in your namespace. The second is object level isolation (objects such
as files, processes, etc), standard access control checks that should
happen within a container. This can be handled quite naturally by the
security module.

> Enforcing policies like container A cannot send signals to processes
> in container B or something like that.

This is a question of visibility. One method of containment is via
LSM. This checks all object access against a label that's aware of
container ids to disallow inter-container, well, anything. However,
if a namespace would mean you simply can't find those other processes,
then there's no need for the LSM side except for intra-container.

> Then inside of each container we could have the code that implements
> a containers internal security policy.

Right, and that's doable as a single top-level policy. It's a bit
interesting when you want to be able to specify policy from within a
container (e.g. virtual hosting), granted.

> At least one implementation Linux Jails by Serge E. Hallyn was done completely
> with security modules, and the code was pretty minimal.

Yes, although the networking area was something that looked better done
via namespaces (at least that's my recollection of my conversations with
Serge on that one a few years back).

thanks,
-chris

2006-03-30 01:56:44

by David Lang

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

On Wed, 29 Mar 2006, Chris Wright wrote:

>
> * Eric W. Biederman ([email protected]) wrote:
>> Chris Wright <[email protected]> writes:
>>
>>> * Sam Vilain ([email protected]) wrote:
>>>> extern struct security_operations *security_ops; in
>>>> include/linux/security.h is the global I refer to.
>>>
>>> OK, I figured that's what you meant. The top-level ops are similar in
>>> nature to inode_ops in that there's not a real compelling reason to make
>>> them per process. The process context is (usually) available, and more
>>> importantly, the object whose access is being mediated is readily
>>> available with its security label.
>>>
>>>> There is likely to be some contention there between the security folk
>>>> who probably won't like the idea that your security module can be
>>>> different for different processes, and the people who want to provide
>>>> access to security modules on the systems they want to host or consolidate.
>>>
>>> I think the current setup would work fine. It's less likely that we'd
>>> want a separate security module for each container than simply policy
>>> that is container aware.
>>
>> I think what we really want are stacked security modules.
>
> I'm not convinced we need a new module for each container. The module
> is a policy enforcement engine, so give it a container aware policy and
> you shouldn't need another module.

what if the people administering the container are different from the
people administering the host?

in that case the people working in the container want to be able to
implement and change their own policy, and the people working on the
host don't want to have to implement changes to their main policy config
(with all the auditing that would be involved) every time a container
wants to change its internal policy.

I can definitely see where a container-aware policy on the master would
be useful, but I can also see where the ability to nest separate
policies would be useful.

David Lang

2006-03-30 02:03:53

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

* David Lang ([email protected]) wrote:
> what if the people administering the container are different from the
> people administering the host?

Yes, I alluded to that.

> in that case the people working in the container want to be able to
> implement and change their own policy, and the people working on the host
> don't want to have to implement changes to their main policy config (wtih
> all the auditing that would be involved with it) every time a container
> wants to change it's internal policy.

*nod*

> I can definantly see where a container aware policy on the master would be
> useful, but I can also see where the ability to nest seperate policies
> would be useful.

This is all fine. The question is whether this is a policy management
issue or a kernel infrastructure issue. So far, it's not clear that this
really necessitates kernel infrastructure changes to support container
aware policies to be loaded by physical host admin/owner or the virtual
host admin. The place where it breaks down is if each virtual host
wants not only to control its own policy, but also its security model.
Then we are left with stacking modules or heavier isolation (as in Xen).

thanks,
-chris

2006-03-30 02:24:55

by Sam Vilain

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Eric W. Biederman wrote:

>I think what we really want are stacked security modules.
>
>I have not yet fully digested all of the requirements for multiple servers
>on the same machine but increasingly the security aspects look
>like a job for a security module.
>
>Enforcing policies like container A cannot send signals to processes
>in container B or something like that.
>
>

We could even end up making security modules to implement standard unix
security. ie, which processes can send any signal to other processes.
Why hardcode the (!sender.user_id || (sender.user_id == target.user_id)
) rule at all? That rule should be the default rule in a security module
chain.
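
For instance, expressed as a (hypothetical, much simplified) LSM
task_kill hook -- the real check_kill_permission() also looks at
euid/suid:

static int default_task_kill(struct task_struct *p,
                             struct siginfo *info, int sig)
{
        /* root may signal anyone; otherwise sender and target uids match */
        if (current->uid == 0 || current->uid == p->uid)
                return 0;
        return -EPERM;
}

static struct security_operations default_unix_ops = {
        .task_kill = default_task_kill,
};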

I just think that doing it this way is the wrong way around, but I guess
I'm hardly qualified to speak on this. Aren't security modules supposed
to be for custom security policy, not standard system semantics ?

Sam.

2006-03-30 02:49:54

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Chris Wright <[email protected]> writes:


>> At least one implementation Linux Jails by Serge E. Hallyn was done completely
>> with security modules, and the code was pretty minimal.
>
> Yes, although the networking area was something that looked better done
> via namespaces (at least that's my recollection of my conversations with
> Serge on that one a few years back).

For general networking yes the namespace flavor seems to be the sane
way to do it.

As I currently understand the problem, everything goes along nicely,
with nothing really special needed, until you start asking the question:
how do I implement a root user with uid 0 who does not own the
machine? When you start asking that question is when the creepy
crawlies come out.

On most virtual filesystems the default owner of files is uid 0.
Additional privilege checks are not applied. Writing to those
files could potentially have global effect.

It is completely unclear how permission checks should work
between two processes in different uid namespaces, especially since
there are cases where you do want interactions.

If every guest/container/jail is configured so that uids with the same
value mean the same user, there are no security issues even when they
interact because the isolation is not perfect. So my gut feeling is to
postpone a bunch of these problems and say that making uids non-global
is a security module issue.

Eric

2006-03-30 03:02:32

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Sam Vilain <[email protected]> writes:

>
> We could even end up making security modules to implement standard unix
> security. ie, which processes can send any signal to other processes.
> Why hardcode the (!sender.user_id || (sender.user_id == target.user_id)
> ) rule at all? That rule should be the default rule in a security module
> chain.
>
> I just think that doing it this way is the wrong way around, but I guess
> I'm hardly qualified to speak on this. Aren't security modules supposed
> to be for custom security policy, not standard system semantics ?

It is simply my contention that you get into at least a semi-custom
configuration when you have multiple users with the same uid,
especially when that uid == 0.

For guests you have to change the rule about what permissions
a setuid root executable gets or else it will have CAP_SYS_MKNOD,
and CAP_RAW_IO. (Unless I didn't read that code right).

Plus all of the /proc and sysfs issues.

Now perhaps we can sit down and figure out how to get completely
isolated and only allow a new uid namespace when that is
the case, but that doesn't sound too interesting.

So, at least until I can imagine what the semantics of a new uid
namespace are when we don't have perfect isolation, that feels
like a job for a security module.

Eric

2006-03-30 06:29:56

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Eric W. Biederman wrote:

>Nick Piggin <[email protected]> writes:
>
>
>>I don't think I could give a complete answer...
>>I guess it could be stated as the increase in the complexity of
>>the rest of the code for someone who doesn't know anything about
>>the virtualization implementation.
>>
>>Completely non intrusive is something like 2 extra function calls
>>to/from generic code, changes to data structures are transparent
>>(or have simple wrappers), and there is no shared locking or data
>>with the rest of the kernel. And it goes up from there.
>>
>>Anyway I'm far from qualified... I just hope that with all the
>>work you guys are putting in that you'll be able to justify it ;)
>>
>
>As I have been able to survey the work, the most common case
>is replacing a global variable with a variable we lookup via
>current.
>
>That plus using the security module infrastructure you can
>implement the semantics pretty in a straight forward manner.
>
>The only really intrusive part is that because we tickle the
>code differently we see a different set of problems. Such
>as the mess that is the proc and sysctl code, and the lack of
>good resource limits.
>
>But none of that is inherent to the problem it is just when
>you use the kernel harder and have more untrusted users you
>see a different set of problems.
>
>

Yes... about that; if/when namespaces get into the kernel, you guys
are going to start pushing all sorts of per-container resource
control, right? Or will you be happy to leave most of that to VMs?


2006-03-30 10:32:07

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Nick Piggin <[email protected]> writes:

> Yes... about that; if/when namespaces get into the kernel, you guys
> are going to start pushing all sorts of per-container resource
> control, right? Or will you be happy to leave most of that to VMs?

That will certainly be an aspect of it, and that is one of the
pieces of the ongoing discussion. The out of tree implementations
already do this.

What flavor of resource limits these will be I don't know. That
is a part of the discussion we are just coming to now.


Eric

2006-03-30 13:29:51

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Quoting Chris Wright ([email protected]):
> * Eric W. Biederman ([email protected]) wrote:
> > At least one implementation Linux Jails by Serge E. Hallyn was done completely
> > with security modules, and the code was pretty minimal.
>
> Yes, although the networking area was something that looked better done
> via namespaces (at least that's my recollection of my conversations with
> Serge on that one a few years back).

Yes, namespaces would be better - just as the file system isolation was
moved from a "strong chroot" approach to using pivot-root. Though note
that vserver still uses basically the method that bsdjail uses, and my
two attempts at getting network namespaces considered in the kernel so
far were dismal failures. Hopefully this time we've got some better,
more network-savvy minds on the task :)

-serge

2006-03-30 13:38:57

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

"Serge E. Hallyn" <[email protected]> writes:

> Quoting Chris Wright ([email protected]):
>> * Eric W. Biederman ([email protected]) wrote:
>> > At least one implementation Linux Jails by Serge E. Hallyn was done
> completely
>> > with security modules, and the code was pretty minimal.
>>
>> Yes, although the networking area was something that looked better done
>> via namespaces (at least that's my recollection of my conversations with
>> Serge on that one a few years back).
>
> Yes, namespaces would be better - just as the file system isolation was
> moved from a "strong chroot" approach to using pivot-root. Though note
> that vserver still uses basically the method that bsdjail uses, and my
> two attempts at getting network namespaces considered in the kernel so
> far were dismal failures. Hopefully this time we've got some better,
> more network-savvy minds on the task :)

Any pointers to those old discussions?

I'm curious why your attempts at getting network namespaces in were
dismal failures. Everyone ignored the patch?

Eric

2006-03-30 13:51:56

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

ok. This is also easier for us, as it is the usual way of doing things
in OpenVZ. We will see...

> On Tue, 2006-03-28 at 12:51 +0400, Kirill Korotaev wrote:
>> Eric, we have a GIT repo on openvz.org already:
>> http://git.openvz.org
>
> Git is great for getting patches and lots of updates out, but I'm not
> sure it is idea for what we're trying to do. We'll need things reviewed
> at each step, especially because we're going to be touching so much
> common code.
>
> I'd guess set of quilt (or patch-utils) patches is probably best,
> especially if we're trying to get stuff into -mm first.
>
> -- Dave
>
>

2006-03-30 14:32:29

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Quoting Chris Wright ([email protected]):
> * David Lang ([email protected]) wrote:
> > what if the people administering the container are different from the
> > people administering the host?
>
> Yes, I alluded to that.
>
> > in that case the people working in the container want to be able to
> > implement and change their own policy, and the people working on the host
> > don't want to have to implement changes to their main policy config (wtih
> > all the auditing that would be involved with it) every time a container
> > wants to change it's internal policy.
>
> *nod*
>
> > I can definantly see where a container aware policy on the master would be
> > useful, but I can also see where the ability to nest seperate policies
> > would be useful.
>
> This is all fine. The question is whether this is a policy management
> issue or a kernel infrastructure issue. So far, it's not clear that this
> really necessitates kernel infrastructure changes to support container
> aware policies to be loaded by physical host admin/owner or the virtual
> host admin. The place where it breaks down is if each virtual host
> wants not only to control its own policy, but also its security model.

What do you define as 'policy', and how is it different from the
security model?

> Then we are left with stacking modules or heavier isolation (as in Xen).

Hmm, talking about 'container' in this sense is confusing, because we're
not yet clear on what a container is.

So I'm trying to get a handle on what we really want to do.

Talking about namespaces is tricky. For instance if I do
clone(CLONE_NEWNS), the new process is in a new fs namespace, but the fs
objects are still the same, so if it loads an LSM, then perhaps at most
the new process should only control mount activities in its own
namespace.
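
For reference, a minimal userspace illustration of that CLONE_NEWNS case
(needs CAP_SYS_ADMIN; mounts the child makes afterwards are invisible to
the parent, but the underlying fs objects themselves are shared):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

static char stack[64 * 1024];

static int child(void *arg)
{
        /* mounts made here affect only this private mount namespace */
        printf("child: running in a private mount namespace\n");
        return 0;
}

int main(void)
{
        pid_t pid = clone(child, stack + sizeof(stack),
                          CLONE_NEWNS | SIGCHLD, NULL);

        if (pid < 0) {
                perror("clone");
                exit(1);
        }
        waitpid(pid, NULL, 0);
        return 0;
}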

Frankly I thought, and am still not unconvinced, that containers owned
by someone other than the system owner would/should never want to load
their own LSMs, so that this wasn't a problem. Isolation, as Chris has
mentioned, would be taken care of by the very nature of namespaces.

There are of course two alternatives... First, we might want to allow the
machine admin to insert per-container/per-namespace LSMs. To support
this case, we would need a way for the admin to tag a container in some
way, identifying it as being subject to a particular set of security_ops.

Second, we might want container admins to insert LSMs. In addition to
a straightforward way of tagging subjects/objects with their container,
we'd need to implement at least permissions for "may insert global LSM",
"may insert container LSM", and "may not insert any LSM." This might be
sufficient if we trust userspace to always create full containers.
Otherwise we might want to support meta-policy along the lines of "may
authorize ptrace and mount hooks only", or even "not subject to the
global inode_permission hook, and may create its own." (yuck)

But so much of this depends on how the namespaces/containers end up
being implemented...

-serge

2006-03-30 14:55:50

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Quoting Eric W. Biederman ([email protected]):
> "Serge E. Hallyn" <[email protected]> writes:
>
> > Quoting Chris Wright ([email protected]):
> >> * Eric W. Biederman ([email protected]) wrote:
> >> > At least one implementation Linux Jails by Serge E. Hallyn was done
> >> > completely with security modules, and the code was pretty minimal.
> >>
> >> Yes, although the networking area was something that looked better done
> >> via namespaces (at least that's my recollection of my conversations with
> >> Serge on that one a few years back).
> >
> > Yes, namespaces would be better - just as the file system isolation was
> > moved from a "strong chroot" approach to using pivot-root. Though note
> > that vserver still uses basically the method that bsdjail uses, and my
> > two attempts at getting network namespaces considered in the kernel so
> > far were dismal failures. Hopefully this time we've got some better,
> > more network-savvy minds on the task :)
>
> Any pointers to those old discussions?

I can only find the one.

http://marc.theaimsgroup.com/?l=linux-netdev&m=109837694221901&w=2

I thought I'd sent one earlier than this too. Maybe I just got ready to
resend a new version, then decided the code quality wasn't worth it.

> I'm curious why your attempts at getting network namespaces in were dismal failures.

Ok, I guess "dismal failure" most aptly applies to the patch itself :)

> Everyone ignored the patch?

Well, there was that. Then I briefly tried to rework the patch, but
just ran out of time, and have kept this on my todo list ever since,
but never really gotten back to it. At last it looks like this may
finally be coming back up.

-serge

2006-03-30 15:30:15

by Herbert Poetzl

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

On Thu, Mar 30, 2006 at 08:32:24AM -0600, Serge E. Hallyn wrote:
> Quoting Chris Wright ([email protected]):
> > * David Lang ([email protected]) wrote:
> > > what if the people administering the container are different from the
> > > people administering the host?
> >
> > Yes, I alluded to that.
> >
> > > in that case the people working in the container want to be able to
> > > implement and change their own policy, and the people working on the host
> > > don't want to have to implement changes to their main policy config (with
> > > all the auditing that would be involved with it) every time a container
> > > wants to change its internal policy.
> >
> > *nod*
> >
> > > I can definitely see where a container aware policy on the master would be
> > > useful, but I can also see where the ability to nest separate policies
> > > would be useful.
> >
> > This is all fine. The question is whether this is a policy management
> > issue or a kernel infrastructure issue. So far, it's not clear that this
> > really necessitates kernel infrastructure changes to support container
> > aware policies to be loaded by physical host admin/owner or the virtual
> > host admin. The place where it breaks down is if each virtual host
> > wants not only to control its own policy, but also its security model.
>
> What do you define as 'policy', and how is it different from the
> security model?
>
> > Then we are left with stacking modules or heavier isolation (as in Xen).
>
> Hmm, talking about 'container' in this sense is confusing, because we're
> not yet clear on what a container is.
>
> So I'm trying to get a handle on what we really want to do.
>
> Talking about namespaces is tricky. For instance if I do
> clone(CLONE_NEWNS), the new process is in a new fs namespace, but the
> fs objects are still the same, so if it loads an LSM, then perhaps at
> most the new process should only control mount activities in its own
> namespace.
>
> Frankly I thought, and am still not unconvinced, that containers owned
> by someone other than the system owner would/should never want to load
> their own LSMs, so that this wasn't a problem. Isolation, as Chris has
> mentioned, would be taken care of by the very nature of namespaces.
>
> There are of course two alternatives... First, we might want to
> allow the machine admin to insert per-container/per-namespace LSMs.
> To support this case, we would need a way for the admin to tag a
> container some way identifying it as being subject to a particular set
> of security_ops.
>
> Second, we might want container admins to insert LSMs. In addition
> to a straightforward way of tagging subjects/objects with their
> container, we'd need to implement at least permissions for "may insert
> global LSM", "may insert container LSM", and "may not insert any LSM."
> This might be sufficient if we trust userspace to always create full
> containers. Otherwise we might want to support meta-policy along the
> lines of "may authorize ptrace and mount hooks only", or even "not
> subject to the global inode_permission hook, and may create its own."

sorry folks, I don't think that we _ever_ want container
root to be able to load any kernel modules at any time
without having CAP_SYS_ADMIN or so, in which case the
modules can be global as well ... otherwise we end up
as a bad Xen imitation with a lot of security issues,
where it should be a security enhancement ...

best,
Herbert

> (yuck)
>
> But so much of this depends on how the namespaces/containers end up
> being implemented...
>
> -serge

2006-03-30 16:03:20

by Stephen Smalley

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

On Thu, 2006-03-30 at 08:32 -0600, Serge E. Hallyn wrote:
> Frankly I thought, and am still not unconvinced, that containers owned
> by someone other than the system owner would/should never want to load
> their own LSMs, so that this wasn't a problem. Isolation, as Chris has
> mentioned, would be taken care of by the very nature of namespaces.
>
> There are of course two alternatives... First, we might want to allow the
> machine admin to insert per-container/per-namespace LSMs. To support
> this case, we would need a way for the admin to tag a container some way
> identifying it as being subject to a particular set of security_ops.
>
> Second, we might want container admins to insert LSMs. In addition to
> a straightforward way of tagging subjects/objects with their container,
> we'd need to implement at least permissions for "may insert global LSM",
> "may insert container LSM", and "may not insert any LSM." This might be
> sufficient if we trust userspace to always create full containers.
> Otherwise we might want to support meta-policy along the lines of "may
> authorize ptrace and mount hooks only", or even "not subject to the
> global inode_permission hook, and may create its own." (yuck)
>
> But so much of this depends on how the namespaces/containers end up
> being implemented...

FWIW, SELinux now has a notion of a type hierarchy in its policy, so the
root admin can carve out a portion of the policy space and allow less
privileged admins to then define sub-types that are strictly constrained
by what was allowed to the parent type by the root admin. This is
handled in userspace, with the policy mediation performed by a userspace
agent (daemon, policy management server), which then becomes the focal
point for all policy loading.

--
Stephen Smalley
National Security Agency

2006-03-30 16:15:47

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Quoting Stephen Smalley ([email protected]):
> On Thu, 2006-03-30 at 08:32 -0600, Serge E. Hallyn wrote:
> > Frankly I thought, and am still not unconvinced, that containers owned
> > by someone other than the system owner would/should never want to load
> > their own LSMs, so that this wasn't a problem. Isolation, as Chris has
> > mentioned, would be taken care of by the very nature of namespaces.
> >
> > There are of course two alternatives... First, we might want to allow the
> > machine admin to insert per-container/per-namespace LSMs. To support
> > this case, we would need a way for the admin to tag a container some way
> > identifying it as being subject to a particular set of security_ops.
> >
> > Second, we might want container admins to insert LSMs. In addition to
> > a straightforward way of tagging subjects/objects with their container,
> > we'd need to implement at least permissions for "may insert global LSM",
> > "may insert container LSM", and "may not insert any LSM." This might be
> > sufficient if we trust userspace to always create full containers.
> > Otherwise we might want to support meta-policy along the lines of "may
> > authorize ptrace and mount hooks only", or even "not subject to the
> > global inode_permission hook, and may create its own." (yuck)
> >
> > But so much of this depends on how the namespaces/containers end up
> > being implemented...
>
> FWIW, SELinux now has a notion of a type hierarchy in its policy, so the
> root admin can carve out a portion of the policy space and allow less
> privileged admins to then define sub-types that are strictly constrained
> by what was allowed to the parent type by the root admin. This is
> handled in userspace, with the policy mediation performed by a userspace
> agent (daemon, policy management server), which then becomes the focal
> point for all policy loading.

Yes, my first response (which I cancelled) mentioned this as a possible
solution.

The global admin could assign certain max privileges to 'container_b'.
The admin in container_b could create container_b.root_t,
container_b.user_t, etc, which would be limited by the container_b
max perms.

Presumably the policy daemon, running in container 0, could accept input
over a socket from container_b, labeled appropriately, automatically
ensuring that all types created by the policy in container_b are
prefixed with container_b, and applying the obvious restrictions.

Or something like that :)

-serge

2006-03-30 16:43:41

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Quoting Herbert Poetzl ([email protected]):
> On Thu, Mar 30, 2006 at 08:32:24AM -0600, Serge E. Hallyn wrote:
> > Quoting Chris Wright ([email protected]):
> > > * David Lang ([email protected]) wrote:
> > > > what if the people administering the container are different from the
> > > > people administering the host?
> > >
> > > Yes, I alluded to that.
> > >
> > > > in that case the people working in the container want to be able to
> > > > implement and change their own policy, and the people working on the host
> > > > don't want to have to implement changes to their main policy config (with
> > > > all the auditing that would be involved with it) every time a container
> > > > wants to change its internal policy.
> > >
> > > *nod*
> > >
> > > > I can definitely see where a container aware policy on the master would be
> > > > useful, but I can also see where the ability to nest separate policies
> > > > would be useful.
> > >
> > > This is all fine. The question is whether this is a policy management
> > > issue or a kernel infrastructure issue. So far, it's not clear that this
> > > really necessitates kernel infrastructure changes to support container
> > > aware policies to be loaded by physical host admin/owner or the virtual
> > > host admin. The place where it breaks down is if each virtual host
> > > wants not only to control its own policy, but also its security model.
> >
> > What do you define as 'policy', and how is it different from the
> > security model?
> >
> > > Then we are left with stacking modules or heavier isolation (as in Xen).
> >
> > Hmm, talking about 'container' in this sense is confusing, because we're
> > not yet clear on what a container is.
> >
> > So I'm trying to get a handle on what we really want to do.
> >
> > Talking about namespaces is tricky. For instance if I do
> > clone(CLONE_NEWNS), the new process is in a new fs namespace, but the
> > fs objects are still the same, so if it loads an LSM, then perhaps at
> > most the new process should only control mount activities in its own
> > namespace.
> >
> > Frankly I thought, and am still not unconvinced, that containers owned
> > by someone other than the system owner would/should never want to load
> > their own LSMs, so that this wasn't a problem. Isolation, as Chris has
> > mentioned, would be taken care of by the very nature of namespaces.
> >
> > There are of course two alternatives... First, we might want to
> > allow the machine admin to insert per-container/per-namespace LSMs.
> > To support this case, we would need a way for the admin to tag a
> > container some way identifying it as being subject to a particular set
> > of security_ops.
> >
> > Second, we might want container admins to insert LSMs. In addition
> > to a straightforward way of tagging subjects/objects with their
> > container, we'd need to implement at least permissions for "may insert
> > global LSM", "may insert container LSM", and "may not insert any LSM."
> > This might be sufficient if we trust userspace to always create full
> > containers. Otherwise we might want to support meta-policy along the
> > lines of "may authorize ptrace and mount hooks only", or even "not
> > subject to the global inode_permission hook, and may create its own."
>
> sorry folks, I don't think that we _ever_ want container
> root to be able to load any kernel modules at any time
> without having CAP_SYS_ADMIN or so, in which case the
> modules can be global as well ... otherwise we end up
> as a bad Xen imitation with a lot of security issues,
> where it should be a security enhancement ...

I agree. As Chris points out, at most we should help LSM become
container-aware. But as the selinux example shows, even that should
not be necessary.

And that's for funky setups. For normal setups, the isolation provided
inherently by the namespaces should suffice.

thanks,
-serge

2006-03-30 18:01:57

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Herbert Poetzl <[email protected]> writes:

> sorry folks, I don't think that we _ever_ want container
> root to be able to load any kernel modules at any time
> without having CAP_SYS_ADMIN or so, in which case the
> modules can be global as well ... otherwise we end up
> as a bad Xen imitation with a lot of security issues,
> where it should be a security enhancement ...

Agreed, at least until someone defines a user-mode
linux-security-module. We may want a different security module
in effect for a particular guest; which modules you get
being defined by the one system administrator is fine.

The primary case I see worth worrying about is using
a security module to ensure isolation of a container,
while still providing the selinux mandatory capabilities
to a container.

Eric

2006-03-30 18:45:36

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

"Serge E. Hallyn" <[email protected]> writes:

> Frankly I thought, and am still not unconvinced, that containers owned
> by someone other than the system owner would/should never want to load
> their own LSMs, so that this wasn't a problem. Isolation, as Chris has
> mentioned, would be taken care of by the very nature of namespaces.

Up to uids I agree. Once we hit uids things get very ugly.
And since security modules already seem to touch all of the places
we need to touch to make a UID namespace work, I think it makes sense
to use security modules to implement the strange things we need with uids.

To ensure uid isolation we would need a different copy of every other
namespace. The pid space would need to be completely isolated,
and we couldn't share any filesystem mounts with any other namespace.
This especially includes /proc and sysfs.
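
As an illustration of what "a different copy of every other namespace"
means in practice; the CLONE_NEW* flags below only appeared in kernels
well after this thread, and isolate_everything() is a made-up helper:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Give the calling process (and its future children) a private copy of
 * every namespace, not just the uid-related state. */
int isolate_everything(void)
{
        int flags = CLONE_NEWUSER |  /* the uid mapping itself */
                    CLONE_NEWNS   |  /* private mounts, including /proc, sysfs */
                    CLONE_NEWPID  |  /* pid space (applies to children) */
                    CLONE_NEWNET  |  /* network stack */
                    CLONE_NEWIPC  |  /* SysV IPC */
                    CLONE_NEWUTS;    /* hostname */

        if (unshare(flags) != 0) {
                perror("unshare");
                return -1;
        }
        return 0;
}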

> There are of course two alternatives... First, we might want to allow the
> machine admin to insert per-container/per-namespace LSMs. To support
> this case, we would need a way for the admin to tag a container some way
> identifying it as being subject to a particular set of security_ops.
>
> Second, we might want container admins to insert LSMs. In addition to
> a straightforward way of tagging subjects/objects with their container,
> we'd need to implement at least permissions for "may insert global LSM",
> "may insert container LSM", and "may not insert any LSM." This might be
> sufficient if we trust userspace to always create full containers.
> Otherwise we might want to support meta-policy along the lines of "may
> authorize ptrace and mount hooks only", or even "not subject to the
> global inode_permission hook, and may create its own." (yuck)

Security modules that are stackable call mod_reg_security.
Currently in the kernel we have the root_plug, seclvl, and capability
modules implementing this. selinux currently only supports running
as the global security policy.

Allowing a different administrator to load modules is out of
the question, if we actually care about security.

However it is possible to build the capacity to multiplex
compiled-in or already loaded security modules, and allow which
security modules are in effect to be controlled by securityfs.

With appropriate care we should be able to allow the container
administrator to use this capability to select which security
policies, and mechanisms they want.
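
Purely as a sketch of the multiplexing idea: lsm_entry, lsm_register()
and lsm_by_name() are invented names for illustration, not existing
kernel interfaces, and a real version would also need locking and would
have to fit the existing register_security() path:

#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/security.h>
#include <linux/string.h>

struct lsm_entry {
        const char                 *name;
        struct security_operations *ops;
};

/* Each compiled-in module would register itself here at boot. */
static struct lsm_entry lsm_table[8];
static int lsm_count;

int lsm_register(const char *name, struct security_operations *ops)
{
        if (lsm_count >= ARRAY_SIZE(lsm_table))
                return -ENOSPC;
        lsm_table[lsm_count].name = name;
        lsm_table[lsm_count].ops  = ops;
        lsm_count++;
        return 0;
}

/* Looked up when the host admin writes a module name into a securityfs
 * file for a container; that container's hooks then come from these ops. */
struct security_operations *lsm_by_name(const char *name)
{
        int i;

        for (i = 0; i < lsm_count; i++)
                if (strcmp(lsm_table[i].name, name) == 0)
                        return lsm_table[i].ops;
        return NULL;
}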

That is something we probably want to consider anyway as
currently the security modules break the basic rule that
compiling code in should not affect how the kernel operates
by default.

Until we get to that point, simply specifying the name of a security
module in the static configuration of a container, which the container
creation program can use, should be enough.

> But so much of this depends on how the namespaces/containers end up
> being implemented...

Agreed. But if I hand wave and say an upper level security module will
decide when to call it then only the details of that upper level
security module are in question. The stacked module will likely just
work.

So I guess I am leaning towards a security namespace implemented with
an appropriate security module.

Eric

2006-03-30 18:52:26

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

* Serge E. Hallyn ([email protected]) wrote:
> Quoting Chris Wright ([email protected]):
> > This is all fine. The question is whether this is a policy management
> > issue or a kernel infrastructure issue. So far, it's not clear that this
> > really necessitates kernel infrastructure changes to support container
> > aware policies to be loaded by physical host admin/owner or the virtual
> > host admin. The place where it breaks down is if each virtual host
> > wants not only to control its own policy, but also its security model.
>
> What do you define as 'policy', and how is it different from the
> security model?

Model, as in TE, RBAC, or something trivially simple ala Openwall type
protection. Policy, as in rules to drive the model.

> Second, we might want container admins to insert LSMs.

I think we can agree that this way lies madness.

thanks,
-chris

2006-03-30 18:54:25

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

* Stephen Smalley ([email protected]) wrote:
> FWIW, SELinux now has a notion of a type hierarchy in its policy, so the
> root admin can carve out a portion of the policy space and allow less
> privileged admins to then define sub-types that are strictly constrained
> by what was allowed to the parent type by the root admin. This is
> handled in userspace, with the policy mediation performed by a userspace
> agent (daemon, policy management server), which then becomes the focal
> point for all policy loading.

*nod* this is exactly what I was thinking in terms of a container
specifying policy. It goes through the system/root container and gets
validated before being loaded.

thanks,
-chris

2006-03-30 19:06:41

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

* Eric W. Biederman ([email protected]) wrote:
> "Serge E. Hallyn" <[email protected]> writes:
>
> > Frankly I thought, and am still not unconvinced, that containers owned
> > by someone other than the system owner would/should never want to load
> > their own LSMs, so that this wasn't a problem. Isolation, as Chris has
> > mentioned, would be taken care of by the very nature of namespaces.
>
> Up to uids I agree. Once we hit uids things get very ugly.
> And since security modules already seem to touch all of the places
> we need to touch to make a UID namespace work, I think it makes sense
> to use security modules to implement the strange things we need with uids.
>
> To ensure uid isolation we would need a different copy of every other
> namespace. The pid space would need to be completely isolated,
> and we couldn't share any filesystem mounts with any other namespace.
> This especially includes /proc and sysfs.

Security modules use labels, not uids. The uid is the basis for
traditional DAC checks; labels are used for MAC checks. And it's
easy to imagine a label that includes a notion of container id.
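
For illustration only (container_label and cross_container are invented
names, not part of any existing module): a label that carries a container
id lets a MAC check refuse cross-container access even where uids collide.

#include <linux/types.h>

struct container_label {
        u32 container_id;   /* which container the labeled object belongs to */
        u32 sid;            /* the module's own context, e.g. an SELinux SID */
};

/* A hook could deny access outright when subject and object belong to
 * different containers, before applying the module's normal policy. */
static inline int cross_container(const struct container_label *subj,
                                  const struct container_label *obj)
{
        return subj->container_id != obj->container_id;
}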

> However it is possible to build the capacity to multiplex
> compiled in or already loaded security modules, and allowed which
> security modules are in effect to be controlled by securityfs.

Yes, it's been proposed and discussed many times. There are some
fundamental issues with composing security modules. First and foremost
is the notion that arbitrary security models may not compose to form
meaningful (in a security sense) results. Second, at an implementation
level, sharing labels is non-trivial and comes with overhead.

> With appropriate care we should be able to allow the container
> administrator to use this capability to select which security
> policies, and mechanisms they want.
>
> That is something we probably want to consider anyway as
> currently the security modules break the basic rule that
> compiling code in should not affect how the kernel operates
> by default.

Don't follow you on this one.

thanks,
-chris

2006-03-30 19:21:49

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

* Eric W. Biederman ([email protected]) wrote:
> As I currently understand the problem, everything goes along nicely,
> nothing really special needed, until you start asking the question:
> how do I implement a root user with uid 0 who does not own the
> machine? When you start asking that question is when the creepy
> crawlies come out.

Hehe. uid 0 _and_ full capabilities. So reducing capabilities is one
relatively easy way to handle that. And, if you have a security module
loaded, it's going to use security labels, which can be richer than both
uid and capabilities combined.
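
As a userspace sketch of the reduced-capabilities approach, assuming
libcap (link with -lcap); the particular capabilities dropped here are
only an example:

#include <stdio.h>
#include <sys/capability.h>

/* Drop a few host-affecting capabilities from a would-be container root
 * before it execs the container's init. */
int drop_host_caps(void)
{
        cap_t caps;
        cap_value_t drop[] = { CAP_SYS_MODULE, CAP_SYS_RAWIO, CAP_SYS_ADMIN };

        caps = cap_get_proc();
        if (!caps)
                return -1;

        /* Clear them from the effective and permitted sets.  Without a
         * per-process bounding set a setuid exec could still regain them,
         * so this is only one piece of the puzzle. */
        if (cap_set_flag(caps, CAP_EFFECTIVE, 3, drop, CAP_CLEAR) ||
            cap_set_flag(caps, CAP_PERMITTED, 3, drop, CAP_CLEAR) ||
            cap_set_proc(caps)) {
                perror("cap_set_proc");
                cap_free(caps);
                return -1;
        }
        cap_free(caps);
        return 0;
}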

> On most virtual filesystems the default owner of files is uid 0.
> Additional privilege checks are not applied. Writing to those
> files could potentially have global effect.

Yes, many (albeit far from all) have a capable() check as well.

> It is completely unclear how permissions checks should work
> between two processes in different uid namespaces. Especially
> there are cases where you do want interactions.

Are there? Why put them in different containers then? I'd think
network sockets are the extent of the interaction you'd want. Sharing
a filesystem does leave room for named pipes and unix domain sockets (also
in the abstract namespace). And considering the side channel in unix
domain sockets, they become a potential hole. So for solid isolation,
I'd expect disallowing access to those when the object owner is in a
different security context from the context which is trying to attach.

thanks,
-chris

2006-03-30 20:28:50

by Bill Davidsen

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Sam Vilain wrote:
> On Tue, 2006-03-28 at 09:41 -0500, Bill Davidsen wrote:
>>> It is more than realistic. Hosting companies run more than 100 VPSs in
>>> reality. There are also other useful scenarios. For example, I know
>>> the universities which run VPS for every faculty web site, for every
>>> department, mail server and so on. Why do you think they want to run
>>> only 5VMs on one machine? Much more!
>> I made no comment on what "they" might want, I want to make the rack of
>> underutilized Windows, BSD and Solaris servers go away. An approach
>> which doesn't support unmodified guest installs doesn't solve any of my
>> current problems. I didn't say it was in any way not useful, just not of
>> interest to me. What needs I have for Linux environments are answered by
>> jails and/or UML.
>
> We are talking about adding jail technology, also known as containers on
> Solaris and vserver/openvz on Linux, to the mainline kernel.
>
> So, you are obviously interested!
>
> Because of course, you can take an unmodified filesystem of the guest
> and, assuming the kernels are compatible, run it without changes. I
> find this consolidation approach indispensable.
>
The only way to assume kernels are compatible is to run the same distro.
Because vendor kernels are sure not compatible, even running a
kernel.org kernel on Fedora (for instance) reveals that the utilities are
also tweaked to expect the kernel changes, and you wind up with a system
which feels like wearing someone else's hat. It's stable but little
things just don't work right.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2006-03-31 05:39:03

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Chris Wright <[email protected]> writes:

>> With appropriate care we should be able to allow the container
>> administrator to use this capability to select which security
>> policies, and mechanisms they want.
>>
>> That is something we probably want to consider anyway as
>> currently the security modules break the basic rule that
>> compiling code in should not affect how the kernel operates
>> by default.
>
> Don't follow you on this one.

Very simple: it should be possible to statically compile in
all of the security modules and be able to pick at run time which
security module to use.

Unless I have been very blind and missed something skimming
through the code, if I compile in all of the security
modules, whichever one is initialized first is the one
that we will use.

Eric

2006-03-31 05:50:51

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

* Eric W. Biederman ([email protected]) wrote:
> Very simple: it should be possible to statically compile in
> all of the security modules and be able to pick at run time which
> security module to use.
>
> Unless I have been very blind and missed something skimming
> through the code, if I compile in all of the security
> modules, whichever one is initialized first is the one
> that we will use.

I see. No, you got that correct. That's rather intentional, to make
sure all objects are properly initialized as they are allocated rather
than having to double check at every access control check. That's why
security_initcalls are so early.
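
For context, the registration pattern being described, sketched with the
2.6-era interfaces (register_security() and security_initcall() are real;
the my_lsm names and the empty ops table are placeholders):

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/security.h>

static struct security_operations my_lsm_ops = {
        /* hook implementations would be filled in here */
};

static int __init my_lsm_init(void)
{
        /* Runs from a security_initcall, i.e. before normal initcalls, so
         * every object created later is labeled from the start.  If another
         * module registered first, this one simply does not take effect. */
        if (register_security(&my_lsm_ops))
                printk(KERN_INFO "my_lsm: another security module is active\n");
        return 0;
}
security_initcall(my_lsm_init);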

thanks,
-chris

2006-03-31 06:01:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Chris Wright <[email protected]> writes:

> * Eric W. Biederman ([email protected]) wrote:
>> As I currently understand the problem, everything goes along nicely,
>> nothing really special needed, until you start asking the question:
>> how do I implement a root user with uid 0 who does not own the
>> machine? When you start asking that question is when the creepy
>> crawlies come out.
>
> Hehe. uid 0 _and_ full capabilities. So reducing capabilities is one
> relatively easy way to handle that.

It comes close, but the capabilities are not currently factored correctly.

> And, if you have a security module
> loaded, it's going to use security labels, which can be richer than both
> uid and capabilities combined.

Exactly. You can define the semantics with a security module,
but you cannot define the semantics in terms of uids.

>> On most virtual filesystems the default owner of files is uid 0.
>> Additional privilege checks are not applied. Writing to those
>> files could potentially have global effect.
>
> Yes, many (albeit far from all) have a capable() check as well.

Nothing controlled by sysctl has a capable check, except
the capabilities sysctl. The default, if not the norm, is not
to apply capability checks.

>> It is completely unclear how permissions checks should work
>> between two processes in different uid namespaces. Especially
>> there are cases where you do want interactions.
>
> Are there? Why put them in different containers then? I'd think
> network sockets are the extent of the interaction you'd want. Sharing
> a filesystem does leave room for named pipes and unix domain sockets (also
> in the abstract namespace). And considering the side channel in unix
> domain sockets, they become a potential hole. So for solid isolation,
> I'd expect disallowing access to those when the object owner is in a
> different security context from the context which is trying to attach.

Yes. My current implementation has all of that visibility closed
when you create a new network namespace. But there are still
interactions. For me it isn't a real problem, though, as I have
a single system administrator and synchronized user ids. For
other use cases it is a different story.

In a more normal use case, the container admin can't get out, but
the box admin can get in. At least for simple things like monitoring
and possibly some debugging.

Or you get weird cases where you want to allow access to some of
the files in /proc to the container but not all.

If I am the machine admin and I have discovered a process in
a container that has a bug and is going wild, it is preferable
to kill that process, or possibly that container, rather than
rebooting the box to solve the problem.

All of the normal everyday interactions get handled fine and there
is simply no visibility. But I don't ever expect perfect isolation
from the machine admin.

I do still need to read up on the selinux mandatory access controls.
Although the comment from the NSA selinux FAQ about selinux being
just a proof-of-concept and no security bugs were discovered or
looked for during its implementation scares me.


Eric

2006-03-31 06:53:56

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Chris Wright <[email protected]> writes:

> * Eric W. Biederman ([email protected]) wrote:
>> Very simple: it should be possible to statically compile in
>> all of the security modules and be able to pick at run time which
>> security module to use.
>>
>> Unless I have been very blind and missed something skimming
>> through the code, if I compile in all of the security
>> modules, whichever one is initialized first is the one
>> that we will use.
>
> I see. No, you got that correct. That's rather intentional, to make
> sure all objects are properly initialized as they are allocated rather
> than having to double check at every access control check. That's why
> security_initcalls are so early.

Ok. That makes sense. The fact that some of the security modules
besides selinux are tristate in Kconfig had me confused for a moment.

Controlling what to run with a kernel command line makes sense
then.

Having a generic command line option like lsm=[selinux|root_plug|capability|seclvl]
would be nice, where supplying nothing would not enable any of
the Linux security modules.
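
A sketch of what such an option might look like; the lsm= parameter and
the chosen_lsm logic are hypothetical (mainline later grew a similar
security= option), while __setup() and security_initcall() are the real
mechanisms:

#include <linux/init.h>
#include <linux/string.h>

static char chosen_lsm[32];

static int __init lsm_setup(char *str)
{
        strlcpy(chosen_lsm, str, sizeof(chosen_lsm));
        return 1;
}
__setup("lsm=", lsm_setup);

/* Each compiled-in module's security_initcall would then bail out unless
 * it was the one named on the command line; with nothing supplied, no
 * module registers at all. */
static int __init my_lsm_init(void)
{
        if (strcmp(chosen_lsm, "my_lsm") != 0)
                return 0;
        /* register_security(&my_lsm_ops); */
        return 0;
}
security_initcall(my_lsm_init);

Booting with lsm=selinux would then select that module; booting without
an lsm= argument would leave all of them inactive.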

Eric

2006-03-31 13:40:25

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Quoting Eric W. Biederman ([email protected]):
> Herbert Poetzl <[email protected]> writes:
>
> > sorry folks, I don't think that we _ever_ want container
> > root to be able to load any kernel modules at any time
> > without having CAP_SYS_ADMIN or so, in which case the
> > modules can be global as well ... otherwise we end up
> > as a bad Xen imitation with a lot of security issues,
> > where it should be a security enhancement ...
>
> Agreed. At least until someone defines a user-mode
> linux-security-module. We may want a different security module

It's been done before, at least for some hooks (i.e., one implementation by
antivirus folks). But to actually do this with full support for all
hooks would require some changes. For example, the security_task_kill()
hook is called under several potential locks. At least
read_lock(tasklist_lock) and plain rcu_read_lock() (and I thought also
write_lock(tasklist_lock), but can't find that instance right now).

Clearly that can be fixed, but at the moment a user-mode LSM isn't
entirely possible.

-serge

2006-03-31 14:48:34

by Stephen Smalley

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

On Thu, 2006-03-30 at 23:00 -0700, Eric W. Biederman wrote:
> I do still need to read up on the selinux mandatory access controls.
> Although the comment from the NSA selinux FAQ about selinux being
> just a proof-of-concept and no security bugs were discovered or
> looked for during its implementation scares me.

Point of clarification: The original SELinux prototype NSA released in
Dec 2000 based on Linux 2.2.x kernels was a proof-of-concept reference
implementation. I wouldn't describe the current implementation in
mainline Linux 2.6 and certain distributions in the same manner. Also,
the separate Q&A about "did you try to fix any vulnerabilities" is just
saying that NSA did not perform a thorough code audit of the entire
Linux kernel; we just implemented the extensions needed for mandatory
access control.

http://selinux.sf.net/resources.php3 has some good pointers for SELinux
resources. There is also a recently created SELinux news site at
http://selinuxnews.org/wp/.


--
Stephen Smalley
National Security Agency

2006-03-31 16:41:39

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] Virtualization steps

Stephen Smalley <[email protected]> writes:

> On Thu, 2006-03-30 at 23:00 -0700, Eric W. Biederman wrote:
>> I do still need to read up on the selinux mandatory access controls.
>> Although the comment from the NSA selinux FAQ about selinux being
>> just a proof-of-concept and no security bugs were discovered or
>> looked for during its implementation scares me.
>
> Point of clarification: The original SELinux prototype NSA released in
> Dec 2000 based on Linux 2.2.x kernels was a proof-of-concept reference
> implementation. I wouldn't describe the current implementation in
> mainline Linux 2.6 and certain distributions in the same manner. Also,
> the separate Q&A about "did you try to fix any vulnerabilities" is just
> saying that NSA did not perform a thorough code audit of the entire
> Linux kernel; we just implemented the extensions needed for mandatory
> access control.
>
> http://selinux.sf.net/resources.php3 has some good pointers for SELinux
> resources. There is also a recently created SELinux news site at
> http://selinuxnews.org/wp/.

Thanks. I am concerned that there hasn't been an audit of at least
the core kernel.

My first interaction with security modules was that I fixed a bug
where /proc/pid/fd was performing the wrong super-user security
checks and the system became unusable for people using selinux.

Eric