Hi,
the lkml discussion on pid virtualization has been covering many of the
issues, both those relating directly to pid virtualization and those
relating to optimizations in the two specific implementations.
However, if we're going to get anywhere, the first decision which we
need to make is whether to go with a (container,pid), (pspace,pid) or
equivalent pair-like approach, or a virtualized pid approach. Linus had
previously said that he prefers the former. Since there has been much
discussion since then, I thought I'd try to recap the pros and cons of
each approach, with the hope that the head Penguins will chime in one
more time, after which we can hopefully focus our efforts.
Issues with the (pspace,pid) pair-like approach:
1. how do we reap zombies when the "real" init process
is not visible from within a container?
2. global process view
userspace tools may need to be taught about containers
in order to provide any container with a "global pid view".
i.e. all tasks could be listed as (pspace,pid), or as
pid1/pid2/pid3 where pid1 is creator of pid2's pspace
which is creator of pid3's pspace...
3. no half-isolation mode?
containers are always fully isolated. This doesn't
need to be the case if userspace tools are taught
to deal with containerids. On the other hand, it
can also be considered one of its strengths.
Issues with pid virtualization:
1. maintenance/correctness
pids and vpids are now different and must not be mixed.
Enforcing this simply in the kernel is a concern. Sparse
may be useful here, or simply using different opaque types
(a sketch of the opaque-type idea follows this list).
2. slowdown after migration
before a checkpoint, pid == vpid. After a restore or migration,
vpid = hash(pid) or vice versa (see the second sketch below).
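As an aside on the opaque-type idea (item 1), here is a minimal
userspace sketch; all names in it (kpid_t, vpid_t, kernel_signal) are
invented for illustration, and in the kernel proper sparse's __bitwise
annotations could enforce the same separation without struct wrapping:

  #include <stdio.h>

  typedef struct { int nr; } kpid_t;   /* kernel-internal pid            */
  typedef struct { int nr; } vpid_t;   /* pid as seen inside a container */

  static kpid_t mk_kpid(int nr) { kpid_t p = { nr }; return p; }
  static vpid_t mk_vpid(int nr) { vpid_t p = { nr }; return p; }

  /* A function that must only ever receive a kernel pid. */
  static void kernel_signal(kpid_t pid, int sig)
  {
      printf("signal %d to kernel pid %d\n", sig, pid.nr);
  }

  int main(void)
  {
      kpid_t k = mk_kpid(1042);
      vpid_t v = mk_vpid(1);

      kernel_signal(k, 9);    /* fine */
      /* kernel_signal(v, 9); is an incompatible-type error, so mixing
       * pids and vpids is caught at compile time, not at runtime. */
      (void)v;
      return 0;
  }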
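And for item 2, a toy model of the post-migration cost: before a
checkpoint translation is free, after a restore every lookup walks a
hash table. This is only an illustration of the idea, not OpenVZ code,
and every name in it is hypothetical:

  #include <stdio.h>
  #include <stdlib.h>

  #define HASH_SIZE 256

  struct vpid_map {
      int vpid, pid;
      struct vpid_map *next;
  };

  static struct vpid_map *vpid_hash[HASH_SIZE];
  static int migrated;                 /* set once after a restore */

  static void map_vpid(int vpid, int pid)
  {
      struct vpid_map *m = malloc(sizeof(*m));

      m->vpid = vpid;
      m->pid  = pid;
      m->next = vpid_hash[vpid % HASH_SIZE];
      vpid_hash[vpid % HASH_SIZE] = m;
  }

  static int vpid_to_pid(int vpid)
  {
      struct vpid_map *m;

      if (!migrated)                   /* fast path: identity */
          return vpid;
      for (m = vpid_hash[vpid % HASH_SIZE]; m; m = m->next)
          if (m->vpid == vpid)         /* slow path: hash walk */
              return m->pid;
      return -1;
  }

  int main(void)
  {
      printf("before: vpid 42 -> pid %d\n", vpid_to_pid(42));
      migrated = 1;
      map_vpid(42, 31337);             /* restore chose a new real pid */
      printf("after:  vpid 42 -> pid %d\n", vpid_to_pid(42));
      return 0;
  }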
Please add any issues I've not listed, or correct anything you feel I've
misrepresented.
thanks,
-serge
"Serge E. Hallyn" <[email protected]> writes:
> Hi,
>
> the lkml discussion on pid virtualization has been covering many of the
> issues both relating directly to pid virtualization, and relating to
> optimizations in the two specific implementations.
>
> However, if we're going to get anywhere, the first decision which we
> need to make is whether to go with a (container,pid), (pspace,pid) or
> equivalent pair-like approach, or a virtualized pid approach. Linus had
> previously said that he prefers the former. Since there has been much
> discussion since then, I thought I'd try to recap the pros and cons of
> each approach, with the hope that the head Penguins will chime in one
> more time, after which we can hopefully focus our efforts.
Does anyone see problems with implementing this as a series of namespaces?
If not we can move forward and start talking about pids, and the
other namespaces.
With respect to pids, let's not get caught up in the implementation
details. Let's first get clear on what the semantics should be.
- Should the first pid in a pid space have pid 1?
- Should pid == 1 ignore signals it doesn't have a handler for?
- Should any children of pid 1 be allowed to live when pid == 1 is killed?
- Should a process have some sort of global (on the machine) identifier?
- Should the pids in a pid space be visible from the outside?
- Should the parent of pid 1 be able to wait for it, as for its
children?
- Is a completely disjoint pid space acceptable to anyone?
- What should the parent of pid == 1 see?
- Should a process not in the default pid space be able to create
another pid space?
- Should we be able to monitor a pid space from the outside?
- Should we be able to have processes enter a pid space?
- Do we need to be able to ptrace/kill individual processes
in a pid space, from the outside, and why?
- After migration what identifiers should the tasks have?
If we can answer these kinds of questions we can likely focus in
on what the implementation should look like. So far I have not
seen a question that could not be implemented with a (pspace, pid)/pid
or a vpid/pid implementation.
I think it is safe to say that from the inside things should look to
processes just as they do now. Which answers a lot of those
questions. But it still leaves a lot open.
Eric
Serge E. Hallyn wrote:
> However, if we're going to get anywhere, the first decision which we
> need to make is whether to go with a (container,pid), (pspace,pid) or
> equivalent pair-like approach, or a virtualized pid approach. Linus had
> previously said that he prefers the former. Since there has been much
> discussion since then, I thought I'd try to recap the pros and cons of
> each approach, with the hope that the head Penguins will chime in one
> more time, after which we can hopefully focus our efforts.
I am thinking that you can have both. Not in the sense of
overcomplicating, but in the sense of having your cake and eating it
too.
The only thing which is a unique, system-wide identifier for the process
is the &task_struct. So we are already virtualising this pointer into a
PID for userland. The only difference is that we cache it (nay, keep
the authoritative version of it) in the task_struct.
The (XID, PID) approach internally is also fine. This says that there
is a container XID, and within it, the PID refers to a particular
task_struct. A given task_struct will likely exist in more than one
place in the (XID, PID) space. Perhaps the values of PID for XID = 0
and XID = task.xid can be cached in the task_struct, but that is a
detail.
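A rough userspace model of that, with invented names; the point is only
that the same task can be found under more than one (XID, PID) name:

  #include <stdio.h>

  struct task {
      const char *comm;
      int xid;        /* container id                           */
      int pid;        /* pid inside that container              */
      int global_pid; /* cached name in XID 0, the detail above */
  };

  static struct task tasks[] = {
      { "init-guest", 7, 1, 1042 },
      { "httpd",      7, 2, 1043 },
  };

  /* Look a task up by any of its (xid, pid) names. */
  static struct task *find_task(int xid, int pid)
  {
      size_t i;

      for (i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++) {
          struct task *t = &tasks[i];

          if ((xid == t->xid && pid == t->pid) ||
              (xid == 0 && pid == t->global_pid))
              return t;
      }
      return NULL;
  }

  int main(void)
  {
      /* One task_struct, two names in (XID, PID) space. */
      printf("%s\n", find_task(7, 2)->comm);     /* from inside the guest */
      printf("%s\n", find_task(0, 1043)->comm);  /* from the host, XID 0  */
      return 0;
  }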
Depending on the flags on the XID, we can incorporate all the approaches
being tabled. You want virtualised pids? Well, that'll hurt a little,
but suit yourself - set a flag on your container and inside the
container you get virtualised PIDs. You want a flat view for all your
vservers? Fine, just use an XID without the virtualisation flag and
with the "all seeing eye" property set. Or you use an XID _with_ the
virtualisation flag set, and then call a tuple-endowed API to find the
information you're after.
We can enforce this by simply removing all the internal macros that deal
with single PID references only; i.e., enforce the XID to be used
everywhere. This removes the distinction between virtual PIDs and
"real" pids; it's not a type difference, but an XID value difference.
There are lots and lots of details I'm glossing over, but such finer
points are best discussed by trading patches.
IOW, we can stop arguing and start implementing :-).
Sam.
Sam Vilain <[email protected]> writes:
> Serge E. Hallyn wrote:
>> However, if we're going to get anywhere, the first decision which we
>> need to make is whether to go with a (container,pid), (pspace,pid) or
>> equivalent pair-like approach, or a virtualized pid approach. Linus had
>> previously said that he prefers the former. Since there has been much
>> discussion since then, I thought I'd try to recap the pros and cons of
>> each approach, with the hope that the head Penguins will chime in one
>> more time, after which we can hopefully focus our efforts.
>
> IOW, we can stop arguing and start implementing :-).
PID Space god mode....
If internally each pspace had a small number that we could prepend
to the pid, we would have a local global pid view.
If we hashed each pid by the unsigned long version of pspace->nr | pid,
we would have a hash table with a global view.
If we exported this number to user space we would have global pids.
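For concreteness, a sketch of that key, taking "prepend" to mean
shifting pspace->nr above the pid bits; the 16-bit split is purely an
assumption for the example:

  #include <stdio.h>

  #define PID_BITS  16UL               /* assumed split, not settled */
  #define PID_MASK  ((1UL << PID_BITS) - 1)

  static unsigned long global_key(unsigned long pspace_nr, unsigned long pid)
  {
      return (pspace_nr << PID_BITS) | (pid & PID_MASK);
  }

  int main(void)
  {
      unsigned long key = global_key(3, 42);   /* pid 42 in pspace 3 */

      printf("key = %#lx, pspace = %lu, pid = %lu\n",
             key, key >> PID_BITS, key & PID_MASK);
      return 0;
  }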
I absolutely hate the idea because it yields a set of processes whose
view of the world is difficult if not impossible to migrate to another
machine, plus those processes need an extra set of translation functions.
It is worth mentioning because it is easy to implement, and either everyone
else will like it and it will get adopted or it will at least provide
an easy way to implement a transition API, for those people currently stuck.
Eric
Quoting Eric W. Biederman ([email protected]):
> "Serge E. Hallyn" <[email protected]> writes:
> With respect to pids, let's not get caught up in the implementation
> details. Let's first get clear on what the semantics should be.
>
> - Should the first pid in a pid space have pid 1?
>
> - Should pid == 1 ignore signals it doesn't have a handler for?
>
> - Should any children of pid 1 be allowed to live when pid == 1 is killed?
But doesn't that depend on whether we use (pspace,pid) or vpids? If
vpids, then init isn't really a problem, since from kernelspace
processes in a container still have a global pid and global parent, and
init knows them. If (pspace,pid), then we need a fakeinit because the real
init doesn't know about the processes in the container...
> - Should a process have some sort of global (on the machine) identifier?
I think to satisfy openvz existing customers this must be a yes. With
vpid the answer is simple. With (pspace,pid), there are three answers I've
heard, namely
1. just use pspaceid, pid
2. make pspaceid small and use (pspaceid << SOMEBITS | pid)
3. use pid1/pid2/pid3 where pid1 is creator of pid and its
pspace, etc...
But the openvz guys also don't want userspace tool changes, making (2)
the most likely option. Any other ideas?
> - Should the pids in a pid space be visible from the outside?
Again, the openvz guys say yes.
I think it should be acceptable if a pidspace is visible in all its
ancestor pidspaces. I.e. if I create pspace2 and pspace3 from pid 234
in pspace1, then pspace2 doesn't need to be able to address pspace3
and vice versa.
Kirill, is that acceptable?
> - Should the parent of pid 1 be able to wait for it, as for its
> children?
Yes.
> - Is a completely disjoint pid space acceptable to anyone?
To anyone? yes :)
To everyone, I don't think so.
> - What should the parent of pid == 1 see?
>
> - Should a process not in the default pid space be able to create
> another pid space?
Yes.
This is to support using pidspaces for vservers, and creating
migrateable sub-pidspaces in each vserver.
> - Should we be able to monitor a pid space from the outside?
To some extent, yes.
> - Should we be able to have processes enter a pid space?
IMO that is crucial.
> - Do we need to be able to ptrace/kill individual processes
> in a pid space, from the outside, and why?
I think this is completely unnecessary so long as a process can enter a
pidspace.
> - After migration what identifiers should the tasks have?
It must be possible to retain the same pids, at least from inside the
container.
So this is irrelevant, as the openvz approach can just virtualize the
old pid, while (pspace, pid) will be able to create a new container and
use the old pid values, which are then guaranteed to not be in use.
> If we can answer these kinds of questions we can likely focus in
> on what the implementation should look like. So far I have not
> seen a question that could not be implemented with a (pspace, pid)/pid
> or a vpid/pid implementation.
But you have, haven't you? Namely, how can openvz provide its
customers with a global view of all processes without putting 5 years of
work into a new sysadmin interface?
-serge
On Wed, Feb 15, 2006 at 03:12:13PM -0700, Eric W. Biederman wrote:
> "Serge E. Hallyn" <[email protected]> writes:
>
> > Hi,
> >
> > the lkml discussion on pid virtualization has been covering many of the
> > issues both relating directly to pid virtualization, and relating to
> > optimizations in the two specific implementations.
> >
> > However, if we're going to get anywhere, the first decision which we
> > need to make is whether to go with a (container,pid), (pspace,pid) or
> > equivalent pair-like approach, or a virtualized pid approach. Linus had
> > previously said that he prefers the former. Since there has been much
> > discussion since then, I thought I'd try to recap the pros and cons of
> > each approach, with the hope that the head Penguins will chime in one
> > more time, after which we can hopefully focus our efforts.
>
> Does anyone see problems with implementing this as a series of namespaces?
> If not we can move forward and start talking about pids, and the
> other namespaces.
>
> With respect to pids, let's not get caught up in the implementation
> details. Let's first get clear on what the semantics should be.
>
> - Should the first pid in a pid space have pid 1?
depends, usually either one process has pid==1 (the init)
or no process should use that pid (still handled special)
nevertheless, the latter case requires a 'fake' pid==1
to make some userspace processes happy (e.g. pstree)
> - Should pid == 1 ignore signals it doesn't have a handler for?
the 'init' process must be protected in a similar fashion
to the real init, otherwise guests will end up dying
in certain situations; of course, it would be nice to have
some kind of flag to turn this on and off
> - Should any children of pid 1 be allowed to live
> when pid == 1 is killed?
again, that's a feature which would be nice, especially
for the lightweight contexts, which do not have an init
process running inside the guest
> - Should a process have some sort of global (on the machine) identifier?
this is mandatory, as it is required to kill any process
from the host (admin) context, without entering the pid
space (which would lead to all kinds of security issues)
> - Should the pids in a pid space be visible from the outside?
yes, while not strictly required, folks really like to
view the overall system state. this can be done with the
help of special tools, but again it should be done
without entering each guest pid space ...
> - Should the parent of pid 1 be able to wait for it, as for its
> children?
definitely, we (Linux-VServer) added this some time ago
and it helps to maintain/restart a guest.
> - Is a completely disjoint pid space acceptable to anyone?
yes, as long as the aforementioned access, management
and control mechanisms are in place ...
> - What should the parent of pid == 1 see?
doesn't really matter, but I see three options there:
- the parent space
- the child space
- both
> - Should a process not in the default pid space be able to create
> another pid space?
that would be a requirement for hierarchical spaces
> - Should we be able to monitor a pid space from the outside?
yes, definitely, but it could happen via some special
interfaces, i.e. no need to make it compatible
> - Should we be able to have processes enter a pid space?
definitely, without that, the entire VPS concept will
not work, folks use the 'admin' backdoors 90% of the
time ...
> - Do we need to be able to ptrace/kill individual
> processes in a pid space, from the outside, and why?
ptrace: no, not really; if there are issues which need
to be investigated, you can always enter the space and
attach strace there. IMHO there is not much info
in ptracing the space creation/transition
kill: yes, once you identified an evil guest process,
you want to get rid of it, without entering the space
> - After migration what identifiers should the tasks have?
doesn't matter, as long as they are unique, so
ppid1/ppid2/ppid3/pid would work ...
> If we can answer these kinds of questions we can likely focus in on
> what the implementation should look like. So far I have not seen a
> question that could not be implemented with a (pspace, pid)/pid or a
> vpid/pid implementation.
best,
Herbert
> I think it is safe to say that from the inside things should look to
> processes just as they do now. Which answers a lot of those
> questions. But it still leaves a lot open.
>
> Eric
Quoting Herbert Poetzl ([email protected]):
> > - Should a process have some sort of global (on the machine) identifier?
>
> this is mandatory, as it is required to kill any process
> from the host (admin) context, without entering the pid
> space (which would lead to all kinds of security issues)
Just to be clear: you think there should be cases where pspace x can
kill processes in pspace y, but can't enter it?
I'm not convinced that's grounded in reasonable assumptions...
> > - Should the pids in a pid space be visible from the outside?
>
> yes, while not strictly required, folks really like to
> view the overall system state. this can be done with the
> help of special tools, but again it should be done
> without entering each guest pid space ...
>
...
> > - Should we be able to monitor a pid space from the outside?
>
> yes, definitely, but it could happen via some special
> interfaces, i.e. no need to make it compatible
What sort of interfaces do you envision for these two? If we
can lay them out well enough, maybe the result will satisfy the
openvz folks?
For instance, perhaps we just use a proc interface, where in the
current pspace, if we've created a new pspace which in our pspace
is known as process 567, then we might see
/proc
/proc/567
/proc/567/pspace
/proc/567/pspace/1 -> link to /proc/567
/proc/567/pspace/2
Now we also might be able to interact with the pspace by doing
something like
echo -9 > /proc/567/pspace/2/kill
and of course do things like
cd /proc/567/pspace/1/root
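To pin down what a write to such a kill file would mean, here is a
small userspace model; the pspace/ tree and the kill file are
hypothetical (this thread's proposal only), so the sketch just parses
the buffer the way the handler might and calls kill(2) directly:

  #include <signal.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>

  #define MAX_SIG 64                      /* assume at most 64 signals */

  /* A write of "-9" asks for signal 9; the leading '-' just mirrors
   * the shell convention used in the echo example above. */
  static int pspace_kill_write(pid_t target, const char *buf)
  {
      long sig = labs(strtol(buf, NULL, 10));

      if (sig <= 0 || sig > MAX_SIG)
          return -1;                      /* reject junk input */
      return kill(target, (int)sig);
  }

  int main(int argc, char **argv)
  {
      if (argc != 3) {
          fprintf(stderr, "usage: %s <pid> <signal>\n", argv[0]);
          return 1;
      }
      return pspace_kill_write((pid_t)atol(argv[1]), argv[2]) ? 1 : 0;
  }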
> > - After migration what identifiers should the tasks have?
>
> doesn't matter, as long as they are unique, so
> ppid1/ppid2/ppid3/pid would work ...
And where are we talking about? Is this an identifier for userspace
tools? Or just in kernelspace?
-serge
"Serge E. Hallyn" <[email protected]> writes:
> Quoting Eric W. Biederman ([email protected]):
>> "Serge E. Hallyn" <[email protected]> writes:
>> With respect to pids, let's not get caught up in the implementation
>> details. Let's first get clear on what the semantics should be.
>>
>> - Should the first pid in a pid space have pid 1?
>>
>> - Should pid == 1 ignore signals it doesn't have a handler for?
>>
>> - Should any children of pid 1 be allowed to live when pid == 1 is killed?
>
> But doesn't that depend on whether we use (pspace,pid) or vpids?
No. This is completely an implementation detail.
> If
> vpids, then init isn't really a problem, since from kernelspace
> processes in a container still have a global pid and global parent, and
> init knows them. If (pspace,pid), then we need a fakeinit because the real
> init doesn't know about the processes in the container...
If you take the fakeinit you can't run a normal distro, because you are
not running init, but that is a slightly separate issue.
If (pspace, pid) we still may want to wait until all of the processes die
before we report to the parent.
I chose the semantics that I figured best match what currently happens
now and would most likely make existing user space software work.
>> - Should a process have some sort of global (on the machine) identifier?
>
> I think to satisfy openvz existing customers this must be a yes. With
> vpid the answer is simple. With (pspace,pid), there are three answers I've
> heard, namely
>
> 1. just use pspaceid, pid
> 2. make pspaceid small and use (pspaceid << SOMEBITS | pid)
> 3. use pid1/pid2/pid3 where pid1 is creator of pid and its
> pspace, etc...
>
> But the openvz guys also don't want userspace tool changes, making (2)
> the most likely option. Any other ideas?
Not really.
Part of the problem is with pids we don't have very many bits, and
the users of pids aren't interested in all-to-all communication by
pids.
There is a certain sanity in making all of the pids visible in a
single flat pid namespace. But I think there are implementation
issues.
>> - Should the pids in a pid space be visible from the outside?
>
> Again, the openvz guys say yes.
Please let them speak for themselves.
> I think it should be acceptable if a pidspace is visible in all its
> ancestor pidspaces. I.e. if I create pspace2 and pspace3 from pid 234
> in pspace1, then pspace2 doesn't need to be able to address pspace3
> and vice versa.
A good rule. Now consider pspace 4 which is a child of pid 567
in pspace 3.
What should pspace 3 see?
What should pspace 1 see?
What happens when you migrate pspace 3 into a different pspace
on a different machine?
Is there a sane implementation for this?
My only real objection to this approach is that I cannot see a sane
and simple implementation.
>> - Should the parent of pid 1 be able to wait for it, as for its
>> children?
>
> Yes.
>
>> - Is a completely disjoint pid space acceptable to anyone?
>
> To anyone? yes :)
>
> To everyone, I don't think so.
Well I would love to hear from someone it is acceptable to.
That would be an interesting perspective.
>> - What should the parent of pid == 1 see?
>>
>> - Should a process not in the default pid space be able to create
>> another pid space?
>
> Yes.
>
> This is to support using pidspaces for vservers, and creating
> migrateable sub-pidspaces in each vserver.
Agreed.
Now this case is very interesting, because supporting it creates
interesting restrictions on the rest of the problem, and
unless I miss something this is where the OpenVZ implementation
currently falls down.
Which names does the intermediate pidspace (vserver) see the child
pidspace with?
Which names does the initial pidspace see the child pid space with?
>> - Should we be able to monitor a pid space from the outside?
>
> To some extent, yes.
Especially if you don't want to migrate your monitor tools
with your jobs...
>> - Should we be able to have processes enter a pid space?
>
> IMO that is crucial.
It seems to be. However I am not yet convinced we want to
implement this in the kernel. The in-kernel implementation is
easy but that may be dictating security policy. So this
needs a thorough review.
A side note here is that if we allow ptrace from the outside
on even just the init process this opens things up very much
like an enter would (with respect to security).
>> - Do we need to be able to ptrace/kill individual processes
>> in a pid space, from the outside, and why?
>
> I think this is completely unnecessary so long as a process can enter a
> pidspace.
Which is likely a good compromise.
>> - After migration what identifiers should the tasks have?
>
> It must be possible to retain the same pids, at least from inside the
> container.
>
> So this is irrelevant, as the openvz approach can just virtualize the
> old pid, while (pspace, pid) will be able to create a new container and
> use the old pid values, which are then guaranteed to not be in use.
Actually this gets to the heart of my issue with the openvz vpid
implementation. Either I am blind or it does not address the pids
in a nested pidspace when talking about migration. I am not even
certain it currently allows for nested pid spaces.
If I migrate a vserver with a sub-pidspace, what do I
see?
>> If we can answer these kinds of questions we can likely focus in
>> on what the implementation should look like. So far I have not
>> seen a question that could not be implemented with a (pspace, pid)/pid
>> or a vpid/pid implementation.
>
> But you have, haven't you? Namely, how can openvz provide its
> customers with a global view of all processes without putting 5 years of
> work into a new sysadmin interface?
Well I think we can reuse most of the old sysadmin interfaces, yes.
Eric
Herbert Poetzl <[email protected]> writes:
>> - Should the first pid in a pid space have pid 1?
>
> depends, usually either one process has pid==1 (the init)
> or no process should use that pid (still handled special)
> nevertheless, the latter case requires a 'fake' pid==1
> to make some userspace processes happy (e.g. pstree)
Ok. So we always need a pid == 1 to show up.
>> - Should pid == 1 ignore signals it doesn't have a handler for?
>
> the 'init' process must be protected in a similar fashion
> to the real init, otherwise guests will end up dying
> in certain situations; of course, it would be nice to have
> some kind of flag to turn this on and off
That is an interesting thought... Have a flag that sets
whether you get to ignore unhandled signals :)
>> - Should any children of pid 1 be allowed to live
>> when pid == 1 is killed?
>
> again, that's a feature which would be nice, especially
> for the lightweight contexts, which do not have an init
> process running inside the guest
But if init is killed you want to still see init and have
children reaped?
>> - Should a process have some sort of global (on the machine) identifier?
>
> this is mandatory, as it is required to kill any process
> from the host (admin) context, without entering the pid
> space (which would lead to all kinds of security issues)
Ok. Would it be sufficient to simply stomp the entire pid
space? I.e. do you need fine grained control, or is this merely
a matter of keeping the machine usable for everyone else?
>> - Should the pids in a pid space be visible from the outside?
>
> yes, while not strictly required, folks really like to
> view the overall system state. this can be done with the
> help of special tools, but again it should be done
> without entering each guest pid space ...
Ok. I didn't ask quite what I thought I had asked :)
I agree that monitoring seems important.
>> - Should the parent of pid 1 be able to wait for it, as for its
>> children?
>
> definitely, we (Linux-VServer) added this some time ago
> and it helps to maintain/restart a guest.
That is how I expected that to get used. :)
>> - Is a completely disjoint pid space acceptable to anyone?
>
> yes, as long as the aforementioned access, management
> and control mechanisms are in place ...
Good answer.
>> - What should the parent of pid == 1 see?
>
> doesn't really matter, but I see three options there:
>
> - the parent space
> - the child space
> - both
An answer from a totally different perspective :)
>> - Should a process not in the default pid space be able to create
>> another pid space?
>
> that would be a requirement for hierarchical spaces
Well if the spaces were completely disjoint it would not be very
much of a hierarchy...
>> - Should we be able to monitor a pid space from the outside?
>
> yes, definitely, but it could happen via some special
> interfaces, i.e. no need to make it compatible
>> - Should we be able to have processes enter a pid space?
>
> definitely, without that, the entire VPS concept will
> not work, folks use the 'admin' backdoors 90% of the
> time ...
A good inducement to get this part right :)
To give a little more context: this comes up when you are using this
functionality to make a super chroot and you only have
one sysadmin, or when your talented sysadmin is assisting
other sysadmins of guests.
>> - Do we need to be able to ptrace/kill individual
>> processes in a pid space, from the outside, and why?
>
> ptrace: no, not really; if there are issues which need
> to be investigated, you can always enter the space and
> attach strace there. IMHO there is not much info
> in ptracing the space creation/transition
>
> kill: yes, once you identified an evil guest process,
> you want to get rid of it, without entering the space
>
>> - After migration what identifiers should the tasks have?
>
> doesn't matter, as long as they are unique, so
> ppid1/ppid2/ppid3/pid would work ...
>
>> If we can answer these kinds of questions we can likely focus in on
>> what the implementation should look like. So far I have not seen a
>> question that could not be implemented with a (pspace, pid)/pid or a
>> vpid/pid implementation.
>
> best,
> Herbert
Thanks.
Eric
"Serge E. Hallyn" <[email protected]> writes:
> Quoting Herbert Poetzl ([email protected]):
>> > - Should a process have some sort of global (on the machine) identifier?
>>
>> this is mandatory, as it is required to kill any process
>> from the host (admin) context, without entering the pid
>> space (which would lead to all kinds of security issues)
>
> Just to be clear: you think there should be cases where pspace x can
> kill processes in pspace y, but can't enter it?
>
> I'm not convinced that's grounded in reasonable assumptions...
Actually I think it is. The admin should control what is running
on their box.
>> > - Should we be able to monitor a pid space from the outside?
>>
>> yes, definitely, but it could happen via some special
>> interfaces, i.e. no need to make it compatible
>
> What sort of interfaces do you envision for these two? If we
> can lay them out well enough, maybe the result will satisfy the
> openvz folks?
>
> For instance, perhaps we just use a proc interface, where in the
> current pspace, if we've created a new pspace which in our pspace
> is known as process 567, then we might see
>
> /proc
> /proc/567
> /proc/567/pspace
> /proc/567/pspace/1 -> link to /proc/567
> /proc/567/pspace/2
>
> Now we also might be able to interact with the pspace by doing
> something like
>
> echo -9 > /proc/567/pspace/2/kill
>
> and of course do things like
>
> cd /proc/567/pspace/1/root
Actually I think this is the model we need to investigate if we
need to extend the interface to handle new things.
By using the filesystem it allows things to be cobbled together with
scripts so new C programs are not required.
It happens that Plan9 does it this way successfully, so there is some
precedent.
Actually increasingly I like this notion, as it also allows us to
export the ability to kill a process with a network filesystem.
Which means multiple machine management in a cluster could easily
reduce to the same set of tools as multiple pid space management.
>> > - After migration what identifiers should the tasks have?
>>
>> doesn't matter, as long as they are unique, so
>> ppid1/ppid2/ppid3/pid would work ...
>
> And where are we talking about? Is this an identifier for userspace
> tools? Or just in kernelspace?
The point seems to be that the identifiers don't matter just
so long as there is one.
Eric
On Thu, 2006-02-16 at 15:30 +0100, Herbert Poetzl wrote:
> > - Should a process have some sort of global (on the machine) identifier?
>
> this is mandatory, as it is required to kill any process
> from the host (admin) context, without entering the pid
> space (which would lead to all kinds of security issues)
Giving admin processes the ability to enter pid spaces seems like it
solves an entire class of problems, right? Could you explain a bit
what kinds of security issues it introduces?
-- Dave
Quoting Eric W. Biederman ([email protected]):
> > I think it should be acceptable if a pidspace is visible in all its
> > ancestor pidspaces. I.e. if I create pspace2 and pspace3 from pid 234
> > in pspace1, then pspace2 doesn't need to be able to address pspace3
> > and vice versa.
>
> A good rule. Now consider pspace 4 which is a child of pid 567
> in pspace 3.
>
> What should pspace 3 see?
Implementation dependent.
What I'd like to see is:
> What should pspace 3 see?
The pid of the init process for pspace 4.
> What should pspace 1 see?
The pid of the init process for pspace 3.
> What happens when you migrate pspace 3 into a different pspace
> on a different machine?
Nothing special. "Migrate" was just a checkpoint (from pspace 1)
and a resume (from pspace N on some machine). So now pspace N on
the new machine has created a new pspace - which happens to be
immediately populated with the contents of the old pspace 3 - and
sees the pid of the init process of this new pspace.
> Is there a sane implementation for this?
IMO, definitely yes.
But I haven't tried it, so my opinion is just that.
-serge
Quoting Eric W. Biederman ([email protected]):
> "Serge E. Hallyn" <[email protected]> writes:
>
> > Quoting Herbert Poetzl ([email protected]):
> >> > - Should a process have some sort of global (on the machine) identifier?
> >>
> >> this is mandatory, as it is required to kill any process
> >> from the host (admin) context, without entering the pid
> >> space (which would lead to all kinds of security issues)
> >
> > Just to be clear: you think there should be cases where pspace x can
> > kill processes in pspace y, but can't enter it?
> >
> > I'm not convinced that's grounded in reasonable assumptions...
>
> Actually I think it is. The admin should control what is running
> on their box.
Of course. I meant "grounded in reasonable security assumptions."
If you really are the admin then you will find another way of
"getting into" the pspace.
But really, what does "enter" mean in this case? If you can see
the processes so as to kill them, is that all you need? After all
this is distinct from the filesystem namespace - the pids are the
only thing that's distinct. So the only thing that I can see you
preventing by preventing "entering" the pspace is starting a new
process with a pid valid in the other pspace.
-serge
"Serge E. Hallyn" <[email protected]> writes:
> Quoting Eric W. Biederman ([email protected]):
>> > I think it should be acceptable if a pidspace is visible in all its
>> > ancestor pidspaces. I.e. if I create pspace2 and pspace3 from pid 234
>> > in pspace1, then pspace2 doesn't need to be able to address pspace3
>> > and vice versa.
>>
>> A good rule. Now consider pspace 4 which is a child of pid 567
>> in pspace 3.
>>
>> What should pspace 3 see?
>
> Implementation dependent.
>
> What I'd like to see is:
>
>> What should pspace 3 see?
>
> The pid of the init process for pspace 4.
>
>> What should pspace 1 see?
>
> The pid of the init process for pspace 3.
>
>> What happens when you migrate pspace 3 into a different pspace
>> on a different machine?
>
> Nothing special. "Migrate" was just a checkpoint (from pspace 1)
> and a resume (from pspace N on some machine). So now pspace N on
> the new machine has created a new pspace - which happens to be
> immediately populated with the contents of the old pspace 3 - and
> sees the pid of the init process of this new pspace.
>
>> Is there a sane implementation for this?
>
> IMO, definitely yes.
>
> But I haven't tried it, so my opinion is just that.
If you are just talking about the pid of the init process, the problem seems
tractable.
Where I see real problems with migration and nested pid spaces
is when you expose all of your pids to your parent, and perhaps
there was some miscommunication on this point.
To try and give an example.
pspace 1 pspace 2 pspace 3 pspace 4
pid 234 -> pid 1
pid 235 -> pid 2 -> pid 1
pid 236 -> pid 3 -> pid 2 -> pid 1
Hopefully this clearly shows what I was trying to avoid, by
only allowing pid 1 of any pspace to be visible in the parent.
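One data-structure reading of that table, with invented names: each
task carries one pid number per pspace on its ancestor chain, and each
column is just an index into that array. This is only a sketch of the
layout the table implies, not either patchset's code:

  #include <stdio.h>

  #define MAX_NESTING 8

  struct pspace_pid {
      int level;                  /* nesting depth of the task's pspace  */
      int numbers[MAX_NESTING];   /* numbers[0] = name in the root space */
  };

  /* The pid this task goes by in a pspace at depth d, or -1 if the
   * task has no name there. */
  static int pid_in_pspace(const struct pspace_pid *p, int depth)
  {
      if (depth < 0 || depth > p->level)
          return -1;
      return p->numbers[depth];
  }

  int main(void)
  {
      /* pid 236 in pspace 1 == pid 3 in pspace 2 == ... == pid 1 in 4 */
      struct pspace_pid init4 = { 3, { 236, 3, 2, 1 } };
      int d;

      for (d = 0; d <= 3; d++)
          printf("pspace %d sees pid %d\n",
                 d + 1, pid_in_pspace(&init4, d));
      return 0;
  }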
Eric
Quoting Eric W. Biederman ([email protected]):
> "Serge E. Hallyn" <[email protected]> writes:
> >> What happens when you migrate pspace 3 into a different pspace
> >> on a different machine?
> >
> > Nothing special. "Migrate" was just a checkpoint (from pspace 1)
> > and a resume (from pspace N on some machine). So now pspace N on
> > the new machine has created a new pspace - which happens to be
> > immediately populated with the contents of the old pspace 3 - and
> > sees the pid of the init process of this new pspace.
> >
> >> Is there a sane implementation for this?
> >
> > IMO, definitely yes.
> >
> > But I haven't tried it, so my opinion is just that.
>
> If you are just talking about the pid of the init process, the problem seems
> tractable.
>
> Where I see real problems with migration and nested pid spaces
> is when you expose all of your pids to your parent, and perhaps
> there was some miscommunication on this point.
>
> To try and give an example.
>
> pspace 1 pspace 2 pspace 3 pspace 4
> pid 234 -> pid 1
> pid 235 -> pid 2 -> pid 1
> pid 236 -> pid 3 -> pid 2 -> pid 1
>
> Hopefully this clearly shows what I was trying to avoid, by
> only allowing pid 1 of any pspace to be visible in the parent.
Yes, I saw it more like:
> pspace 1 pspace 2 pspace 3 pspace 4
> pid 234 -> pid 1
> pid 2 -> pid 1
> pid 2 -> pid 1
> pid 3
Now Dave and I were just talking about actually using the
init process in a pspace to do administration from outside.
For instance, the userspace code, in /sbin/pspaceinit, which
runs as (pspace 2, pid 1), could open a pipe with its parent
(pspace1, pid 234). pid 234 can then ask the init process to
do things like list processes, kill a process, and maybe even
recursively talk to the init process in pspace 3.
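A toy sketch of that control channel, using an ordinary fork and pipe
rather than real pspaces; the one-line "kill <pid>" protocol and the
/sbin/pspaceinit role are entirely made up for illustration:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
      int ctl[2];
      pid_t init_pid;

      if (pipe(ctl) < 0)
          return 1;
      init_pid = fork();
      if (init_pid == 0) {                /* pretend: (pspace 2, pid 1) */
          char cmd[64];
          ssize_t n;

          close(ctl[1]);
          while ((n = read(ctl[0], cmd, sizeof(cmd) - 1)) > 0) {
              cmd[n] = '\0';
              if (!strncmp(cmd, "kill ", 5))
                  /* the real init would kill(atoi(cmd + 5), SIGKILL)
                   * inside its own pid space */
                  printf("init: asked to kill local pid %d\n",
                         atoi(cmd + 5));
          }
          _exit(0);
      }
      /* Parent, e.g. (pspace 1, pid 234): drive the guest's init. */
      close(ctl[0]);
      if (write(ctl[1], "kill 2", 6) < 0)
          return 1;
      close(ctl[1]);                      /* EOF lets the child exit */
      waitpid(init_pid, NULL, 0);
      return 0;
  }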
-serge
On Thu, 2006-02-16 at 12:44 -0600, Serge E. Hallyn wrote:
> Now Dave and I were just talking about actually using the
> init process in a pspace to do administration from outside.
> For instance, the userspace code, in /sbin/pspaceinit, which
> runs as (pspace 2, pid 1), could open a pipe with its parent
> (pspace1, pid 234). pid 234 can then ask the init process to
> do things like list processes, kill a process, and maybe even
> recursively talk to the init process in pspace 3.
This would require a much smarter init, and that a child be nice,
cooperate and pass on what is requested of it if its nested children
are to be killed. If a child decided to be mean and ignore its parent's
requests, the parent can always just kill the child.
(Read the last sentence, and in case you're wondering, no I don't have
any children in real life)
-- Dave
On Thu, Feb 16, 2006 at 09:41:32AM -0800, Dave Hansen wrote:
> On Thu, 2006-02-16 at 15:30 +0100, Herbert Poetzl wrote:
> > > - Should a process have some sort of global (on the machine) identifier?
> >
> > this is mandatory, as it is required to kill any process
> > from the host (admin) context, without entering the pid
> > space (which would lead to all kinds of security issues)
>
> Giving admin processes the ability to enter pid spaces seems like it
> solves an entire class of problems, right?
really depends on the situation and the setup.
let me give a few examples here (I'll assume
full-blown guests, not just pid spaces here)
- guest is running some kind of resource hog,
which already reached the given guest limits
any attempt to enter the guest will fail
because the limits do not permit that
- the guest has been suspended (unscheduled)
because of some fork bomb running inside
and you want to kill off only the bomb
entering the guest would immediately stop
your process, so no way to send signals
- there is a pid killer running inside the
guest, which kills every newly created
process as soon as it is discovered
entering the guest would kill your shell
> Could you explain a bit what kinds of security issues it introduces?
well, it introduces a bunch of issues, not all
directly security related; here are some of them:
(I keep them general because most of them can
be worked around by additional checks and flags)
- ptrace from inside the context could hijack
your 'admin' task and use it for all kinds
of evil stuff
- entering the guest *spaces might cause some
issues with dynamic libraries
your process would show up in task lists and
guest security tracers, which might give a
false alarm when they get aware of your kill
task
the entire guest accounting regarding tasks
would get messed up by the 'outside' process
In general, I prefer to think of this as working
with nuclear material via an actuator from behind
a 4" lead wall -- you just do not want to go in
to fix things :)
best,
Herbert
> -- Dave
On Thu, 2006-02-16 at 20:12 +0100, Herbert Poetzl wrote:
> On Thu, Feb 16, 2006 at 09:41:32AM -0800, Dave Hansen wrote:
> > Giving admin processes the ability to enter pid spaces seems like it
> > solves an entire class of problems, right?
>
> really depends on the situation and the setup.
> let me give a few examples here (I'll assume
> full-blown guests, not just pid spaces here)
>
> - guest is running some kind of resource hog,
> which already reached the given guest limits
>
> any attempt to enter the guest will fail
> because the limits do not permit that
>
> - the guest has been suspended (unscheduled)
> because of some fork bomb running inside
> and you want to kill off only the bomb
>
> entering the guest would immediately stop
> your process, so no way to send signals
>
> - there is a pid killer running inside the
> guest, which kills every newly created
> process as soon as it is discovered
>
> entering the guest would kill your shell
Brainstorming ... what do you think about having a special init process
inside the child to act as a proxy of sorts? It is controlled by the
parent vserver/container, and would not be subject to resource limits.
It would not necessarily need to fork in order to kill other processes
inside the vserver (not subject to resource limits). It could also
continue when the rest of the guest was suspended.
A pid killer would be ineffective against such a process because you
can't kill init.
> > Could you explain a bit what kinds of security issues it introduces?
>
> well, it introduces a bunch of issues, not all
> directly security related; here are some of them:
> (I keep them general because most of them can
> be worked around by additional checks and flags)
>
> - ptrace from inside the context could hijack
> your 'admin' task and use it for all kinds
> of evil stuff
You can't ptrace init, right?
> In general, I prefer to think of this as working
> with nuclear material via an actuator from behind
> a 4" lead wall -- you just do not want to go in
> to fix things :)
Where does that lead you? Having a single global pid space which the
admin can see? Or, does a special set of system calls do it well
enough?
-- Dave
Dave Hansen wrote:
> Brainstorming ... what do you think about having a special init process
> inside the child to act as a proxy of sorts? It is controlled by the
> parent vserver/container, and would not be subject to resource limits.
> It would not necessarily need to fork in order to kill other processes
> inside the vserver (not subject to resource limits). It could also
> continue when the rest of the guest was suspended.
> A pid killer would be ineffective against such a process because you
> can't kill init.
Well, another approach would be to create a new context which has
visibility over the other container as well as the ability to send
signals to it.
>>In general, I prefer to think of this as working
>>with nuclear material via an actuator from behind
>>a 4" lead wall -- you just do not want to go in
>>to fix things :)
> Where does that lead you? Having a single global pid space which the
> admin can see? Or, does a special set of system calls do it well
> enough?
I don't like this term "single global pid space". Two containers might
be able to see all processes on the system, one might have a flat
mapping to all PIDs < 64k (or pid_max), one with the XID,PID encoded
bitwise. They are both global pid spaces, but there is no "single" one,
unless that is all you compile in.
Sam.
Serge E. Hallyn wrote:
> Quoting Eric W. Biederman ([email protected]):
>
>>"Serge E. Hallyn" <[email protected]> writes:
>>With respect to pids, let's not get caught up in the implementation
>>details. Let's first get clear on what the semantics should be.
>>
>>- Should the first pid in a pid space have pid 1?
that should only be required if you have system containers
or if there are tools or requirements such that, if I wander up my process tree,
I ultimately must end up at 1.
>>
>>- Should pid == 1 ignore signals it doesn't have a handler for?
Yes
>>
>>- Should any children of pid 1 be allowed to live when pid == 1 is killed?
No .. it has init semantics!
>
>
> But doesn't that depend on whether we use (pspace,pid) or vpids? If
> vpids, then init isn't really a problem, since from kernelspace
> processes in a comtainer stil have a global pid and global parent, and
> init knows them. If (pspace,pid), then we need a fakeinit bc the real
> init doesn't know about the processes in the container...
>
>
>>- Should a process have some sort of global (on the machine) identifier?
First establish whether that global ID has to be persistent ...
I don't see why! In which case the TASK_REF is the perfect global ID.
>
>
> I think to satisfy openvz existing customers this must be a yes. With
> vpid the answer is simple. With (pspace,pid), there are three answers I've
> heard, namely
>
> 1. just use pspaceid, pid
> 2. make pspaceid small and use (pspaceid << SOMEBITS | pid)
> 3. use pid1/pid2/pid3 where pid1 is creator of pid and its
> pspace, etc...
This implies that pid2 can be looked up in the context of pid1.
In the OpenVZ approach that's possible; in pspaces, isn't that the wpid?
>
> But the openvz guys also don't want userspace tool changes, making (2)
> the most likely option. Any other ideas?
>
>
>>- Should the pids in a pid space be visible from the outside?
>
>
> Again, the openvz guys say yes.
>
> I think it should be acceptable if a pidspace is visible in all its
> ancestor pidspaces. I.e. if I create pspace2 and pspace3 from pid 234
> in pspace1, then pspace2 doesn't need to be able to address pspace3
> and vice versa.
That means you need to do a more complicated lookup! For instance, let's say you have this hierarchy:
pspace1
|--->pspace2
| |--->pspace2a
|--->pspace3
|--->pspace3a
let's assume we use the (pspaceid << BITS | pid) global id. To verify, I have to
ensure that the target pid can reach the originating pid in its ancestor path.
Not a biggy, as these pspace trees probably won't get much deeper than 3 or 4.
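A minimal sketch of that ancestor-path check, with invented names; as
noted, the walk is linear but the trees stay shallow:

  #include <stdio.h>

  struct pspace {
      int nr;
      struct pspace *parent;      /* NULL for the root pspace */
  };

  /* May pspace 'origin' address pids living in pspace 'target'?
   * Yes iff origin is target itself or one of its ancestors. */
  static int may_address(struct pspace *origin, struct pspace *target)
  {
      struct pspace *p;

      for (p = target; p; p = p->parent)
          if (p == origin)
              return 1;
      return 0;
  }

  int main(void)
  {
      struct pspace p1  = { 1,  NULL };
      struct pspace p2  = { 2,  &p1 };
      struct pspace p2a = { 21, &p2 };
      struct pspace p3  = { 3,  &p1 };

      printf("p1 -> p2a: %d\n", may_address(&p1, &p2a));  /* 1: ancestor */
      printf("p3 -> p2a: %d\n", may_address(&p3, &p2a));  /* 0: sibling  */
      return 0;
  }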
>
> Kirill, is that acceptable?
>
>
>>- Should the parent of pid 1 be able to wait for it, as for its
>> children?
>
> Yes.
Yes ... VPID does that and wpid in pspace does that as well.
>
>
>>- Is a completely disjoint pid space acceptable to anyone?
>
>
> To anyone? yes :)
>
> To everyone, I don't think so.
>
hehh... yes they should be disjoint other than at the top
where we want to wait ..
>
>>- What should the parent of pid == 1 see?
>>
>>- Should a process not in the default pid space be able to create
>> another pid space?
>
>
> Yes.
How else do you get hierarchy ....
>
> This is to support using pidspaces for vservers, and creating
> migrateable sub-pidspaces in each vserver.
>
>
>>- Should we be able to monitor a pid space from the outside?
>
>
> To some extent, yes.
>
>
>>- Should we be able to have processes enter a pid space?
>
>
> IMO that is crucial.
Existing ones .. now that is potentially difficult to do, particularly
if you want to enter a pidspace that has already been migrated,
because one's assigned pid might already be taken in the target pspace.
>
>
>>- Do we need to be able to ptrace/kill individual processes
>> in a pid space, from the outside, and why?
>
>
> I think this is completely unnecessary so long as a process can enter a
> pidspace.
>
>
>>- After migration what identifiers should the tasks have?
>
>
> It must be possible to retain the same pids, at least from inside the
> container.
Absolutely .. otherwise all cached pids in userspace are meaningless.
>
> So this is irrelevant, as the openvz approach can just virtualize the
> old pid, while (pspace, pid) will be able to create a new container and
> use the old pid values, which are then guaranteed to not be in use.
Exactly .. moot issue; this is "trivial" as long as you can fork with
a particular pid.
>
>
>>If we can answer these kinds of questions we can likely focus in
>>on what the implementation should look like. So far I have not
>>seen a question that could not be implemented with a (pspace, pid)/pid
>>or a vpid/pid implementation.
>
>
> But you have, haven't you? Namely, how can openvz provide its
> customers with a global view of all processes without putting 5 years of
> work into a new sysadmin interface?
>
> -serge
>
Dave Hansen <[email protected]> writes:
> On Thu, 2006-02-16 at 12:44 -0600, Serge E. Hallyn wrote:
>> Now Dave and I were just talking about actually using the
>> init process in a pspace to do administration from outside.
>> For instance, the userspace code, in /sbin/pspaceinit, which
>> runs as (pspace 2, pid 1), could open a pipe with its parent
>> (pspace1, pid 234). pid 234 can then ask the init process to
>> do things like list processes, kill a process, and maybe even
>> recursively talk to the init process in pspace 3.
>
> This would require a much smarter init, and that a child be nice,
> cooperate and pass on what is requested of it if its nested children
> are to be killed. If a child decided to be mean and ignore its parent's
> requests, the parent can always just kill the child.
As for that: when I made that suggestion to Herbert Poetzl <[email protected]>,
his only concern was that a smart init might be too heavyweight for
a lightweight vserver. Generally I like the idea.
> (Read the last sentence, and in case you're wondering, no I don't have
> any children in real life)
Speaking of that problematic naming. One of my coworkers mentioned that
it is unfortunate that our set of names does not have a double meaning.
After that the suggestion came up to call them families, instead of guests
or pidspaces. Although I guess calling them guests is about as bad :)
Eric
"Serge E. Hallyn" <[email protected]> writes:
> Quoting Eric W. Biederman ([email protected]):
>> "Serge E. Hallyn" <[email protected]> writes:
>> >> What happens when you migrate pspace 3 into a different pspace
>> >> on a different machine?
>> >
>> > Nothing special. "Migrate" was just a checkpoint (from pspace 1)
>> > and a resume (from pspace N on some machine). So now pspace N on
>> > the new machine has created a new pspace - which happens to be
>> > immediately populated with the contents of the old pspace 3 - and
>> > sees the pid of the init process of this new pspace.
>> >
>> >> Is there a sane implementation for this?
>> >
>> > IMO, definitely yes.
>> >
>> > But I haven't tried it, so my opinion is just that.
>>
>> If you are just talking about the pid of the init process, the problem seems
>> tractable.
>>
>> Where I see real problems with migration and nested pid spaces
>> is when you expose all of your pids to your parent, and perhaps
>> there was some miscommunication on this point.
>>
>> To try and give an example.
>>
>> pspace 1 pspace 2 pspace 3 pspace 4
>> pid 234 -> pid 1
>> pid 235 -> pid 2 -> pid 1
>> pid 236 -> pid 3 -> pid 2 -> pid 1
>>
>> Hopefully this clearly shows what I was trying to avoid, by
>> only allowing pid 1 of any pspace to be visible in the parent.
>
> Yes, I saw it more like:
>
>> pspace 1 pspace 2 pspace 3 pspace 4
>> pid 234 -> pid 1
>> pid 2 -> pid 1
>> pid 2 -> pid 1
>> pid 3
I kind of figured. I just wanted it to be clear why I have
a problem with the semantics of the current VPID implementation.
Especially as the tone of the post from the last day or two
was: Can't we satisfy everyone?
The picture you drew is all that is required for my implementation
and it doesn't run into problems because you only need one PID in
your parent pspace.
Eric
On Fri, Feb 17, 2006 at 03:57:26AM -0700, Eric W. Biederman wrote:
> Dave Hansen <[email protected]> writes:
>
> > On Thu, 2006-02-16 at 12:44 -0600, Serge E. Hallyn wrote:
> >> Now Dave and I were just talking about actually using the
> >> init process in a pspace to do administration from outside.
> >> For instance, the userspace code, in /sbin/pspaceinit, which
> >> runs as (pspace 2, pid 1), could open a pipe with its parent
> >> (pspace1, pid 234). pid 234 can then ask the init process to
> >> do things like list processes, kill a process, and maybe even
> >> recursively talk to the init process in pspace 3.
> >
> > This would require a much smarter init, and that a child be nice,
> > cooperate and pass on what is requested of it if its nested children
> > are to be killed. If a child decided to be mean and ignore its parent's
> > requests, the parent can always just kill the child.
>
> As for that: when I made that suggestion to Herbert Poetzl,
> his only concern was that a smart init might be too heavyweight
> for a lightweight vserver. Generally I like the idea.
well, may I remind you that this solution would require _two_
init processes for each guest, which could easily make up
300-400 unnecessary processes in a lightweight server
setup?
> > (Read the last sentence, and in case you're wondering, no I don't have
> > any children in real life)
>
> Speaking of that problematic naming. One of my coworkers mentioned that
> it is unfortunate that our set of names does not have a double meaning.
> After that the suggestion came up to call them families, instead of guests
> or pidspaces. Although I guess calling them guests is about as bad :)
well, at least Guests or VEs are terms already used by
existing projects, where pspace sounds somewhat strange.
at the same time I'd like to point out that *spaces is
a good name for the building blocks, but we definitely
have to name the 'construct' differently, i.e. a 'guest'
(or VPS or VE or whatever) is _more_ than just a p-space;
it's the sum of all *-spaces required to make it look
like a real linux system.
best,
Herbert
> Eric
Herbert Poetzl <[email protected]> writes:
> On Fri, Feb 17, 2006 at 03:57:26AM -0700, Eric W. Biederman wrote:
>> As for that: when I made that suggestion to Herbert Poetzl,
>> his only concern was that a smart init might be too heavyweight
>> for a lightweight vserver. Generally I like the idea.
>
> well, may I remind you that this solution would require _two_
> init processes for each guest, which could easily make up
> 300-400 unnecessary processes in a lightweight server
> setup?
I take it seriously enough that I remembered the concern,
and I think it is legitimate. Figuring out how to safely
set the policy is a challenge. That is something a
user space daemon trivially gets right.
The kernel side of a process is about 10K; if the user space
side were also lightweight, we could have the entire
per-process cost in the 30K range. 30K*400 = 12000K = 12M.
That is significant but we are still cheap enough that it
isn't necessarily a show stopper.
I think the cost was only one extra process: for the case where you
have fakeinit now, it would be init; for other cases it would be a
daemon that gets set up when you initialize the vserver.
If we can get a permission checking model in the kernel right,
it is potentially much cheaper to have an enter model.
Having user space as a backup to that is still interesting.
>> > (Read the last sentence, and in case you're wondering, no I don't have
>> > any children in real life)
>>
>> Speaking of that problematic naming. One of my coworkers mentioned that
>> it is unfortunate that our set of names does not have a double meaning.
>> After that the suggestion came up to call them families, instead of guests
>> or pidspaces. Although I guess calling them guests is about as bad :)
>
> well, at least Guests or VEs are terms already used by
> existing projects, where pspace sounds somewhat strange.
>
> at the same time I'd like to point out that *spaces is
> a good name for the building blocks, but we definitely
> have to name the 'construct' differently, i.e. a 'guest'
> (or VPS or VE or whatever) is _more_ than just a p-space;
> it's the sum of all *-spaces required to make it look
> like a real linux system.
I totally agree. Sorry. This was meant as a humorous tangent!
I thought the smiley, and the fact that I was looking for a name
whose double meaning would have made it easier to get
confused, would have made that clear!
Oh well, such is confusion in email :)
Eric
On Fri, Feb 17, 2006 at 05:16:06AM -0700, Eric W. Biederman wrote:
> Herbert Poetzl <[email protected]> writes:
>
> > On Fri, Feb 17, 2006 at 03:57:26AM -0700, Eric W. Biederman wrote:
> >> As for that: when I made that suggestion to Herbert Poetzl,
> >> his only concern was that a smart init might be too heavyweight
> >> for a lightweight vserver. Generally I like the idea.
> >
> > well, may I remind you that this solution would require _two_
> > init processes for each guest, which could easily make up
> > 300-400 unnecessary processes in a lightweight server
> > setup?
>
> I take it seriously enough that I remembered the concern,
> and I think it is legitimate. Figuring out how to safely
> set the policy is a challenge. That is something a
> user space daemon trivially gets right.
>
> The kernel side of a process is about 10K; if the user space
> side were also lightweight, we could have the entire
> per-process cost in the 30K range. 30K*400 = 12000K = 12M.
that's something I'm not so worried about, but a statically
compiled userspace process with 20K sounds unusual in the
time of 2M *libcs :)
> That is significant but we are still cheap enough that it
> isn't necessarily a show stopper.
>
> I think the cost was only one extra process: for the case where you
> have fakeinit now, it would be init; for other cases it would be a
> daemon that gets set up when you initialize the vserver.
well, depends, currently we do not need a parent to handle
the guest, so there is _no_ waiting process in the light-
weight case either, which makes that two processes for each
guest, no?
anyway, I'm not strictly against having an init process
inside a guest, as long as it is not an essential part
of the overall design, because that would make it much
harder to rip it out later :)
best,
Herbert
> If we can get a permission checking model right in the kernel,
> an enter model is potentially much cheaper.
>
> Having user space as a backup to that is still interesting.
>
> >> > (Read the last sentence, and in case you're wondering, no I don't have
> >> > any children in real life)
> >>
> >> Speaking of that problematic naming. One of my coworkers mentioned that
> >> it is unfortunate that our set of names does not have a double meaning.
> >> After that the suggestion came up to call them families, instead of
> >> guests or pidspaces. Although I guess calling them guests is about as
> >> bad :)
> >
> > well, at least Guests or VEs are terms already used by
> > existing projects, where pspace sounds somewhat strange.
> >
> > at the same time I'd like to point out that *spaces is
> > a good name for the building blocks, but we definitely
> > have to name the 'construct' differently, i.e. a 'guest'
> > (or VPS or VE or whatever) is _more_ than just a p-space;
> > it's the sum of all *-spaces required to make it look
> > like a real linux system.
>
> I totally agree. Sorry. This was meant as a humorous tangent!
> I thought the smiley, and the fact that I was looking for a name
> with a double meaning that would have made it easier to get
> confused, would have made that clear!
>
> Oh well, such is confusion in email :)
>
> Eric
Herbert Poetzl <[email protected]> writes:
> that's something I'm not so worried about, but a statically
> compiled userspace process with 20K sounds unusual in the
> time of 2M *libcs :)
For unshared data, a stack, and page tables, that should be achievable
even with glibc, but I haven't sat down and done the math. But
last I looked, if you took out the debug symbols, glibc was only 1M.
If glibc can't do it, one of the lightweight embedded C libraries
certainly should be able to.
>> That is significant but we are still cheap enough that it
>> isn't necessarily a show stopper.
>>
>> I think the cost was only one extra process, for the case where you
>> have fakeinit now it would be init, for other cases it would be a
>> daemon that gets setup when you initialize the vserver.
>
> well, depends, currently we do not need a parent to handle
> the guest, so there is _no_ waiting process in the light-
> weight case either, which makes that two processes for each
> guest, no?
As I put it together you don't need a parent. The parent can wait for it
or exit; the child doesn't care. Usual unix semantics here :)
If you are using a pipe to communicate with the outside world, that
has to be put someplace, but you can always create a fifo in
the filesystem. You could have a single parent for all of
the lightweight guests.
Lots of choices on how to put the pieces together.
> anyway, I'm not strictly against having an init process
> inside a guest, as long as it is not an essential part
> of the overall design, because that would make it much
> harder to rip it out later :)
Except for being a child reaper, no. Having a process that
is the child reaper is interesting mainly because it allows
you to get an accurate struct rusage picture of all of the
children in the pspace.
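(To make the rusage point concrete, here is a minimal user-space sketch
of a reaper accumulating per-child usage with wait4(); it is only an
illustration, not code from any of the patches under discussion:)

	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <sys/resource.h>

	/* Reap every child and accumulate its resource usage;
	 * wait4() fills in the rusage of each reaped child. */
	static void reap_and_account(void)
	{
		struct rusage ru;
		long user_ms = 0;
		pid_t pid;

		while ((pid = wait4(-1, NULL, 0, &ru)) > 0) {
			user_ms += ru.ru_utime.tv_sec * 1000
				 + ru.ru_utime.tv_usec / 1000;
			printf("reaped %d, total user time %ld ms\n",
			       (int)pid, user_ms);
		}
	}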
Eric
Herbert Poetzl wrote:
> On Fri, Feb 17, 2006 at 05:16:06AM -0700, Eric W. Biederman wrote:
>
>>Herbert Poetzl <[email protected]> writes:
>>
>>
>>>On Fri, Feb 17, 2006 at 03:57:26AM -0700, Eric W. Biederman wrote:
>>>
>>>>As for that. When I made that suggestion to Herbert Poetzl
>>>>his only concern was that a smart init might be too heavyweight
>>>>for a lightweight vserver. Generally I like the idea.
>>>
>>>well, may I remind that this solution would require _two_
>>>init processes for each guest, which could easily make up
>>>300-400 unnecessary processes in a lightweight server
>>>setup?
>>
>>I take it seriously enough that I remembered the concern,
>>and I think it is legitimate. Figuring out how to safely
>>set the policy is a challenge. That is something a
>>user space daemon trivially gets right.
>>
>>The kernel side of a process is about 10K; if the user space
>>side were also lightweight, we could keep the entire
>>per-process cost in the 30K range. 30K*400 = 12000K = 12M.
>
>
> that's something I'm not so worried about, but a statically
> compiled userspace process with 20K sounds unusual in the
> time of 2M *libcs :)
>
>
>>That is significant but we are still cheap enough that it
>>isn't necessarily a show stopper.
>>
>>I think the cost was only one extra process: in the case where you
>>have fakeinit now, it would be init; in other cases it would be a
>>daemon that gets set up when you initialize the vserver.
>
Eric, Herbert... why do we need an extra process in each and every
pspace?
Why not have a single global pspace-init daemon that acts as the reaper
for all pspace-top processes?
It's only at the boundaries of pspaces and with signals where we
seem to have trouble.
The "pspace-init" reaps the signal of all its sub-pspace's top processes
and then "forwards" the signal to processes actually waiting.
Kind of an interposer.
Same way from the other side.
You allocate a pid on behalf of the process you spawn in your pidspace.
You mark in the pid hash that this entry is merely a proxy,
and you forward that to the pspace-init, where you have a separate lookup
with <pspace-caller,pspace,pid>.
Same with signals: once the signal is reaped by pspace-init and it has
looked up the parent pspace and the pid in there, we forward it.
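(Purely to make the interposer idea concrete, a rough sketch of such a
proxy entry; all names are hypothetical and this is not code from any
posted patch:)

	/* Hypothetical shape of the interposer's lookup table entry:
	 * a proxy pid visible in the caller's pspace maps to the real
	 * (pspace, pid) pair that pspace-init forwards to. */
	struct proxy_pid {
		int caller_pspace;  /* pspace where the proxy pid is visible */
		int target_pspace;  /* pspace the real task lives in */
		int target_pid;     /* pid inside the target pspace */
		int is_proxy;       /* flag set in the pid hash entry */
	};

	/* pspace-init would look up <pspace-caller, pspace, pid> in a
	 * table of these and forward signals/exit status across the
	 * pspace boundary. */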
Is something like that workable, idiotic (be kind), too intrusive?
-- Hubertus
>
> well, depends, currently we do not need a parent to handle
> the guest, so there is _no_ waiting process in the light-
> weight case either, which makes that two processes for each
> guest, no?
>
> anyway, I'm not strictly against having an init process
> inside a guest, as long as it is not an essential part
> of the overall design, because that would make it much
> harder to rip it out later :)
>
> best,
> Herbert
>
>
Quoting Hubertus Franke ([email protected]):
> >>- Should a process have some sort of global (on the machine) identifier?
>
> First establish whether that global ID has to be persistent ...
> I don't see why! In which case the TASK_REF is the perfect global ID.
The task_refs were only intended to be used in the kernel, iiuc.
...
> >>- Should we be able to have processes enter a pid space?
> >
> >
> >IMO that is crucial.
>
> Existing ones... now that is potentially difficult to do, particularly
> if you want to enter a pidspace that has already been migrated,
> because one's assigned pid might already be taken in the target pspace.
Well the answer changes depending on whether the question asks
about pid spaces, or full containers. For full containers, you have
problems with unsharing various pieces of namespace. But for the
pidspace, a simple function with clone() semantics makes perfect sense.
So your code does:
pid_t child_pid;

child_pid = pidspace_clone(pspace_id);
if (child_pid == 0) {
	/* We are inside pspace 'pspace_id', with a random
	 * pid which is unique in that pspace. */
} else {
	/* The child is known by some other pid in its own
	 * pspace, but in our pspace it is known and hashed
	 * by 'child_pid'; the two pids are likely different. */
}
So long as we're just talking about the pidspaces, I don't see any
unexpected side effects here.
-serge
On Fri, Feb 17, 2006 at 08:39:51AM -0500, Hubertus Franke wrote:
> Herbert Poetzl wrote:
> >On Fri, Feb 17, 2006 at 05:16:06AM -0700, Eric W. Biederman wrote:
> >
> >>Herbert Poetzl <[email protected]> writes:
> >>
> >>
> >>>On Fri, Feb 17, 2006 at 03:57:26AM -0700, Eric W. Biederman wrote:
> >>>
> >>>>As for that. When I made that suggestion to Herbert Poetzl
> >>>>his only concern was that a smart init might be too heavyweight
> >>>>for a lightweight vserver. Generally I like the idea.
> >>>
> >>>well, may I remind that this solution would require _two_
> >>>init processes for each guest, which could easily make up
> >>>300-400 unnecessary processes in a lightweight server
> >>>setup?
> >>
> >>I take it seriously enough that I remembered the concern,
> >>and I think it is legitimate. Figuring out how to safely
> >>set the policy is a challenge. That is something a
> >>user space daemon trivially gets right.
> >>
> >>The kernel side of a process is about 10K; if the user space
> >>side were also lightweight, we could keep the entire
> >>per-process cost in the 30K range. 30K*400 = 12000K = 12M.
> >
> >
> >that's something I'm not so worried about, but a statically
> >compiled userspace process with 20K sounds unusual in the
> >time of 2M *libcs :)
> >
> >
> >>That is significant but we are still cheap enough that it
> >>isn't necessarily a show stopper.
> >>
> >>I think the cost was only one extra process: in the case where you
> >>have fakeinit now, it would be init; in other cases it would be a
> >>daemon that gets set up when you initialize the vserver.
> >
>
> Eric, Herbert... why do we need an extra process in each and every
> pspace?
>
> Why not have a single global pspace-init daemon that acts as the reaper
> for all pspace-top processes? It's only at the boundaries of pspaces
> and with signals where we seem to have trouble.
that would probably work, but I think it adds some
complications and might require certain design changes.
Just to give some ideas:
- how to reach the guest space if there is no 'handle'?
- how to handle hierarchical contexts?
> The "pspace-init" reaps the signal of all its sub-pspace's top
> processes and then "forwards" the signal to processes actually
> waiting. Kind of an interposer. Same way from the other side.
>
> You allocate a pid on behalf of the process you spawn in your
> pidspace. You mark in the pid hash that this entry is merely
> a proxy, and you forward that to the pspace-init, where you have a
> separate lookup with <pspace-caller,pspace,pid>.
>
> Same with signals: once the signal is reaped by pspace-init and it has
> looked up the parent pspace and the pid in there, we forward it.
yup, could work ...
best,
Herbert
> Is something like that workable, idiotic (be kind), too intrusive?
>
> -- Hubertus
>
>
> >
> >well, depends, currently we do not need a parent to handle
> >the guest, so there is _no_ waiting process in the light-
> >weight case either, which makes that two processes for each
> >guest, no?
> >
> >anyway, I'm not strictly against having an init process
> >inside a guest, as long as it is not an essential part
> >of the overall design, because that would make it much
> >harder to rip it out later :)
> >
> >best,
> >Herbert
> >
> >
Hello,
> With respect to pids let's not get caught up in the implementation
> details. Let's first get clear on what the semantics should be.
>
> - Should the first pid in a pid space have pid 1?
yup.
> - Should pid == 1 ignore signals, it doesn't have a handler for?
yup.
> - Should any children of pid 1 be allowed to live when pid == 1 is killed?
nope. you have this problem in your code, when child_reaper references
freed task.
> - Should a process have some sort of global (on the machine) identifier?
yep. Otherwise it is impossible to manage (ptrace, kill, ...) it without
introducing new syscalls.
> - Should the pids in a pid space be visible from the outside?
This can be made tunable, but it is VERY highly desired.
It also makes introducing many new syscalls unnecessary.
> - Should the parent of pid 1 be able to wait for it, as for its
> children?
What for? It doesn't guarantee that destruction of the container has completed.
> - Is a completely disjoint pid space acceptable to anyone?
no, not acceptable for us :)
> - What should the parent of pid == 1 see?
if pidspaces are fully isolated, it should see nothing (otherwise it is
still weak isolation, as the host admin will be able to get access to the
container).
if pidspaces are weakly isolated, it should see the whole process tree.
> - Should a process not in the default pid space be able to create
> another pid space?
optional. I can really hardly see its use cases, if any...
Yes, I remember some talks about checkpointing groups of processes,
but this doesn't help it, believe me (ask Kuznetsov :) )...
> - Should we be able to monitor a pid space from the outside?
Yes. We strongly believe we need it.
> - Should we be able to have processes enter a pid space?
Yes. The same.
> - Do we need to be able to ptrace/kill individual processes
> in a pid space, from the outside, and why?
Yes. This is a very helpful management feature; otherwise you won't be
able to resolve issues with containers. Why is it stuck? For example,
after a checkpoint/restore, how do you plan to debug it?
> - After migration what identifiers should the tasks have?
Pids? In the pspace they should have the same pids which were assigned to
them on fork(); in the host they can have any other pid allocated.
> If we can answer these kinds of questions we can likely focus in
> on what the implementation should look like. So far I have not
> seen a question that could not be implemented with a (pspace, pid)/pid
> or a vpid/pid implementation.
It seems to me that we still talk too much about PID spaces, while this
is not the most problematic thing. This can stay out of tree for some time
if required.
> I think it is safe to say that from the inside things should look to
> processes just as they do now. Which answers a lot of those
> questions. But it still leaves a lot open.
Kirill
>> However, if we're going to get anywhere, the first decision which we
>> need to make is whether to go with a (container,pid), (pspace,pid) or
>> equivalent pair like approach, or a virtualized pid approach. Linus had
>> previously said that he prefers the former. Since there has been much
>> discussion since then, I thought I'd try to recap the pros and cons of
>> each approach, with the hope that the head Penguins will chime in one
>> more time, after which we can hopefully focus our efforts.
I think the first thing we have to do is not to decide which pids we
want to see, but what and how we want to virtualize.
> I am thinking that you can have both. Not in the sense of
> overcomplicating, but in the sense of having your cake and eating it
> too.
BTW, really, why not both?
Eric, can we do something universal which will suit all the parties?
> The only thing which is a unique, system-wide identifier for the process
> is the &task_struct. So we are already virtualising this pointer into a
> PID for userland. The only difference is that we cache it (nay, keep
> the authoritative version of it) in the task_struct.
>
> The (XID, PID) approach internally is also fine. This says that there
> is a container XID, and within it, the PID refers to a particular
> task_struct. A given task_struct will likely exist in more than one
> place in the (XID, PID) space. Perhaps the values of PID for XID = 0
> and XID = task.xid can be cached in the task_struct, but that is a
> detail.
>
> Depending on the flags on the XID, we can incorporate all the approaches
> being tabled. You want virtualised pids? Well, that'll hurt a little,
> but suit yourself - set a flag on your container and inside the
> container you get virtualised PIDs. You want a flat view for all your
> vservers? Fine, just use an XID without the virtualisation flag and
> with the "all seeing eye" property set. Or you use an XID _with_ the
> virtualisation flag set, and then call a tuple-endowed API to find the
> information you're after.
This sounds good. But pspaces are also used for access controls, so this
should be incorporated there as well.
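(For concreteness, the flag-based XID behaviour quoted above might be
sketched like this; the names and flags are hypothetical, not from any
posted patch:)

	/* Hypothetical per-container flags selecting the pid-view policy. */
	#define XID_VIRTUALISE_PIDS	0x1	/* inside view gets virtualised pids */
	#define XID_ALL_SEEING		0x2	/* flat view of every task on the box */

	struct xid_space {
		int xid;		/* container identifier */
		unsigned long flags;	/* combination of XID_* flags */
	};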
Kirill
>>- Should the pids in a pid space be visible from the outside?
>
> Again, the openvz guys say yes.
>
> I think it should be acceptable if a pidspace is visible in all its
> ancestor pidspaces. I.e. if I create pspace2 and pspace3 from pid 234
> in pspace1, then pspace2 doesn't need to be able to address pspace3
> and vice versa.
>
> Kirill, is that acceptable?
yes, acceptable.
once again, believe me, this is a very much required feature for
troubleshooting and management (as Eric likes to talk about maintenance :) )
>>- Should the parent of pid 1 be able to wait for it, as for its
>> children?
> Yes.
why? any reason?
>>- Should a process not in the default pid space be able to create
>> another pid space?
>
>
> Yes.
>
> This is to support using pidspaces for vservers, and creating
> migrateable sub-pidspaces in each vserver.
this doesn't help to create migratable sub-pidspaces.
For example, will you share IPCs between your parent and child pid
pspaces? If yes, then they won't be migratable; if no, then you need to
create fully isolated spaces all the way down, and again you end up with
the question of why nested pspaces are required at all?
>>- Should we be able to monitor a pid space from the outside?
> To some extent, yes.
SURE! :)
>>- Should we be able to have processes enter a pid space?
> IMO that is crucial.
required.
>>- Do we need to be able to ptrace/kill individual processes
>> in a pid space, from the outside, and why?
> I think this is completely unnecessary so long as a process can enter a
> pidspace.
No. This is required, because a container can be limited with resource
limitations and you may be unable to enter it. For example, if a
container forked() many threads up to its limit, you won't be able to
enter it.
>>- After migration what identifiers should the tasks have?
> So this is irrelevant, as the openvz approach can just virtualize the
> old pid, while (pspace, pid) will be able to create a new container and
> use the old pid values, which are then guaranteed to not be in use.
agreed. irrelevant.
>>If we can answer these kinds of questions we can likely focus in
>>on what the implementation should look like. So far I have not
>>seen a question that could not be implemented with a (pspace, pid)/pid
>>or a vpid/pid implementation.
> But you have, haven't you? Namely, how can openvz provide its
> customers with a global view of all processes without putting 5 years of
> work into a new sysadmin interface?
it is not only about OpenVZ. This is about manageability.
This is a feature our users like _very_ much: that the administrator can
fix the problems. Have you ever tried to fix a broken VM in VMware/Xen?
On the other hand, the VPID approach can fully isolate containers if
needed for security reasons.
Kirill
Hello,
>>- Should any children of pid 1 be allowed to live
>> when pid == 1 is killed?
> again, that's a feature which would be nice, especially
> for the lightweight contexts, which do not have an init
> process running inside the guest
whom should child_reaper refer to?
>>- Should a process have some sort of global (on the machine) identifier?
> this is mandatory, as it is required to kill any process
> from the host (admin) context, without entering the pid
> space (which would lead to all kinds of security issues)
fine, agreed on this finally, same for OpenVZ.
>>- Should the pids in a pid space be visible from the outside?
> yes, while not strictly required, folks really like to
> view the overall system state. this can be done with the
> help of special tools, but again it should be done
> without entering each guest pid space ...
also fine.
>>- Should the parent of pid 1 be able to wait for it, as for its
>> children?
> definitely, we (Linux-VServer) added this some time ago
> and it helps to maintain/restart a guest.
but why sys_waitpid? We can make it in many other ways,
can't we? Moreover, sys_waitpid() is the most unnatural from my point of
view, since the container is not fully dead when the last process dies;
some cleanup is postponed.
And we had issues in OpenVZ where a very fast VPS stop/start could fail
due to resources not yet freed.
>>- Is a completely disjoint pid space acceptable to anyone?
> yes, as long as the aforementioned access, management
> and control mechanisms are in place ...
then it is not disjoint? :)
>>- What should the parent of pid == 1 see?
> doesn't really matter, but I see three options there:
>
> - the parent space
> - the child space
> - both
but should the parent see the pspace init? Only one task from the pspace?
>>- Should we be able to monitor a pid space from the outside?
> yes, definitely, but it could happen via some special
> interfaces, i.e. no need to make it compatible
disagree. Why do we need to introduce copies of existing syscalls?
Do you want to fix all the existing apps - ps, top, kill, etc.? How about
third party apps?
>>- Should we be able to have processes enter a pid space?
> definitely, without that, the entire VPS concept will
> not work, folks use the 'admin' backdoors 90% of the
> time ...
agreed. Though I don't like the name 'backdoor' :) It is just a way to
get access to the VPS.
Kirill
>>this is mandatory, as it is required to kill any process
>>from the host (admin) context, without entering the pid
>>space (which would lead to all kinds of security issues)
> Just to be clear: you think there should be cases where pspace x can
> kill processes in pspace y, but can't enter it?
YES! When you have resource management such a situation is quite common.
As I wrote in another email, for example:
a VPS has reached its process limit and you can't enter it.
If you suggest making enter work without resource limitations, then it
will be a security hole.
>>>- Should we be able to monitor a pid space from the outside?
>>
>>yes, definitely, but it could happen via some special
>>interfaces, i.e. no need to make it compatible
>
>
> What sort of interfaces do you envision for these two? If we
> can lay them out well enough, maybe the result will satisfy the
> openvz folks?
>
> For instance, perhaps we just use a proc interface, where in the
> current pspace, if we've created a new pspace which in our pspace
> is known as process 567, then we might see
>
> /proc
> /proc/567
> /proc/567/pspace
> /proc/567/pspace/1 -> link to /proc/567
> /proc/567/pspace/2
>
> Now we also might be able to interact with the pspace by doing
> something like
>
> echo -9 > /proc/567/pspace/2/kill
>
> and of course do things like
>
> cd /proc/567/pspace/1/root
uff... we can start calling ptrace etc. via proc then :)
UGLY! I think Linus will ban us for such ideas. And he will be right.
Kirill
>>This is to support using pidspaces for vservers, and creating
>>migrateable sub-pidspaces in each vserver.
>
>
> Agreed.
>
> Now this case is very interesting, because supporting it creates
> interesting restrictions on the rest of the problem, and
> unless I miss something this is where the OpenVZ implementation
> currently falls down.
why do you think so? The VPIDs approach supports nested pspaces easily.
Moreover it can be used in any configuration. See below.
> Which names does the intermediate pidspace (vserver) see the child
> pidspace with?
options:
- all pspaces except for the host system can live fully with virtual pids
- you can restrict what a parent pspace can see of its child, and as in
  your case you can see only "init".
- you can make fully isolated pspaces, where these problems don't
  arise at all.
> Which names does the initial pidspace see the child pid space with?
the initial pidspace always sees "global" pids.
>>>- Do we need to be able to ptrace/kill individual processes
>>> in a pid space, from the outside, and why?
>>
>>I think this is completely unnecessary so long as a process can enter a
>>pidspace.
See my other emails. This is required.
1. Enter doesn't always work, e.g. due to resource limitations.
2. You may not want to install some apps inside, especially taking into
account that the libs in a VPS can be broken.
>>But you have, haven't you? Namely, how can openvz provide its
>>customers with a global view of all processes without putting 5 years of
>>work into a new sysadmin interface?
> Well I think we can reuse most of the old sysadmin interfaces, yes.
It doesn't look so to me.
Kirill
>>this is mandatory, as it is required to kill any process
>>from the host (admin) context, without entering the pid
>>space (which would lead to all kinds of security issues)
>
>
> Giving admin processes the ability to enter pid spaces seems like it
> solves an entire class of problems, right? Could you explain a bit
> what kinds of security issues it introduces?
Enter is not always possible,
for example when you have exhausted your resources in the VPS
(e.g. hit the process limit inside).
And you can't make enter ignore resource limitations, since that would
be a security hole.
Kirill
On Mon, Feb 20, 2006 at 12:37:25PM +0300, Kirill Korotaev wrote:
> >>- Should the pids in a pid space be visible from the outside?
> >
> >Again, the openvz guys say yes.
> >
> >I think it should be acceptable if a pidspace is visible in all its
> >ancestor pidspaces. I.e. if I create pspace2 and pspace3 from pid 234
> >in pspace1, then pspace2 doesn't need to be able to address pspace3
> >and vice versa.
> >
> >Kirill, is that acceptable?
> yes, acceptable.
> once again, believe me, this is a very much required feature for
> troubleshooting and management (as Eric likes to talk about
> maintenance :) )
IMHO there are certain things which _are_ required
and others which are nice to have but not strictly
required, just think "ptrace across pid spaces"
> >>- Should the parent of pid 1 be able to wait for it, as for its
> >> children?
> >Yes.
> why? any reason?
>
> >>- Should a process not in the default pid space be able to create
> >> another pid space?
> >
> >Yes.
> >
> >This is to support using pidspaces for vservers, and creating
> >migrateable sub-pidspaces in each vserver.
> this doesn't help to create migratable sub-pidspaces.
> For example, will you share IPCs between your parent and child pid
> pspaces? If yes, then they won't be migratable;
well, not the child pspace, but the parent, no?
> if no, then you need to create fully isolated spaces all the way down,
> and again you end up with the question of why nested pspaces are
> required at all?
because we are not trying to implement a VPS-only
solution for mainline; we are trying to provide
building blocks for many different uses, including
the VPS approach ...
> >>- Should we be able to monitor a pid space from the outside?
> >To some extent, yes.
> SURE! :)
>
> >>- Should we be able to have processes enter a pid space?
> >IMO that is crucial.
> required.
>
> >>- Do we need to be able to ptrace/kill individual processes
> >> in a pid space, from the outside, and why?
> >I think this is completely unnecessary so long as a process can enter a
> >pidspace.
> No. This is required.
ptrace across pid spaces is not required, it is
nice to have and probably adds a bunch of security
issues ...
> Because a container can be limited with resource limitations, you
> may be unable to enter it. For example, if a container forked() many
> threads up to its limit, you won't be able to enter it.
>
> >>- After migration what identifiers should the tasks have?
> >So this is irrelevant, as the openvz approach can just virtualize the
> >old pid, while (pspace, pid) will be able to create a new container and
> >use the old pid values, which are then guaranteed to not be in use.
> agreed. irrelevant.
>
> >>If we can answer these kinds of questions we can likely focus in
> >>on what the implementation should look like. So far I have not
> >>seen a question that could not be implemented with a (pspace, pid)/pid
> >>or a vpid/pid implementation.
> >But you have, haven't you? Namely, how can openvz provide its
> >customers with a global view of all processes without putting 5 years of
> >work into a new sysadmin interface?
> it is not only about OpenVZ. This is about manageability.
management tools should have a way to get the
required information; they do not necessarily need
to utilize existing interfaces ...
> This is a feature our users like _very_ much: that the administrator can
> fix the problems. Have you ever tried to fix a broken VM in VMware/Xen?
> On the other hand, the VPID approach can fully isolate containers if
> needed for security reasons.
best,
Herbert
> Kirill
On Mon, Feb 20, 2006 at 12:50:02PM +0300, Kirill Korotaev wrote:
> Hello,
>
> >>- Should any children of pid 1 be allowed to live
> >> when pid == 1 is killed?
>
> >again, that's a feature which would be nice, especially
> >for the lightweight contexts, which do not have an init
> >process running inside the guest
> whom should child_reaper refer to?
as Eric already pointed out, those tasks could
either be self-reaping or be reaped by a single
kernel init/reaper
> >>- Should a process have some sort of global (on the machine)
> >>identifier?
> >this is mandatory, as it is required to kill any process
> >from the host (admin) context, without entering the pid
> >space (which would lead to all kinds of security issues)
> fine, agreed on this finally, same for OpenVZ.
hey, we have something :)
> >>- Should the pids in a pid space be visible from the outside?
> >yes, while not strictly required, folks really like to
> >view the overall system state. this can be done with the
> >help of special tools, but again it should be done
> >without entering each guest pid space ...
> also fine.
>
> >>- Should the parent of pid 1 be able to wait for it, as for its
> >> children?
> >definitely, we (Linux-VServer) added this some time ago
> >and it helps to maintain/restart a guest.
> but why sys_waitpid? We can make it in many other ways,
yes, we currently have a syscall switch command
to wait for the guest, but, of course, it is
very similar to the 'normal' unix waitpid()
> can't we? Moreover, sys_waitpid() is the most unnatural from my point
> of view, since the container is not fully dead when the last process
> dies; some cleanup is postponed.
that is correct; the interesting event is not
the disposal of the pid space (i.e. not the last
reference to it), it is the exit of the last task
inside the pid space ...
> And we had issues in OpenVZ where a very fast VPS stop/start could fail
> due to resources not yet freed.
this is a design problem. If your design allows
_more_ than one pid space with the same
identifier/properties, but with only one active
and thus reachable space, it is no problem to
create a new one right after the old one has sent
the event (which doesn't mean that it was destroyed,
just that the last process left the space)
> >>- Is a completely disjoint pid space acceptable to anyone?
> >yes, as long as the aforementioned access, management
> >and control mechanisms are in place ...
> then it is not disjoint? :)
well, it really depends; disjoint is something which
refers to the existing interfaces, no? Otherwise
disjoint spaces are not accepted by any party, as
we all agreed that management and backdoors seem
essential ...
> >>- What should the parent of pid == 1 see?
> >doesn't really matter, but I see three options there:
> >
> > - the parent space
> > - the child space
> > - both
> but should the parent see the pspace init? Only one task from the pspace?
as I'd like to have the option of getting rid of
this parent completely, I do not really care what it
sees or not :)
> >>- Should we be able to monitor a pid space from the outside?
> >yes, definitely, but it could happen via some special
> >interfaces, i.e. no need to make it compatible
> disagree. Why do we need to introduce copies of existing syscalls?
> Do you want to fix all the existing apps - ps, top, kill, etc.?
well, they should be pid space aware anyway, so
they will need some change, and the older apps will
only see what they saw before (and will not get
confused)
> How about third party apps?
I don't think we care about third party apps when
adding new kernel functionality, especially not
proprietary ones which cannot be modified easily
> >>- Should we be able to have processes enter a pid space?
> >definitely, without that, the entire VPS concept will
> >not work, folks use the 'admin' backdoors 90% of the
> >time ...
> agreed. Though I don't like the name 'backdoor' :)
> It is just a way to get access to the VPS.
well, it is often a way to get access to the VPS
without the 'owner' of that VPS even knowing, so
IMHO it's a backdoor; normal access would be via sshd
or the console :)
best,
Herbert
> Kirill
>>yes, acceptable.
>>once again, believe me, this is a very much required feature for
>>troubleshooting and management (as Eric likes to talk about
>>maintenance :) )
> IMHO there are certain things which _are_ required
> and others which are nice to have but not strictly
> required, just think "ptrace across pid spaces"
these "nice to have" features often make one solution more usable than
another.
>>>This is to support using pidspaces for vservers, and creating
>>>migrateable sub-pidspaces in each vserver.
>>
>>this doesn't help to create migratable sub-pidspaces.
>>For example, will you share IPCs between your parent and child pid
>>pspaces? If yes, then they won't be migratable;
> well, not the child pspace, but the parent, no?
if IPC objects are shared between them, then they can only be migrated
together.
>>if no, then you need to create fully isolated spaces all the way down,
>>and again you end up with the question of why nested pspaces are
>>required at all?
> because we are not trying to implement a VPS-only
> solution for mainline; we are trying to provide
> building blocks for many different uses, including
> the VPS approach ...
nice! Do you think I'm against building blocks?
No :) I'm just trying to get out of you how this can be used in real
life and how it will work.
Kirill
>>fine, agreed on this finally, same for OpenVZ.
> hey, we have something :)
:)
>>>definitely, we (Linux-VServer) added this some time ago
>>>and it helps to maintain/restart a guest.
>>but why sys_waitpid? We can make it in many other ways,
>
> yes, we currently have a syscall switch command
> to wait for the guest, but, of course, it is
> very similar to the 'normal' unix waitpid()
this is more logically clean to me, since containers/namespaces are not
tasks.
If someone wants more unix-like semantics, he can obtain an fd for
the namespace and call select/poll on it :))))
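(For example, the fd-based variant might look roughly like this from
user space; pspace_open() is hypothetical, only poll() is real:)

	#include <poll.h>

	/* Hypothetical: some interface returning an fd for a pid space. */
	extern int pspace_open(int pspace_id);

	static void wait_for_pspace_empty(int pspace_id)
	{
		struct pollfd pfd;

		pfd.fd = pspace_open(pspace_id);
		pfd.events = POLLHUP;	/* e.g. "last task exited" */
		poll(&pfd, 1, -1);	/* block until the event fires */
	}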
>>And we had issues in OpenVZ where a very fast VPS stop/start could fail
>>due to resources not yet freed.
> this is a design problem. If your design allows
> _more_ than one pid space with the same
> identifier/properties, but with only one active
> and thus reachable space, it is no problem to
> create a new one right after the old one has sent
> the event (which doesn't mean that it was destroyed,
> just that the last process left the space)
see my other email about sockets.
>>How about third party apps?
> I don't think we care about third party apps when
> adding new kernel functionality, especially not
> proprietary ones which cannot be modified easily
Even if we don't take into account proprietary apps, there are too many
open-source control panels, management tools, etc.
So this doesn't look good to me anyhow.
>>agreed. Though I don't like the name 'backdoor' :)
>>It is just a way to get access to the VPS.
> well, it is often a way to get access to the VPS
> without the 'owner' of that VPS even knowing, so
> IMHO it's a backdoor; normal access would be via sshd
> or the console :)
When you have a physical box there are many ways to get access to it
without knowing passwords etc. This is the same.
Kirill
On Mon, Feb 20, 2006 at 05:34:33PM +0300, Kirill Korotaev wrote:
>>> yes, acceptable.
>>> once again, believe me, this is a very much required feature for
>>> troubleshooting and management (as Eric likes to talk about
>>> maintenance :) )
>
>> IMHO there are certain things which _are_ required
>> and others which are nice to have but not strictly
>> required, just think "ptrace across pid spaces"
> these "nice to have" features often make one solution more usable than
> another.
agreed, but do you really have to strace the OpenVZ
tools so often? I don't see any other purpose
for cross-space strace; usually you strace either
inside or outside the space, as the transition is
not that well defined anyway ...
>>>> This is to support using pidspaces for vservers, and creating
>>>> migrateable sub-pidspaces in each vserver.
>>> this doesn't help to create migratable sub-pidspaces.
>>> For example, will you share IPCs between your parent and child pid
>>> pspaces? If yes, then they won't be migratable;
>> well, not the child pspace, but the parent, no?
> if IPC objects are shared between them, then they can only be migrated
> together.
not all spaces will be done for the purpose of
migration; some of them might have other purposes
>>> if no, then you need to create fully isolated spaces all the way
>>> down, and again you end up with the question of why nested pspaces
>>> are required at all?
>> because we are not trying to implement a VPS-only
>> solution for mainline; we are trying to provide
>> building blocks for many different uses, including
>> the VPS approach ...
> nice! Do you think I'm against building blocks? No :)
> I'm just trying to get out of you how this can be used in real
> life and how it will work.
ah, I'm glad to share my experience here ...
the linux-vserver.org wiki shows some of those
real world usage scenarios, which include, but are
not limited to:
- Administrative Separation
- Service Separation
- Enhancing Security (better chroot)
- Easy Maintenance (hardware independence)
- Testing
most application scenarios do not require a complete
VPS or Guest to make them usable; often a single
*-space would suffice to accomplish the goal ...
best,
Herbert
> Kirill
On Mon, Feb 20, 2006 at 05:44:47PM +0300, Kirill Korotaev wrote:
>> >fine, agreed on this finally, same for OpenVZ.
>> hey, we have something :)
> :)
>>>> definitely, we (Linux-VServer) added this some time ago
>>>> and it helps to maintain/restart a guest.
>>> but why sys_waitpid? We can make it in many other ways,
>> yes, we currently have a syscall switch command
>> to wait for the guest, but, of course, it is
>> very similar to the 'normal' unix waitpid()
> this is more logically clean to me, since containers/namespaces are
> not tasks. If someone wants more unix-like semantics, he can
> obtain an fd for the namespace and call select/poll on it :))))
well, I'm neither for nor against a separate syscall
here, don't get me wrong, but it will take ages to
get the 'new' syscalls added to all archs, and the
arch maintainers will probably have a dozen reasons
why this particular syscall is completely wrong :)
>>> And we had issues in OpenVZ where a very fast VPS stop/start could
>>> fail due to resources not yet freed.
>> this is a design problem. If your design allows
>> _more_ than one pid space with the same
>> identifier/properties, but with only one active
>> and thus reachable space, it is no problem to
>> create a new one right after the old one has sent
>> the event (which doesn't mean that it was destroyed,
>> just that the last process left the space)
> see my other email about sockets.
which one?
[ some context lost here ]
>>> How about third party apps?
>> I don't think we care about third party apps when
>> adding new kernel functionality, especially not
>> proprietary ones which cannot be modified easily
> Even if we don't take into account proprietary apps, there are too many
> open-source control panels, management tools, etc. So this doesn't look
> good to me anyhow.
well, they would see exactly the same as before, not
more and not less; new features will require new tools
and/or adaptations to the old ones. Period.
>>> agreed. Though I don't like the name 'backdoor' :)
>>> It is just a way to get access to the VPS.
>> well, it is often a way to get access to the VPS
>> without the 'owner' of that VPS even knowing, so
>> IMHO it's a backdoor; normal access would be via sshd
>> or the console :)
> When you have a physical box there are many ways to get access to it
> without knowing passwords etc. This is the same.
does that change what it is, a backdoor circumventing
established security? I don't think so ...
best,
Herbert
> Kirill
On Mon, 2006-02-20 at 12:13 +0300, Kirill Korotaev wrote:
> > - Should a process have some sort of global (on the machine) identifier?
> yep. Otherwise it is impossible to manage (ptrace, kill, ...) it without
> introducing new syscalls.
Why is introducing syscalls so bad? Does anybody have a list of exactly
how many we would need if we added some kind of container argument?
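(Hypothetically, such a list might start out like this - none of these
syscalls exist, they only illustrate the shape of the question:)

	#include <sys/types.h>

	/* Hypothetical pspace-qualified variants of pid-taking syscalls. */
	long pspace_kill(int pspace_id, pid_t pid, int sig);
	long pspace_ptrace(int pspace_id, long request, pid_t pid,
			   void *addr, void *data);
	long pspace_waitpid(int pspace_id, pid_t pid, int *status, int options);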
-- Dave
On Mon, 2006-02-20 at 12:54 +0300, Kirill Korotaev wrote:
> A VPS has reached its process limit and you can't enter it.
> If you suggest making enter work without resource limitations, then it
> will be a security hole.
I think the question is:
	Can or should an administrative process be able to do things
	inside of a container, without being subject to that
	container's resource limitations?
Implementation-wise, I'm sure we _can_ do something like that. We
simply have to make sure that when processes are entering containers,
they are subject to the originating container's resource limits, not the
destination's.
Could you explain why this is a security hole?
-- Dave
Kirill Korotaev wrote:
> I think the first thing we have to do is not to decide which pids we
> want to see, but what and how we want to virtualize.
No, let's not even decide on that :).
I think where we've come to is that there is no *important* difference
between virtualising on a per-process or a per-process-family basis; so
long as you can suitably arrange arbitrary families, it is equivalent to
the "pure" method of strict per-process virtualisation that Eric has been
implementing.
>> Depending on the flags on the XID, we can incorporate all the approaches
>> being tabled. You want virtualised pids? Well, that'll hurt a little,
>> but suit yourself - set a flag on your container and inside the
>> container you get virtualised PIDs. You want a flat view for all your
>> vservers? Fine, just use an XID without the virtualisation flag and
>> with the "all seeing eye" property set. Or you use an XID _with_ the
>> virtualisation flag set, and then call a tuple-endowed API to find the
>> information you're after.
> This sounds good. But pspaces are also used for access controls, so this
> should be incorporated there as well.
Yes, and I'm hoping that with the central structure there it should be
easy to start re-basing the Linux VServer patch as well as openvz and
any other similar technology people have.
Then we can cherry-pick features from any of the 'competing' solutions
in this space.
I have a preliminary patch, and hope to have a public submission this
week.
Sam.