This patchset is a followup to the posting by Serge.
http://marc.theaimsgroup.com/?l=linux-kernel&m=113200410620972&w=2
In this patchset, we provide the pid virtualization mentioned
in Serge's posting.
> I'm part of a project implementing checkpoint/restart processes.
> After a process or group of processes is checkpointed, killed, and
> restarted, the changing of pids could confuse them. There are many
> other such issues, but we wanted to start with pids.
>
> This patchset introduces functions to access task->pid and ->tgid,
> and updates ->pid accessors to use the functions. This is in
> preparation for a subsequent patchset which will separate the kernel
> and virtualized pidspaces. This will allow us to virtualize pids
> from users' pov, so that, for instance, a checkpointed set of
> processes could be restarted with particular pids. Even though their
> kernel pids may already be in use by new processes, the checkpointed
> processes can be started in a new user pidspace with their old
> virtual pid. This also gives vserver a simpler way to fake vserver
> init processes as pid 1. Note that this does not change the kernel's
> internal idea of pids, only what users see.
>
> The first 12 patches change all locations which access ->pid and
> ->tgid to use the inlined functions. The last patch actually
> introduces task_pid() and task_tgid(), and renames ->pid and ->tgid
> to __pid and __tgid to make sure any uncaught users error out.
>
> Does something like this, presumably after much working over, seem
> mergeable?
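For reference, a minimal sketch of the accessors described above; the
names task_pid()/task_tgid() and the __pid/__tgid renaming are from the
posting, while the bodies here are an assumption:

    static inline pid_t task_pid(struct task_struct *tsk)
    {
            return tsk->__pid;      /* renamed from ->pid */
    }

    static inline pid_t task_tgid(struct task_struct *tsk)
    {
            return tsk->__tgid;     /* renamed from ->tgid */
    }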
These patches build on top of Serge's posted patches (if necessary
we can repost them here).
PID Virtualization is based on the concept of a container.
The ultimate goal is to checkpoint/restart containers.
The mechanism to start a container
is to 'echo "container_name" > /proc/container' which creates a new
container and associates the calling process with it. All subsequently
forked tasks then belong to that container.
There is a separate pid space associated with each container.
Only processes/tasks belonging to the same container "see" each other.
The exception is an implied default system container that has
a global view.
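As a rough sketch (every name and field below is a guess, not the
patchset's actual code), the container object and the fork-time
inheritance rule might look like:

    /* hypothetical container object */
    struct container {
            char name[64];                  /* set via /proc/container */
            struct list_head tasks;         /* member tasks */
            /* per-container pid space state would hang off here */
    };

    /* at fork time the child simply inherits the parent's container */
    static void sketch_inherit_container(struct task_struct *child)
    {
            child->container = current->container;
    }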
The following patches accomplish 4 things:
1) identify the locations at the user/kernel boundary where pids and
related ids (pgrp, session ids, ...) need to be (de-)virtualized and
call appropriate (de-)virtualization functions (a sketch follows below).
2) provide the virtualization implementation in these functions.
3) implement a container object and a simple /proc interface to create one.
4) provide a per-container /proc filesystem.
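To make item 1 concrete, here is a hedged sketch of the sort of
conversion helpers involved; the vpid/kpid naming follows the discussion
later in this thread, and the mapping calls are invented for
illustration:

    /* kernel pid -> what userspace in the task's container should see */
    static inline pid_t virt_pid(struct task_struct *tsk)
    {
            return kpid_to_vpid(tsk->container, task_pid(tsk));
    }

    /* pid from userspace -> kernel pid, resolved in the caller's pidspace */
    static inline pid_t real_pid(pid_t vpid)
    {
            return vpid_to_kpid(current->container, vpid);
    }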
-- Hubertus Franke ([email protected])
-- Cedric Le Goater ([email protected])
-- Serge E Hallyn ([email protected])
-- Dave Hansen ([email protected])
On Thu, 15 Dec 2005 09:35:57 EST, Hubertus Franke wrote:
> This patchset is a followup to the posting by Serge.
> http://marc.theaimsgroup.com/?l=linux-kernel&m=113200410620972&w=2
>
> In this patchset, we provide the pid virtualization mentioned
> in Serge's posting.
>
> > I'm part of a project implementing checkpoint/restart processes.
> > After a process or group of processes is checkpointed, killed, and
> > restarted, the changing of pids could confuse them. There are many
> > other such issues, but we wanted to start with pids.
> >
> > This patchset introduces functions to access task->pid and ->tgid,
> > and updates ->pid accessors to use the functions. This is in
> > preparation for a subsequent patchset which will separate the kernel
> > and virtualized pidspaces. This will allow us to virtualize pids
> > from users' pov, so that, for instance, a checkpointed set of
> > processes could be restarted with particular pids. Even though their
> > kernel pids may already be in use by new processes, the checkpointed
> > processes can be started in a new user pidspace with their old
> > virtual pid. This also gives vserver a simpler way to fake vserver
> > init processes as pid 1. Note that this does not change the kernel's
> > internal idea of pids, only what users see.
> >
> > The first 12 patches change all locations which access ->pid and
> > ->tgid to use the inlined functions. The last patch actually
> > introduces task_pid() and task_tgid(), and renames ->pid and ->tgid
> > to __pid and __tgid to make sure any uncaught users error out.
> >
> > Does something like this, presumably after much working over, seem
> > mergeable?
>
> These patches build on top of Serge's posted patches (if necessary
> we can repost them here).
>
> PID Virtualization is based on the concept of a container.
> The ultimate goal is to checkpoint/restart containers.
>
> The mechanism to start a container
> is to 'echo "container_name" > /proc/container' which creates a new
> container and associates the calling process with it. All subsequently
> forked tasks then belong to that container.
> There is a separate pid space associated with each container.
> Only processes/tasks belonging to the same container "see" each other.
> The exception is an implied default system container that has
> a global view.
>
> The following patches accomplish 4 things:
> 1) identify the locations at the user/kernel boundary where pids and
> related ids (pgrp, session ids, ...) need to be (de-)virtualized and
> call appropriate (de-)virtualization functions.
> 2) provide the virtualization implementation in these functions.
> 3) implement a container object and a simple /proc interface to create one.
> 4) provide a per-container /proc filesystem.
>
> -- Hubertus Franke ([email protected])
> -- Cedric Le Goater ([email protected])
> -- Serge E Hallyn ([email protected])
> -- Dave Hansen ([email protected])
I think this is actually quite interesting in a number of ways - it
might be a way of cleanly addressing several current out-of-tree
problems, several of which are independently (occasionally) striving
for mainline adoption: vserver, openvz, cluster checkpoint/restart.
I think perhaps this could also be the basis for a CKRM "class"
grouping as well. Rather than maintaining an independent class
affiliation for tasks, why not have a class devolve (evolve?) into
a "container" as described here. The container provides much of
the same grouping capabilities as a class as far as I can see. The
right information would be available for scheduling and IO resource
management. The memory component of CKRM is perhaps a bit tricky
still, but an overall strategy (can I use that word here? ;-) might
be to use these "containers" as the single intrinsic grouping mechanism
for vserver, openvz, application checkpoint/restart, resource
management, and possibly others?
Opinions, especially from the CKRM folks? This might even be useful
to the PAGG folks as a grouping mechanism, similar to their jobs or
containers.
"This patchset solves multiple problems".
gerrit
On Thu, 2005-12-15 at 11:49 -0800, Gerrit Huizenga wrote:
> I think perhaps this could also be the basis for a CKRM "class"
> grouping as well. Rather than maintaining an independent class
> affiliation for tasks, why not have a class devolve (evolve?) into
> a "container" as described here.
Wasn't one of the grand schemes of CKRM to be able to have application
instances be shared? For instance, running a single DB2, Oracle, or
Apache server, and still accounting for all of the classes separately.
If so, that wouldn't work with a scheme that requires process
separation.
But, sharing the application instances is probably mostly (only)
important for databases anyway. I would imagine that most of the
overhead in a server like an Apache instance is for the page cache for
content, as well as a bit for Apache's executables themselves. The
container schemes should be able to share page cache for both cases.
The main issues would be managing multiple configurations, and the
increased overhead from having more processes around than with a single
server.
There might also be some serious restrictions on containerized
applications. For instance, taking a running application, moving it out
of one container, and into another might not be feasible. Is this
something that is common or desired in the current CKRM framework?
-- Dave
On Thu, 15 Dec 2005 12:02:41 PST, Dave Hansen wrote:
> On Thu, 2005-12-15 at 11:49 -0800, Gerrit Huizenga wrote:
> > I think perhaps this could also be the basis for a CKRM "class"
> > grouping as well. Rather than maintaining an independent class
> > affiliation for tasks, why not have a class devolve (evolve?) into
> > a "container" as described here.
>
> Wasn't one of the grand schemes of CKRM to be able to have application
> instances be shared? For instance, running a single DB2, Oracle, or
> Apache server, and still accounting for all of the classes separately.
> If so, that wouldn't work with a scheme that requires process
> separation.
Yes, it is. However, that may be a sub-case where a single, large
server application actually jumps around from container to container.
I consider that a detail (well, our DB2 folks don't but I'm all for
solving one problem at a time ;-) and we can work some of that out
later. They are less concerned about the application being shared
or part of multiple "classes" simultaneously, as opposed to being
appropriately resource constrained based on the (large) transactions
that they are handling on behalf of a user. So, if it were possible
to jump from one container to another dynamically, then the appropriate
resource management stuff could be handled at some other level.
> There might also be some serious restrictions on containerized
> applications. For instance, taking a running application, moving it out
> of one container, and into another might not be feasible. Is this
> something that is common or desired in the current CKRM framework?
Desired, but primarily for large server applications. And, I don't
think I see much in this patch set that makes that infeasible. If
containers are going to work, you are going to have to have a mechanism
to get applications into them and to move them anyway, right? While
it would be nice if that were dirt-cheap, if it isn't, applications
may have to adapt their usage of them based on the cost. Not a big
deal as I see it.
gerrit
On Thu, 2005-12-15 at 11:49 -0800, Gerrit Huizenga wrote:
> On Thu, 15 Dec 2005 09:35:57 EST, Hubertus Franke wrote:
> > PID Virtualization is based on the concept of a container.
> > The ultimate goal is to checkpoint/restart containers.
> >
> > The mechanism to start a container
> > is to 'echo "container_name" > /proc/container' which creates a new
> > container and associates the calling process with it. All subsequently
> > forked tasks then belong to that container.
> > There is a separate pid space associated with each container.
> > Only processes/tasks belonging to the same container "see" each other.
> > The exception is an implied default system container that has
> > a global view.
> >
> > The following patches accomplish 4 things:
> > 1) identify the locations at the user/kernel boundary where pids and
> > related ids (pgrp, session ids, ...) need to be (de-)virtualized and
> > call appropriate (de-)virtualization functions.
> > 2) provide the virtualization implementation in these functions.
> > 3) implement a container object and a simple /proc interface to create one.
> > 4) provide a per-container /proc filesystem.
> >
> > -- Hubertus Franke ([email protected])
> > -- Cedric Le Goater ([email protected])
> > -- Serge E Hallyn ([email protected])
> > -- Dave Hansen ([email protected])
>
> I think this is actually quite interesting in a number of ways - it
> might be a way of cleanly addressing several current out-of-tree
> problems, several of which are independently (occasionally) striving
> for mainline adoption: vserver, openvz, cluster checkpoint/restart.
Indeed the entire set might be able to benefit from pid
virtualization. I think we are quite open to embracing a larger set of
applications of pid virtualization.
> I think perhaps this could also be the basis for a CKRM "class"
> grouping as well. Rather than maintaining an independent class
> affiliation for tasks, why not have a class devolve (evolve?) into
> a "container" as described here. The container provides much of
> the same grouping capabilities as a class as far as I can see. The
> right information would be available for scheduling and IO resource
> management. The memory component of CKRM is perhaps a bit tricky
> still, but an overall strategy (can I use that word here? ;-) might
> be to use these "containers" as the single intrinsic grouping mechanism
> for vserver, openvz, application checkpoint/restart, resource
> management, and possibly others?
>
> Opinions, especially from the CKRM folks? This might even be useful
> to the PAGG folks as a grouping mechanism, similar to their jobs or
> containers.
>
Not being too alien to the CKRM concept, yes, there is some nice synergy
here, as well as with PAGG and SGI's jobs. CKRM provides resource
constraints and runtime enforcement based on some grouping of
processes. Similar to containers, class membership is inherited (if
that's still the case from the last time I looked at it) until explicitly
changed. Containers in particular provide another dimension,
namely the ability to constrain "visibility" of resources and objects,
in this particular case pids as the first resource used.
> "This patchset solves multiple problems".
> gerrit
>
--
Hubertus Franke <[email protected]>
On Thu, 2005-12-15 at 12:02 -0800, Dave Hansen wrote:
> On Thu, 2005-12-15 at 11:49 -0800, Gerrit Huizenga wrote:
> > I think perhaps this could also be the basis for a CKRM "class"
> > grouping as well. Rather than maintaining an independent class
> > affiliation for tasks, why not have a class devolve (evolve?) into
> > a "container" as described here.
>
> Wasn't one of the grand schemes of CKRM to be able to have application
> instances be shared? For instance, running a single DB2, Oracle, or
> Apache server, and still accounting for all of the classes separately.
> If so, that wouldn't work with a scheme that requires process
> separation.
f-series CKRM manages tasks via the task struct -- this means it
manages each thread and not a process. Since, generally speaking, each
thread is assigned the same class as the main thread, this effectively
manages processes. So yes, separate DB2, Oracle, Apache, etc. threads
could be assigned to different classes. This is definitely something a
strict container could not do.
> But, sharing the application instances is probably mostly (only)
> important for databases anyway. I would imagine that most of the
<nit>
I wouldn't say only for databases. Human-interaction-bound processes can
share instances (gnome-terminal). Granted, these probably would never
need to span a container or a class...
</nit>
> overhead in a server like an Apache instance is for the page cache for
> content, as well as a bit for Apache's executables themselves. The
> container schemes should be able to share page cache for both cases.
> The main issues would be managing multiple configurations, and the
> increased overhead from having more processes around than with a single
> server.
>
> There might also be some serious restrictions on containerized
> applications. For instance, taking a running application, moving it out
> of one container, and into another might not be feasible. Is this
> something that is common or desired in the current CKRM framework?
>
> -- Dave
Yes, being able to move a process from one class to another is
important. This can happen as a consequence of the system administrator
deciding to change the distribution of resources without having to
restart services. The change in distribution can be done by changing
shares of a class, manually moving processes between classes, by making
or deleting classes, or a combination of these operations.
Cheers,
-Matt Helsley
On Thu, 2005-12-15 at 11:49 -0800, Gerrit Huizenga wrote:
> On Thu, 15 Dec 2005 09:35:57 EST, Hubertus Franke wrote:
> > This patchset is a followup to the posting by Serge.
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=113200410620972&w=2
> >
> > In this patchset, we provide the pid virtualization mentioned
> > in Serge's posting.
> >
> > > I'm part of a project implementing checkpoint/restart processes.
> > > After a process or group of processes is checkpointed, killed, and
> > > restarted, the changing of pids could confuse them. There are many
> > > other such issues, but we wanted to start with pids.
> > >
> > > This patchset introduces functions to access task->pid and ->tgid,
> > > and updates ->pid accessors to use the functions. This is in
> > > preparation for a subsequent patchset which will separate the kernel
> > > and virtualized pidspaces. This will allow us to virtualize pids
> > > from users' pov, so that, for instance, a checkpointed set of
> > > processes could be restarted with particular pids. Even though their
> > > kernel pids may already be in use by new processes, the checkpointed
> > > processes can be started in a new user pidspace with their old
> > > virtual pid. This also gives vserver a simpler way to fake vserver
> > > init processes as pid 1. Note that this does not change the kernel's
> > > internal idea of pids, only what users see.
> > >
> > > The first 12 patches change all locations which access ->pid and
> > > ->tgid to use the inlined functions. The last patch actually
> > > introduces task_pid() and task_tgid(), and renames ->pid and ->tgid
> > > to __pid and __tgid to make sure any uncaught users error out.
> > >
> > > Does something like this, presumably after much working over, seem
> > > mergeable?
> >
> > These patches build on top of Serge's posted patches (if necessary
> > we can repost them here).
> >
> > PID Virtualization is based on the concept of a container.
> > The ultimate goal is to checkpoint/restart containers.
> >
> > The mechanism to start a container
> > is to 'echo "container_name" > /proc/container' which creates a new
> > container and associates the calling process with it. All subsequently
> > forked tasks then belong to that container.
> > There is a separate pid space associated with each container.
> > Only processes/tasks belonging to the same container "see" each other.
> > The exception is an implied default system container that has
> > a global view.
<snip>
> I think perhaps this could also be the basis for a CKRM "class"
> grouping as well. Rather than maintaining an independent class
> affiliation for tasks, why not have a class devolve (evolve?) into
> a "container" as described here. The container provides much of
> the same grouping capabilities as a class as far as I can see. The
> right information would be available for scheduling and IO resource
> management. The memory component of CKRM is perhaps a bit tricky
> still, but an overall strategy (can I use that word here? ;-) might
> be to use these "containers" as the single intrinsic grouping mechanism
> for vserver, openvz, application checkpoint/restart, resource
> management, and possibly others?
>
> Opinions, especially from the CKRM folks? This might even be useful
> to the PAGG folks as a grouping mechanism, similar to their jobs or
> containers.
>
> "This patchset solves multiple problems".
>
> gerrit
CKRM classes seem too different from containers to merge the two
concepts:
- Classes don't assign class-unique pids to tasks.
- Tasks can move between classes.
- Tasks move between classes without any need for checkpoint/restart.
- Classes show up in a filesystem interface rather than using a file
in /proc to create them. (trivial interface difference)
- There are no "visibility boundaries" to enforce between tasks in
different classes.
- Classes are hierarchical.
- Unless I am mistaken, a container groups processes (Can one thread run
in container A and another in container B?) while a class groups tasks.
Since a task represents a thread or a process one thread could be in
class A and another in class B.
Cheers,
-Matt Helsley
On Thu, 15 Dec 2005 18:20:52 PST, Matt Helsley wrote:
> On Thu, 2005-12-15 at 11:49 -0800, Gerrit Huizenga wrote:
> > On Thu, 15 Dec 2005 09:35:57 EST, Hubertus Franke wrote:
> > > PID Virtualization is based on the concept of a container.
> > > The ultimate goal is to checkpoint/restart containers.
> > >
> > > The mechanism to start a container
> > > is to 'echo "container_name" > /proc/container' which creates a new
> > > container and associates the calling process with it. All subsequently
> > > forked tasks then belong to that container.
> > > There is a separate pid space associated with each container.
> > > Only processes/tasks belonging to the same container "see" each other.
> > > The exception is an implied default system container that has
> > > a global view.
>
> <snip>
>
> > I think perhaps this could also be the basis for a CKRM "class"
> > grouping as well. Rather than maintaining an independent class
> > affiliation for tasks, why not have a class devolve (evolve?) into
> > a "container" as described here. The container provides much of
> > the same grouping capabilities as a class as far as I can see. The
> > right information would be available for scheduling and IO resource
> > management. The memory component of CKRM is perhaps a bit tricky
> > still, but an overall strategy (can I use that word here? ;-) might
> > be to use these "containers" as the single intrinsic grouping mechanism
> > for vserver, openvz, application checkpoint/restart, resource
> > management, and possibly others?
> >
> > Opinions, especially from the CKRM folks? This might even be useful
> > to the PAGG folks as a grouping mechanism, similar to their jobs or
> > containers.
> >
> > "This patchset solves multiple problems".
> >
> > gerrit
>
> CKRM classes seem too different from containers to merge the two
> concepts:
I agree that the implementations of pid virtualization and classes have
different characteristics. However, you bring up interesting points
about the differences... But I question whether or not they are
relevant to an implementation of resource management. I'm going out
on a limb here looking at a possibly radical change which might
simplify things so there is only one grouping mechanism in the kernel.
I could be wrong but...
> - Classes don't assign class-unique pids to tasks.
What part of this is important to resource management? A container
ID is like a class ID. Yes, I think container ID's are assigned to
processes rather than tasks, but is that really all that important?
> - Tasks can move between classes.
In the pid virtualization, I would think that tasks can move between
containers as well, although it isn't all that useful for most things.
For instance, checkpoint/restart needs to checkpoint a process and all
of its threads if it wants to restart it. So there may be restrictions
on what you can checkpoint/restart. Vserver probably wants isolation
at a process boundary, rather than a task boundary. Most resource
management, e.g. Java, probably doesn't care about task vs. process.
> - Tasks move between classes without any need for checkpoint/restart.
That *should* be possible with a generalized container solution.
For instance, just like with classes, you have to move things into
containers in the first place. And, you could in theory have a classification
engine that helped choose which container to put a task/process in
at creation/instantiation/significant event...
> - Classes show up in a filesystem interface rather than using a file
> in /proc to create them. (trivial interface difference)
Yep - there will probably be a /proc or /configfs interface to containers
at some point, I would expect. No significant difference there.
> - There are no "visibility boundaries" to enforce between tasks in
> different classes.
Are there in virtualized pids? There *can* be - e.g. ps can distinguish,
but it is possible for tasks to interact across container boundaries.
Not ideal for vserver, checkpoint/restart, for instance (makes c/r a
little harder or more limited - signals heading outside the container
may "disappear" when you checkpoint/restart but for apps that c/r, that
probably isn't all that likely).
> - Classes are hierarchical.
Conceptually they are. But are they in the CKRM f series? I thought
that was one area for simplification. And, how important is that *really*
for most applications?
> - Unless I am mistaken, a container groups processes (Can one thread run
> in container A and another in container B?) while a class groups tasks.
> Since a task represents a thread or a process one thread could be in
> class A and another in class B.
Definitely useful, and one question is whether pid virtualization is
container isolation, or simply virtualization to enable container
isolation. If it is an enabling technology, perhaps it doesn't have
that restriction and could be used either way based on resource management
needs or based on vserver or c/r needs...
Debate away... ;-)
gerrit
On Thu, 2005-12-15 at 19:28 -0800, Gerrit Huizenga wrote:
> In the pid virtualization, I would think that tasks can move between
> containers as well,
I don't think tasks can not be permitted to move between containers. As
a simple exercise, imagine that you have two processes with the same
pid, one in container A and one in container B. You wish to have them
both run in container A. They can't both have the same pid. What do
you do?
I've been talking a lot lately about how important filesystem isolation
between containers is to implement containers properly. Isolating the
filesystem namespaces makes it much easier to do things like fs-based
shared memory during a checkpoint/resume. If we want to allow tasks to
move around, we'll have to throw out this entire concept. That means
that a _lot_ of things get a notch closer to the too-costly-to-implement
category.
-- Dave
On Fri, 16 Dec 2005 09:35:19 PST, Dave Hansen wrote:
> On Thu, 2005-12-15 at 19:28 -0800, Gerrit Huizenga wrote:
> > In the pid virtualization, I would think that tasks can move between
> > containers as well,
>
> I don't think tasks can not be permitted to move between containers. As
> a simple exercise, imagine that you have two processes with the same
> pid, one in container A and one in container B. You wish to have them
> both run in container A. They can't both have the same pid. What do
> you do?
>
> I've been talking a lot lately about how important filesystem isolation
> between containers is to implement containers properly. Isolating the
> filesystem namespaces makes it much easier to do things like fs-based
> shared memory during a checkpoint/resume. If we want to allow tasks to
> move around, we'll have to throw out this entire concept. That means
> that a _lot_ of things get a notch closer to the too-costly-to-implement
> category.
Interesting... So how do tasks get *into* a container? And can they
ever get back "out" of a container? Are most processes on the system
initially not in a container? And then they can be stuffed in a container?
And then containers can be moved around or be isolated from each other?
And, is pid virtualization the point where this happens? Or is that
a slightly higher level? In other words, is pid virtualization the
full implementation of container isolation? Or is it a significant
element on which additional policy, restrictions, and usage models
can be built?
gerrit
On Fri, 2005-12-16 at 12:45 -0800, Gerrit Huizenga wrote:
> Interesting... So how do tasks get *into* a container?
Only by inheritance.
> And can they ever get back "out" of a container?
No. Think of the pids again. Even the "outside" of a container, things
like the real init, have to have unique pids. What if the process's pid
is the same as one in use in the default container?
> Are most processes on the system
> initially not in a container? And then they can be stuffed in a container?
> And then containers can be moved around or be isolated from each other?
The current idea is that processes are assigned at fork-time. The
isolation is for the lifetime of the process.
> And, is pid virtualization the point where this happens? Or is that
> a slightly higher level? In other words, is pid virtualization the
> full implementation of container isolation? Or is it a significant
> element on which additional policy, restrictions, and usage models
> can be built?
pid virtualization is simply the one that's easiest to understand, and
the one that demonstrates the largest number of issues. It is a small
piece of the puzzle, but an important one.
-- Dave
On Fri, 2005-12-16 at 13:10 -0800, Dave Hansen wrote:
> On Fri, 2005-12-16 at 12:45 -0800, Gerrit Huizenga wrote:
> > Interesting... So how do tasks get *into* a container?
>
> Only by inheritance.
That is only true today. There is no reason (other than introducing
some heavy code complexity - I haven't thought that through)
why we can't at some point move a process group/tree into a container.
The reason for this is that for the global container V=R in pid space
terms (read: vpid == realpid). Moving an entire group into a container
requires assigning new kernel pids to each task, while keeping the
vpid part constant. Lots of kpid-related references, though...
Don't know whether that's worth the trouble, particularly at this stage.
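A hedged sketch of that operation (every helper name here is invented;
the invariant is the one just described - the vpid stays constant while
the kernel pid is reallocated):

    /* move one task of a process tree from the global container into c */
    static int sketch_move_task(struct task_struct *tsk, struct container *c)
    {
            pid_t vpid = virt_pid(tsk);     /* V=R, so vpid == old kpid */
            pid_t kpid = alloc_kpid();      /* fresh kernel pid */

            detach_kpid(tsk);               /* drop the old kpid references */
            tsk->container = c;
            return attach_kpid(tsk, c, vpid, kpid);  /* vpid unchanged */
    }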
>
> > And can they ever get back "out" of a container?
>
> No. Think of the pids again. Even the "outside" of a container, things
> like the real init, have to have unique pids. What if the process's pid
> is the same as one in use in the default container?
Correct... look at my answer above: moving from global to container can
be accomplished because in a fresh container all pids are available, so
we can simply reoccupy the same vpids in the new pidspace. This keeps
all user-level "references" and pid values valid.
The only way we could EVER go back is if we could guarantee that the
pids in the global space are free, hence they would have to be reserved.
NO WAY... particularly if migration is involved later on.
>
> > Are most processes on the system
> > initially not in a container? And then they can be stuffed in a container?
> > And then containers can be moved around or be isolated from each other?
>
> The current idea is that processes are assigned at fork-time. The
> isolation is for the lifetime of the process.
>
> > And, is pid virtualization the point where this happens? Or is that
> > a slightly higher level? In other words, is pid virtualization the
> > full implementation of container isolation? Or is it a significant
> > element on which additional policy, restrictions, and usage models
> > can be built?
>
> pid virtualization is simply the one that's easiest to understand, and
> the one that demonstrates the largest number of issues. It is a small
> piece of the puzzle, but an important one.
>
Ditto..
> -- Dave
>
>
--
Hubertus Franke <[email protected]>
On Fri, 2005-12-16 at 09:35 -0800, Dave Hansen wrote:
> On Thu, 2005-12-15 at 19:28 -0800, Gerrit Huizenga wrote:
> > In the pid virtualization, I would think that tasks can move between
> > containers as well,
>
> I don't think tasks can not be permitted to move between containers. As
> a simple exercise, imagine that you have two processes with the same
> pid, one in container A and one in container B. You wish to have them
> both run in container A. They can't both have the same pid. What do
> you do?
>
Dave, I think you meant "I don't think tasks can <strike>not</strike> be
permitted"...
Anyway, you make the constraints very clear: unless one can guarantee
that the pidspaces don't have any overlaps in vpid usage, there is NO WAY
that we can allow this. Otherwise vpids that have been handed to
userspace (think sys_getpid()) would need to be revoked (think coherence
here). That violates the transparency requirements.
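To make the coherence point concrete, a tiny userspace illustration
(not from the thread, just the usual pattern that makes revocation
impossible):

    #include <signal.h>
    #include <unistd.h>

    int main(void)
    {
            pid_t self = getpid();  /* a vpid, cached by the program */

            /* arbitrarily later, 'self' must still name this task; if
             * the vpid had been revoked or reassigned in the meantime,
             * this probe would target the wrong process */
            return kill(self, 0);
    }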
> I've been talking a lot lately about how important filesystem isolation
> between containers is to implement containers properly. Isolating the
> filesystem namespaces makes it much easier to do things like fs-based
> shared memory during a checkpoint/resume. If we want to allow tasks to
> move around, we'll have to throw out this entire concept. That means
> that a _lot_ of things get a notch closer to the too-costly-to-implement
> category.
>
Not only that: as the example of pids already shows, while on the
surface these might seem like desirable features (particularly since
they came up wrt the CKRM discussion), there are significant technical
limitations to them.
--
Hubertus Franke <[email protected]>
On Fri, 2005-12-16 at 18:47 -0500, Hubertus Franke wrote:
> On Fri, 2005-12-16 at 09:35 -0800, Dave Hansen wrote:
<snip>
> > I've been talking a lot lately about how important filesystem isolation
> > between containers is to implement containers properly. Isolating the
> > filesystem namespaces makes it much easier to do things like fs-based
> > shared memory during a checkpoint/resume. If we want to allow tasks to
> > move around, we'll have to throw out this entire concept. That means
> > that a _lot_ of things get a notch closer to the too-costly-to-implement
> > category.
> >
>
> Not only that: as the example of pids already shows, while on the
> surface these might seem like desirable features (particularly since
> they came up wrt the CKRM discussion), there are significant technical
> limitations to them.
Perhaps merging the container and class process grouping functionality
is not a good idea.
However, I think CKRM could be made minimally consistent with
containers using a few small modifications. I suspect all that is
necessary is:
1) Expanding the pid syntax accepted and reported when accessing the
members file to include an optional container id (a parsing sketch
follows after these examples):
# classify init in container 0 to a class
echo 0:1 >> ${RCFS}/class_foo/members
echo :1 >> ${RCFS}/class_foo/members
# while in container 0 classify init in container 0 to a class
echo 1 >> ${RCFS}/class_foo/members
# while in container 0 classify init in container 3 to a class
echo 3:1 >> ${RCFS}/class_foo/bar_class/members
Then, viewed from container 0, pids would show up as cid:pid:
$ cat ${RCFS}/class_foo/members
0:1
5:2
...
3:4
Processes listing members from within container n would see only the
pid, and only pids in that container.
2) Limiting the pids and container ids accepted as input to the members
file from processes doing classification from within containers:
# classify init in the current container to a class
echo :1 >> ${RCFS}/class_foo/members
echo 1 >> ${RCFS}/class_foo/members
# returns an error when not in container 0
echo 0:1 >> ${RCFS}/class_foo/members
# returns an error when not in container 1
echo 1:1 >> ${RCFS}/class_foo/members
...
(Incidentally, these kinds of details are what I was referring to
earlier in this thread as "visibility boundaries".)
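For illustration only, a hedged sketch of parsing the cid:pid forms
above (this is not rcfs code; the names and error handling are
invented):

    /* accepts "pid", ":pid" (caller's container), and "cid:pid" */
    static int parse_member(const char *buf, int cur_cid, int *cid, int *pid)
    {
            const char *colon = strchr(buf, ':');

            if (!colon || colon == buf)             /* "1" or ":1" */
                    *cid = cur_cid;
            else if (sscanf(buf, "%d", cid) != 1)   /* "3:1" */
                    return -EINVAL;

            if (sscanf(colon ? colon + 1 : buf, "%d", pid) != 1)
                    return -EINVAL;
            return 0;
    }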
I think this would be sufficient to make CKRM and containers play
nicely with each other. I suspect further kernel-enforced constraints
between CKRM and containers may constitute policy and not functionality.
<shameless_plug>I also suspect that with the right userspace
classification engine a wide variety of useful container resource
management policies could be enforced based on these simple
modifications.</shameless_plug>
Cheers,
-Matt Helsley
On Thu, 2005-12-15 at 19:28 -0800, Gerrit Huizenga wrote:
> On Thu, 15 Dec 2005 18:20:52 PST, Matt Helsley wrote:
> > On Thu, 2005-12-15 at 11:49 -0800, Gerrit Huizenga wrote:
> > > On Thu, 15 Dec 2005 09:35:57 EST, Hubertus Franke wrote:
> > > > PID Virtualization is based on the concept of a container.
> > > > The ultimate goal is to checkpoint/restart containers.
> > > >
> > > > The mechanism to start a container
> > > > is to 'echo "container_name" > /proc/container' which creates a new
> > > > container and associates the calling process with it. All subsequently
> > > > forked tasks then belong to that container.
> > > > There is a separate pid space associated with each container.
> > > > Only processes/tasks belonging to the same container "see" each other.
> > > > The exception is an implied default system container that has
> > > > a global view.
> >
> > <snip>
> >
> > > I think perhaps this could also be the basis for a CKRM "class"
> > > grouping as well. Rather than maintaining an independent class
> > > affiliation for tasks, why not have a class devolve (evolve?) into
> > > a "container" as described here. The container provides much of
> > > the same grouping capabilities as a class as far as I can see. The
> > > right information would be available for scheduling and IO resource
> > > management. The memory component of CKRM is perhaps a bit tricky
> > > still, but an overall strategy (can I use that word here? ;-) might
> > > be to use these "containers" as the single intrinsic grouping mechanism
> > > for vserver, openvz, application checkpoint/restart, resource
> > > management, and possibly others?
> > >
> > > Opinions, especially from the CKRM folks? This might even be useful
> > > to the PAGG folks as a grouping mechanism, similar to their jobs or
> > > containers.
> > >
> > > "This patchset solves multiple problems".
> > >
> > > gerrit
> >
> > CKRM classes seem too different from containers to merge the two
> > concepts:
>
> I agree that the implementations of pid virtualization and classes have
> different characteristics. However, you bring up interesting points
> about the differences... But I question whether or not they are
> relevant to an implementation of resource management. I'm going out
> on a limb here looking at a possibly radical change which might
> simplify things so there is only one grouping mechanism in the kernel.
> I could be wrong but...
<snip>
> > - Classes don't assign class-unique pids to tasks.
>
> What part of this is important to resource management? A container
> ID is like a class ID. Yes, I think container IDs are assigned to
> processes rather than tasks, but is that really all that important?
Perhaps you misunderstood my point. Upon inserting a task into a
container you must assign it a pid unique within the container.
Inserting a task into a class requires no analogous operation. While
there is no conflict here, neither is there commonality.
<snip>
> For instance, checkpoint/restart needs to checkpoint a process and all
> of its threads if it wants to restart it. So there may be restrictions
> on what you can checkpoint/restart. Vserver probably wants isolation
> at a process boundary, rather than a task boundary. Most resource
> management, e.g. Java, probably doesn't care about task vs. process.
I really don't see how Java itself is a good example of most resource
management. As I see it, Java tries to present a runtime environment
for applications, and it is the applications that administrators are
concerned with.
A process could allocate different roles to each thread or dole out
uniform pieces of work to each thread. Being able to manage the resource
usage of these threads could be useful -- so while Java may not "care"
about task vs. process an administrator might.
> > - Tasks move between classes without any need for checkpoint/restart.
>
> That *should* be possible with a generalized container solution.
> For instance, just like with classes, you have to move things into
> containers in the first place. And, you could in theory have a classification
> engine that helped choose which container to put a task/process in
> at creation/instantiation/significant event...
Since arbitrary movement (time, source, and destination) is not
possible, the classification analogy does not fit. This is one very big
difference between classes and containers that suggests merging the two
might not be best.
<snip>
> > - There are no "visibility boundaries" to enforce between tasks in
> > different classes.
>
> Are there in virtualized pids? There *can* be - e.g. ps can distinguish,
> but it is possible for tasks to interact across container boundaries.
Right. I didn't say they were entirely invisible to each other. If they
were entirely visible to each other then these boundaries I'm talking
about wouldn't exist and a container would be more similar to a class.
These boundaries are probably delineated in miscellaneous areas of the
kernel like getpid(), kill(), any /proc file that shows a set of pids,
etc. Each of these would have to correctly limit the set of pids
displayed and/or accepted as input.
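For illustration, the kind of check each of those places would need
might look like this (names are hypothetical; the globally-visible
default system container comes from the original posting):

    /* may a viewer in container c see task t at all? */
    static int sketch_pid_visible(struct container *c, struct task_struct *t)
    {
            return c == &system_container || t->container == c;
    }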
A CKRM class on the other hand has no such boundaries to present to
userspace and hence does not alter code in such diverse places. I think
this is a consequence of the fact it doesn't virtualize resources for
the purposes of checkpoint/restart (esp. well-known and user-visible
resources like pids, filehandles, etc).
<snip>
> > - Classes are hierarchical.
>
> Conceptually they are. But are they in the CKRM f series? I thought
> that was one area for simplification. And, how important is that *really*
> for most applications?
Hierarchy still exists in the f-series. It's something Chandra has been
considering removing in order to simplify the code. I think hierarchy
offers a chance for administrators to better organize their classes. I
think the goal should be to enable administrators to let users manage a
class and/or subclasses of their own -- though implementing rcfs via
configfs limits config items to root currently. Perhaps this could be
useful for CKRM inside containers if each container had a virtual root
user id of its own with a corresponding non-zero id in container 0...
> > - Unless I am mistaken, a container groups processes (Can one thread run
> > in container A and another in container B?) while a class groups tasks.
> > Since a task represents a thread or a process one thread could be in
> > class A and another in class B.
>
> Definitely useful, and one question is whether pid virtualization is
Above you suggested that most resource management ("e.g. Java") doesn't
care about process vs. threads. Here you say it could be useful.
> container isolation, or simply virtualization to enable container
> isolation. If it is an enabling technology, perhaps it doesn't have
> that restriction and could be used either way based on resource management
> needs or based on vserver or c/r needs...
I thought that the point of pid virtualization was to enable
checkpoint/restart and that, as a consequence, moving processes to other
containers is impossible.
> Debate away... ;-)
>
> gerrit
The strongest dissimilarity between the two I can see is the lack of
task movement between containers. The core similarity is the ability to
group. However, they don't group quite the same things -- from what I
can see containers group _trees of tasks_ with process (thread group)
granularity while classes group _tasks_ with thread granularity.
At the very least I think we need to know the full extent of isolation
and interaction that are planned/necessary for containers before further
considering any merge proposals.
Cheers,
-Matt Helsley
On Fri, 2005-12-16 at 17:18 -0800, Matt Helsley wrote:
> On Fri, 2005-12-16 at 18:47 -0500, Hubertus Franke wrote:
> > On Fri, 2005-12-16 at 09:35 -0800, Dave Hansen wrote:
> <snip>
> > > I've been talking a lot lately about how important filesystem isolation
> > > between containers is to implement containers properly. Isolating the
> > > filesystem namespaces makes it much easier to do things like fs-based
> > > shared memory during a checkpoint/resume. If we want to allow tasks to
> > > move around, we'll have to throw out this entire concept. That means
> > > that a _lot_ of things get a notch closer to the too-costly-to-implement
> > > category.
> > >
> >
> > Not only that: as the example of pids already shows, while on the
> > surface these might seem like desirable features (particularly since
> > they came up wrt the CKRM discussion), there are significant technical
> > limitations to them.
>
> Perhaps merging the container and class process grouping functionality
> is not a good idea.
>
> However, I think CKRM could be made minimally consistent with
> containers using a few small modifications. I suspect all that is
> necessary is:
>
<snip>
> I think this would be sufficient to make CKRM and containers play
> nicely with each other. I suspect further kernel-enforced constraints
> between CKRM and containers may constitute policy and not functionality.
>
I think that as a first step mutual coexistence is already quite
useful.
Once I containerize applications, having the ability to actually
constrain and manage the resources consumed by that application would
be a real plus. In that sense a container and a CKRM class coincide.
So even enforcing that "alignment" at a higher level, through some
awareness in the classification engine for instance, would be quite
useful. Are they the same kernel object? NO - because of the
life cycle management of a process, namely that once moved into a
container it stays there...
>
> Cheers,
> -Matt Helsley
Prost ...
Hubertus Franke <[email protected]>