--
I'm part of a project implementing checkpoint/restart processes.
After a process or group of processes is checkpointed, killed, and
restarted, the changing of pids could confuse them. There are many
other such issues, but we wanted to start with pids.
This patchset introduces functions to access task->pid and ->tgid,
and updates ->pid accessors to use the functions. This is in
preparation for a subsequent patchset which will separate the kernel
and virtualized pidspaces. This will allow us to virtualize pids
from users' pov, so that, for instance, a checkpointed set of
processes could be restarted with particular pids. Even though their
kernel pids may already be in use by new processes, the checkpointed
processes can be started in a new user pidspace with their old
virtual pid. This also gives vserver a simpler way to fake vserver
init processes as pid 1. Note that this does not change the kernel's
internal idea of pids, only what users see.
The first 12 patches change all locations which access ->pid and
->tgid to use the inlined functions. The last patch actually
introduces task_pid() and task_tgid(), and renames ->pid and ->tgid
to __pid and __tgid to make sure any uncaught users error out.
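The accessors themselves are trivial, something like:

	static inline pid_t task_pid(struct task_struct *task)
	{
		return task->__pid;
	}

	static inline pid_t task_tgid(struct task_struct *task)
	{
		return task->__tgid;
	}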
Does something like this, presumably after much working over, seem
mergeable?
thanks
-serge
How about adding the accessor routines in the first patch (still
referencing task->pid), then doing all the changes as you did, then
renaming task->pid to task->__pid and updating the accessor to that
change, in the last patch? Then it would build all the way through.
Serge wrote:
> The resulting object code seems to be identical in most cases, and is
> actually shorter in cases where current->pid is used twice in a row,
> as it does not dereference task-> twice.
You lost me here. Why does using these accessor routines avoid the
second reference?
Have you crosstool'd built this for most arch's? I could imagine
some piece of code having a local or other struct variable named 'pid'
that would be broken by a mistake in this change. This could be so
whether the change was done by a script, or by hand. Probably need
to test 'allyesconfig' too.
> Note that this does not change the kernel's
> internal idea of pids, only what users see.
How can that be? Doesn't it run all accesses to the task->pid
field through the accessor, regardless of whether it's something
the user will see, or something used within the kernel?
How about other fields holding a pid, such as (one I happen to know
about) kernel/cpuset.c marker_pid? Grep for "pid_t" in include/linux
for other such possible fields. What about other kernel-user interfaces
that deal with pids such as fcntl, msgctl, sched_setaffinity, semop,
shmctl, sigaction, ...
How do you propose to synchronize incoming pid's with these potentially
modified displayed pids? There are many invocations of find_task_by_pid()
in the kernel, typically converting a user provided pid into a task
struct. If doing "kill(getpid(), 1)" in user code didn't sighup
myself, that would be uncool.
How do you intend to use these accessor routines in order to help solve
the problems with checkpoint/restart?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Quoting Paul Jackson ([email protected]):
> How about adding the accessor routines in the first patch (still
> referencing task->pid), then doing all the changes as you did, then
> renaming task->pid to task->__pid and updating the accessor to that
> change, in the last patch? Then it would build all the way through.
Ok, thanks - will send out a new patchset to do this.
>
> Serge wrote:
> > The resulting object code seems to be identical in most cases, and is
> > actually shorter in cases where current->pid is used twice in a row,
> > as it does not dereference task-> twice.
>
> You lost me here. Why does using these accessor routines avoid the
> second reference?
Why, I don't know :) I just looked at the resulting object code, and
the static inline causes task->pid to be dereferenced once and pushed
twice, whereas using task->pid causes it to be dereferenced twice.
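For instance, with two consecutive uses (just an illustration of the
pattern, not the exact code I diffed):

	printk("pid %d, pid again %d\n", task_pid(current), task_pid(current));

the accessor version loads current->pid once and reuses the value, while
the open-coded current->pid version loads it twice.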
> Have you crosstool'd built this for most arch's? I could imagine
No - I've never used crosstool, but will give it a shot.
> > Note that this does not change the kernel's
> > internal idea of pids, only what users see.
>
> How can that be? Doesn't it run all accesses to the task->pid
> field through the accessor, regardless of whether it's something
> the user will see, or something used within the kernel?
Ok... our intent is to not change the kernel pid concept :) Of
course the accessor functions could be coded so as to change it, but
as far as I know we will not do so.
> How about other fields holding a pid, such as (one I happen to know
> about) kernel/cpuset.c marker_pid? Grep for "pid_t" in include/linux
> for other such possible fields. What about other kernel-user interfaces
> that deal with pids such as fcntl, msgctl, sched_setaffinity, semop,
> shmctl, sigaction, ...
Yes, our next patchset will introduce the actual pid to vpid translations
and place those functions in the right place. What I meant to say was
that the kernel pids will still be pids just as they are now. The barrier
between virtual pids and pids does not yet exist in this patchset. This
patchset is to lay the groundwork to make those translations simpler. We
switch task_pid() to task_vpid() in the right places, and use
task_pid_to_vpid() at the barriers between pids and vpids.
> How do you propose to synchronize incoming pid's with these potentially
> modified displayed pids? There are many invocations of find_task_by_pid()
> in the kernel, typically converting a user provided pid into a task
These are replaced with find_task_by_vpid(), which looks for the given
vpid in the pid-space of the calling process.
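The helpers we're planning look roughly like this (exact signatures are
still in flux, and the viewer argument is just my current guess):

	pid_t task_pid_to_vpid(struct task_struct *viewer, pid_t pid);	/* kernel pid -> vpid */
	pid_t task_vpid_to_pid(struct task_struct *viewer, pid_t vpid);	/* vpid -> kernel pid */
	struct task_struct *find_task_by_vpid(pid_t vpid);	/* lookup in caller's pidspace */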
> struct. If doing "kill(getpid(), 1)" in user code didn't sighup
> myself, that would be uncool.
>
> How do you intend to use these accessor routines in order to help solve
> the problems with checkpoint/restart?
Let's say we want to start a process group. Start the first process in
a new pidspace. (Hubertus or Dave can tell us exactly how this will be
done in the first prototype, but I would expect something like

	echo 5 > /proc/$$/childpidspace
	start_my_program	# This forks, and its children remain in pidspace 5

)
Now the processes in pidspace 5 are checkpointed and killed. When they
are restarted, they will create a new pidspace for the group again, and
ask for their checkpointed pids within that pidspace. Their kernel pids
will still be whatever they would normally have been. But when one of the
processes looks for process 10, task_vpid_to_pid(current, 10) will return
the real pid for the vpid 10 in current's pidspace.
thanks,
-serge
Serge wrote:
> Ok, thanks - will send out a new patchset to do this.
This one change probably isn't worth new patches.
At least keep this in mind next time. No biggie
either way.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Serge wrote:
> But when one of the
> processes looks for process 10, task_vpid_to_pid(current, 10) will return
> the real pid for the vpid 10 in current's pidspace.
So a "kill -1 10" will mean different things, depending on the pidspace
that the kill is running in. And pid's passed about between user
tasks as if they were usable system-wide are now aliased by their
invisible pidspace.
Yuck. Such virtualizations usually have a much harder time addressing
the last 10% of situations than they did the easy 90%.
How about instead having a way to put the pid's of checkpointed tasks
in deep freeze, saving them for reuse when the task restarts?
System calls that operate on pid values could error out with some
new errno, -EFROZEN or some such.
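Something like this hypothetical check near the top of each such call
path (pid_is_frozen() is made up here):

	if (pid_is_frozen(pid))		/* pid belongs to a checkpointed task */
		return -EFROZEN;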
This would seem far less invasive. Not just less invasive of the code,
but more importantly, not introducing some never entirely realizable
semantic change to the scope of pids.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Quoting Paul Jackson ([email protected]):
> Serge wrote:
> > But when one of the
> > processes looks for process 10, task_vpid_to_pid(current, 10) will return
> > the real pid for the vpid 10 in current's pidspace.
>
> So a "kill -1 10" will mean different things, depending on the pidspace
> that the kill is running in. And pid's passed about between user
> tasks as if they were usable system-wide are now aliased by their
> invisible pidspace.
>
> Yuck. Such virtualizations usually have a much harder time addressing
> the last 10% of situations than they did the easy 90%.
For simplicity, the only pids a process will see are those in its own
pidspace, and the only controls (I expect) will be the ability to start
a new pidspace, and request a pid. So it is no more complicated than
the vserver model, where a process becomes pid 1 only for other processes
in the same vserver, and processes don't see processes in other vservers -
except that now every process in the pidspace can be known as a different
pid, not just the first.
> How about instead having a way to put the pid's of checkpointed tasks
> in deep freeze, saving them for reuse when the task restarts?
> System calls that operate on pid values could error out with some
> new errno, -EFROZEN or some such.
Unfortunately that would not work for checkpoints across boots, or, more
importantly, for process (set) migration.
> This would seem far less invasive. Not just less invasive of the code,
> but more importantly, not introducing some never entirely realizable
> semantic change to the scope of pids.
Hopefully the next patchset, implementing the pid-vpid split, will show
it's not as complicated as I've made it sound.
Of course, if it remains too complicated a conceptual change to be
mergeable, we're better off knowing that now...
thanks,
-serge
Serge wrote:
> the vserver model
What's that?
> Unfortunately that would not work for checkpoints across boots,
That could be made to work, by an early init script that looked in the
"checkedpointed tasks storage locker" on disk, and reserved any pid's
used therein, marking them as frozen. A little care can ensure that
no task with such a pid is already running.
> or, more importantly, for process (set) migration.
Migration to a system that has already been up a while, where no
reservation of pid's was made ahead of time, hence where pid's overlap,
would not work with my EFROZEN scheme - you are right there.
How large is our numeric pid space (on 64 bit systems, anyway)? If
large enough, then reservation of pid ranges becomes an easy task. If
say we had 10 bits to spare, then a server farm could pre-ordain say a
thousand virtual servers, which come and go on various hardware
systems, each virtual server with its own hostname, pid-range, and
other such paraphernalia.
However there is an elephant in the room, or a camel outside the tent
or some such. Yes, the camel's nose may well fit inside the tent,
but before we invite his nose, we should have some idea if the rest
of the camel will fit.
In other words, since I don't see any compelling reason for this
virtualization of pids -except- for checkpoint/restart sorts of
features, the usefulness of pid virtualization would seem to rest on
the acceptability of the rest of the checkpoint/restart proposal.
For all I know now (not much) the amount of effort required to
sufficiently virtualize all the elements of the Linux kernel-user
interface enough to enable robust job migration across machines and
reboots may well make virtualizing the kernel's address space look easy.
Linux is not VM.
Hence, until sold on the larger direction, I am skeptical of this
first step.
Though, I will grant, I am interested too. A good Linux
checkpoint/restart solution would be valuable.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Quoting Paul Jackson ([email protected]):
> Serge wrote:
> > the vserver model
>
> What's that?
:) Well a vserver pretends to be a full system of its own, though you
can have lots of vservers on one machine. Processes in each virtual
server see only other processes in the same vserver. However in
vserver the pids they see are the real kernel pids - except for one
process per vserver which can be the fakeinit. Other processes in the
same vserver see it as pid 1, but to the kernel it is still known by
its real pid.
> How large is our numeric pid space (on 64 bit systems, anyway)? If
> large enough, then reservation of pid ranges becomes an easy task. If
> say we had 10 bits to spare, then a server farm could pre-ordain say a
> thousand virtual servers, which come and go on various hardware
> systems, each virtual server with its own hostname, pid-range, and
> other such paraphernalia.
In fact this is one way we considered implementing the virtual pids -
the pidspace id would be some number of upper bits of the pid, and the vpid
would be the lower bits, so that the kernel pid would simply be
(pidspace_id << some_shift | vpid).
-serge
Quoting Paul Jackson ([email protected]):
> Have you crosstool'd built this for most arch's? I could imagine
> some piece of code having a local or other struct variable named 'pid'
> that would be broken by a mistake in this change. This could be so
> whether the change was done by a script, or by hand. Probably need
> to test 'allyesconfig' too.
Argh - in fact it appears I compiled and booted my 2.6.14 version,
not this 2.6.15-rc1 version. Another patch is needed for this to
compile and boot (on a power5 system, in addition to a patch pending
for -mm to make rpaphp_pci compile). Sorry.
Signed-off-by: Serge Hallyn <[email protected]>
---
block/cfq-iosched.c | 4 ++--
block/ll_rw_blk.c | 2 +-
kernel/ptrace.c | 2 +-
net/llc/af_llc.c | 2 +-
4 files changed, 5 insertions(+), 5 deletions(-)
Index: linux-2.6.14/kernel/ptrace.c
===================================================================
--- linux-2.6.14.orig/kernel/ptrace.c 2005-11-14 22:52:24.000000000 -0600
+++ linux-2.6.14/kernel/ptrace.c 2005-11-14 22:54:37.000000000 -0600
@@ -155,7 +155,7 @@ int ptrace_attach(struct task_struct *ta
retval = -EPERM;
if (task_pid(task) <= 1)
goto bad;
- if (task->tgid == current->tgid)
+ if (task_tgid(task) == task_tgid(current))
goto bad;
/* the same process cannot be attached many times */
if (task->ptrace & PT_PTRACED)
Index: linux-2.6.14/block/ll_rw_blk.c
===================================================================
--- linux-2.6.14.orig/block/ll_rw_blk.c 2005-11-14 22:52:07.000000000 -0600
+++ linux-2.6.14/block/ll_rw_blk.c 2005-11-14 23:07:51.000000000 -0600
@@ -2925,7 +2925,7 @@ void submit_bio(int rw, struct bio *bio)
if (unlikely(block_dump)) {
char b[BDEVNAME_SIZE];
printk(KERN_DEBUG "%s(%d): %s block %Lu on %s\n",
- current->comm, current->pid,
+ current->comm, task_pid(current),
(rw & WRITE) ? "WRITE" : "READ",
(unsigned long long)bio->bi_sector,
bdevname(bio->bi_bdev,b));
Index: linux-2.6.14/block/cfq-iosched.c
===================================================================
--- linux-2.6.14.orig/block/cfq-iosched.c 2005-11-14 22:52:07.000000000 -0600
+++ linux-2.6.14/block/cfq-iosched.c 2005-11-14 23:08:44.000000000 -0600
@@ -621,7 +621,7 @@ cfq_reposition_crq_rb(struct cfq_queue *
static struct request *cfq_find_rq_rb(struct cfq_data *cfqd, sector_t sector)
{
- struct cfq_queue *cfqq = cfq_find_cfq_hash(cfqd, current->pid, CFQ_KEY_ANY);
+ struct cfq_queue *cfqq = cfq_find_cfq_hash(cfqd, task_pid(current), CFQ_KEY_ANY);
struct rb_node *n;
if (!cfqq)
@@ -1754,7 +1754,7 @@ static void cfq_prio_boost(struct cfq_qu
static inline pid_t cfq_queue_pid(struct task_struct *task, int rw)
{
if (rw == READ || process_sync(task))
- return task->pid;
+ return task_pid(task);
return CFQ_KEY_ASYNC;
}
Index: linux-2.6.14/net/llc/af_llc.c
===================================================================
--- linux-2.6.14.orig/net/llc/af_llc.c 2005-10-27 19:02:08.000000000 -0500
+++ linux-2.6.14/net/llc/af_llc.c 2005-11-14 23:09:44.000000000 -0600
@@ -757,7 +757,7 @@ static int llc_ui_recvmsg(struct kiocb *
if (net_ratelimit())
printk(KERN_DEBUG "LLC(%s:%d): Application "
"bug, race in MSG_PEEK.\n",
- current->comm, current->pid);
+ current->comm, task_pid(current));
peek_seq = llc->copied_seq;
}
continue;
Serge wrote:
> Well a vserver pretends to be a full system of its own
Do you have any references, links?
> In fact this is one way we considered implementing the virtual pids -
No no - not what I meant. I meant to have the pid be the same in all
views, for all time, kernel and user, inside and outside the virtual
server. Just like pids are now. A given virtual server would have
a dedicated range of pids, reserved for it on all hardware systems
in the farm.
You could move the tasks in one such virtual server from one hardware
system to another without having pid collisions because the destination
hardware system would have been reserving those pids all along for
that virtual server.
The additional kernel facilities this would require:
1) For a given task (inherited) designate which pid range to use for
newly forked children.
2) Restart a task into the same pid it had before, which would fail
if that pid was in use by any other task on the system.
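Facility 1) might look something like this rough sketch in the fork path
(pid_low/pid_high are invented names, and locking is ignored):

	static pid_t alloc_pid_in_range(struct task_struct *parent)
	{
		pid_t pid;

		for (pid = parent->pid_low; pid <= parent->pid_high; pid++)
			if (!find_task_by_pid(pid))	/* pid still free? */
				return pid;
		return -EAGAIN;		/* range exhausted */
	}

Facility 2) would amount to forcing one specific pid through the same
path, failing if that pid is already taken.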
Administratively, create named virtual servers, each one assigned
a permanent pid range. On each hardware system, each task would
be managed to be using pids for forking from the pid range for the
virtual server it was running in.
Perhaps each hardware system would have one pid range, say pids
0..2000, overlapping with all other such systems, for tasks specific
to that hardware system. These tasks could not be checkpoint/restarted
on another hardware system.
For example, the virtual server "magnolia" would always have pids
9,000 to 9,999, and all hardware systems in the server farm would
keep this pid range open for running "magnolia", if asked to. If you
saw a task's pid was 9,543, then you would immediately know it ran
on virtual server "magnolia."
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Quoting Paul Jackson ([email protected]):
> Serge wrote:
> > Well a vserver pretends to be a full system of its own
>
> Do you have any references, links?
Oh - linux-vserver.org
> > In fact this is one way we considered implementing the virtual pids -
>
> No no - not what I meant. I meant to have the pid be the same in all
> views, for all time, kernel and user, inside and outside the virtual
> server. Just like pids are now. A given virtual server would have
> a dedicated range of pids, reserved for it on all hardware systems
> in the farm.
But in the end isn't that more complicated than our approach? The
kernel still needs to be modified to let processes request their pids,
and now processes have to worry *always* about the value or range of
their pids, both at startup and restart. In the pidspace approach,
processes simply have a concept of starting a new pidspace, after
which the rest of the system processes are effectively gone as far as
this pidspace is concerned, and, other than that, processes continue
as normal. Upon restart, they do have to reclaim their vpids, either
from userspace, or through in-kernel restart code.
-serge
> But in the end isn't that more complicated than our approach? The
> kernel still needs to be modified to let processes request their pids,
No - getpid() works, as always. Perhaps I don't understand your
comment.
> and now processes have to worry *always* about the value or range of
> their pids, both at startup and restart.
No - tasks get the pid the kernel gives them at fork, as always.
The task keeps that exact same pid, across all checkpoints, restarts
and migrations. Nothing that the application process has to worry
about, either inside the kernel code or in userspace, beyond the fork
code honoring the assigned pid range when allocating a new pid.
No widespread kernel code change, compared to yours. As now, tasks
have a pid field, and that pid is the same value, system-wide.
An additional per-task attribute, set by a batch manager typically
when it starts a job on a checkpointable, restartable, movable
"virtual server" connects the job parent task, and hence all its
descendents in that job, with a named kernel object that has among its
attributes a pid range. When fork is handing out new pids, it honors
that pid range. User level code, in the batch manager or system
administration layer manages a set of these named virtual servers,
including assigning pid ranges to them, and puts what is usually the
same such configuration on each server in the farm.
There will likely be other system-wide or job-wide name spaces and
associated resources that will need to be preserved across these
operations, such as shared memory, ipc, sockets, tmp files, signals,
locking, shared file descriptors, process tree, permissions, ulimits,
accounting, ... For each system-wide namespace, give each virtual
server a dedicated portion of that space, the same across all servers
in the farm. Where those names are kernel assigned, such as pids,
teach the kernel to assign within the specified portion, such as the
assigned pid range.
The real complexity comes, I claim, from changing the pid from a
system-wide name space to a partially per-job namespace. You can
never do that conversion entirely and will always have confusions
around the edges, as pids relative to one virtual server are used,
incorrectly, in the environment of another virtual server or system
wide.
The difficulty of things is best not measured by the effort to
do the first 90%, but by the effort to do the last 10%. And when
trying to reconcile two irreconcilable concepts, that last 10% can
never be completed.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On Tue, 2005-11-15 at 01:06 -0800, Paul Jackson wrote:
> No - tasks get the pid the kernel gives them at fork, as always.
> The task keeps that exact same pid, across all checkpoints, restarts
> and migrations. Nothing that the application process has to worry
> about, either inside the kernel code or in userspace, beyond the fork
> code honoring the assigned pid range when allocating a new pid.
The main issues I worry about with such a static allocation scheme are
getting the allocation patterns right, without causing new restrictions
on the containers. This kind of scheme is completely thrown out the
window if someone wanted to start a process on their disconnected laptop
and later migrate it to another machine when they connect back up to the
network.
> The real complexity comes, I claim, from changing the pid from a
> system-wide name space to a partially per-job namespace. You can
> never do that conversion entirely and will always have confusions
> around the edges, as pids relative to one virtual server are used,
> incorrectly, in the environment of another virtual server or system
> wide.
You're basically concerned about pids "leaking" across containers, and
confusing applications in the process? That's a pretty valid concern.
However, the long-term goal here is to virtualize more than pids. As
you noted, this will include things like shm ids. Yes, I worry that
we'll end up modifying a _ton_ of stuff in the process of doing this.
As for passing confusing pids from different namespaces in the
filesystem, like in /var/run, there are solutions in the pipeline.
Private namespaces and versioned filesystems should be able to cope with
this kind of isolation very nicely.
-- Dave
On Mon, Nov 14, 2005 at 03:23:41PM -0600, Serge E. Hallyn wrote:
> --
>
> I'm part of a project implementing checkpoint/restart processes.
> After a process or group of processes is checkpointed, killed, and
> restarted, the changing of pids could confuse them. There are many
> other such issues, but we wanted to start with pids.
Can't you just build a restart preloader which intercepts system calls
and translates pids? Wouldn't this keep the kernel simpler and only
affect those applications that are being restarted? Christoph, I
added you since you seem to tirelessly promote using preloaders to
work around this type of issue. Is it possible?
Thanks,
Robin
On Tue, Nov 15, 2005 at 01:06:24AM -0800, Paul Jackson wrote:
> No - tasks get the pid the kernel gives them at fork, as always.
> The task keeps that exact same pid, across all checkpoints, restarts
> and migrations. Nothing that the application process has to worry
> about, either inside the kernel code or in userspace, beyond the fork
> code honoring the assigned pid range when allocating a new pid.
Paul, this approach seems very risky at best. How do you checkpoint,
stop, reboot the system, and restart? Does the system recall that a
checkpoint occurred and then reserve those pids from early in boot? What about
a checkpointed task which is completed on a different system? How is
that handled? What if the checkpointed and terminated task is deemed
not worth restarting, how do you inform the system to reuse the pids.
Just seems like a hornets nest.
I would think _for_pids_and_not_everything, the checkpoint could write
out the core file and a restart file. The restart file would contain the
pid related information. Then the restart tool could use a preloader to
intercept system calls that specify pids and translate pids from old to
new. It might not be as easy as using the kernel, but it makes some sense
from my limited point of view and makes the kernel code less polluted.
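Very roughly, the preloader would intercept each pid-returning or
pid-taking call; here is a sketch for getpid(), where translate_pid()
and the restart-file format are made up for illustration:

	#define _GNU_SOURCE
	#include <dlfcn.h>
	#include <sys/types.h>
	#include <unistd.h>

	extern pid_t translate_pid(pid_t kernel_pid);	/* consults the restart file */

	pid_t getpid(void)
	{
		static pid_t (*real_getpid)(void);

		if (!real_getpid)
			real_getpid = (pid_t (*)(void))dlsym(RTLD_NEXT, "getpid");
		return translate_pid(real_getpid());
	}

Built as a shared object and run under LD_PRELOAD, it would only affect
the restarted applications.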
Robin
On Tue, 2005-11-15 at 05:17 -0600, Robin Holt wrote:
> Can't you just build a restart preloader which intercepts system calls
> and translates pids? Wouldn't this keep the kernel simpler and only
> affect those applications that are being restarted? Christoph, I
> added you since you seem to tirelessly promote using preloaders to
> work around this type of issue. Is it possible?
Statically linked applications really throw a pretty big monkey wrench
into that kind of plan. I'd hate to give up on _any_ statically linked
app from the beginning.
-- Dave
Quoting Paul Jackson ([email protected]):
> > But in the end isn't that more complicated than our approach? The
> > kernel still needs to be modified to let processes request their pids,
>
> No - getpid() works, as always. Perhaps I don't understand your
> comment.
>
>
> > and now processes have to worry *always* about the value or range of
> > their pids, both at startup and restart.
>
> No - tasks get the pid the kernel gives them at fork, as always.
> The task keeps that exact same pid, across all checkpoints, restarts
> and migrations. Nothing that the application process has to worry
> about, either inside the kernel code or in userspace, beyond the fork
> code honoring the assigned pid range when allocating a new pid.
Ok, so we have fork code to dole out pid ranges per vserver, I see where
the app doesn't need to request a pid on startup. But what about restart?
Surely the app still needs to be restarted with the same pid - just that
now we are more trusting that the pid remains available because of the pid
ranges?
> No widespread kernel code change, compared to yours. As now, tasks
Note that while the patch is large, so far its main purpose is to introduce
a clean concept rather than hack the vpid idea in. The latter has been done
before, and only requires intercepting the points where pids go from user
to kernel. This leaves the question of which pid is which more ambiguous.
-serge
Quoting Paul Jackson ([email protected]):
> Have you crosstool'd built this for most arch's? I could imagine
Thanks for this suggestion. I've now done this for s390, and see how
to set it up easily for all arches - this should be a tremendous help!
thanks,
-serge
Serge E. Hallyn wrote:
> Quoting Paul Jackson ([email protected]):
>
There have been a few suggestions going back and forth.
Let me address them all at once.
(A) why a vpid?
For transparent checkpointing. Vserver for instance has not implemented
a checkpoint/restart yet, because without this concept it is not possible.
The moment you want transparent checkpoint, you need to deal with the fact
that the results of a getpid() are in a register (worst case) and upon
restart the system must provide the same pid on the different machine.
That immediately suggests pid range reservation... but see point (B) below.
(B) syscall interception and LD_PRELOAD:
In principle that is possible, but it leads to potentially inefficient code
and largely leaves the issue of pid space creation and migration on the table.
However it makes clear that as long as I keep the transformation or mappings
consistent between virtual and real, this is a quite useful concept.
The question now is how deep into the kernel do I have to drive it in order to
create an efficient implementation.
(C) Fixed PID range allocation:
That is completely unscalable and unnecessary:
First, PID range allocation at a global level (e.g. cluster level) requires some agent.
Given that PID_MAX ~ 2**22, that leaves us on 32-bit architectures with only
512 pidspaces (2**31 / 2**22 = 512; the negative range needs to be preserved I think).
However it is not unreasonable to assume that 512 different pidspaces per OS image
is no real restriction.
Hence, when a pidspace is migrated it will be assigned a different pidspace id.
Then going with kernelpid = (pidspace_id << 22) | vpid is an efficient means to
map between virtual pidspace and physical pidspace and vice versa.
All that needs to be managed is local pidspace allocation.
The translations from vpid <-> pid are very light weight as can be seen from the above
composition.
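E.g. (macro names made up):

	#define VPID_BITS			22
	#define MK_KERNEL_PID(space, vpid)	(((space) << VPID_BITS) | (vpid))
	#define VPID_OF(kpid)			((kpid) & ((1 << VPID_BITS) - 1))
	#define SPACE_OF(kpid)			((kpid) >> VPID_BITS)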
Take for example the vserver system. A local vserver agent could maintain the
pidspace allocation. On creation of a vserver it assigns the next available pidspace.
That pidspace id is internal to vserver and is not exported as a property of a vserver.
When a vserver is migrated to a different machine, a potentially different pidspace
is allocated, yet all the vpids remain the same.
(D) Cross compilation
I do all my work on s390 so that space is covered.
If I missed some of the issues that were raised let me know and we will try to address
those.
I am part of Serge's team and have been working on intercepting the various places
where virtual to real pid translations have to occur in the kernel.
It's still in pretty bad shape, but it boots for the default pid space (:-).
Off the top of my head I'd say there are about 40 places where the translation has to be done.
Many are in the /proc fs, some in the signal handling.
I hope by the end of the week I have something to post that gives an idea of how we think
this could be realized.
On Mon, Nov 14, 2005 at 11:15:01PM -0600, Serge E. Hallyn wrote:
> Quoting Paul Jackson ([email protected]):
> > Serge wrote:
> > > the vserver model
> >
> > What's that?
>
> :) Well a vserver pretends to be a full system of its own, though you
> can have lots of vservers on one machine. Processes in each virtual
> server see only other processes in the same vserver. However in
> vserver the pids they see are the real kernel pids - except for one
> process per vserver which can be the fakeinit. Other processes in the
> same vserver see it as pid 1, but to the kernel it is still known by
> its real pid.
Why not just use Xen? It can handle process migration from one virtual
machine to another just fine.
thanks,
greg k-h
Quoting Greg KH ([email protected]):
> On Mon, Nov 14, 2005 at 11:15:01PM -0600, Serge E. Hallyn wrote:
> > Quoting Paul Jackson ([email protected]):
> > > Serge wrote:
> > > > the vserver model
> > >
> > > What's that?
> >
> > :) Well a vserver pretends to be a full system of its own, though you
> > can have lots of vservers on one machine. Processes in each virtual
> > server see only other processes in the same vserver. However in
> > vserver the pids they see are the real kernel pids - except for one
> > process per vserver which can be the fakeinit. Other processes in the
> > same vserver see it as pid 1, but to the kernel it is still known by
> > its real pid.
>
> Why not just use Xen? It can handle process migration from one virtual
> machine to another just fine.
It handles vm migration, but not process migration. The most compelling
thing (imo) which the latter allows you to do is live OS upgrade under
an application. Xen won't let you do that. Of course load-balancing is
also more fine-grained and powerful with process set migration, and
the overhead of a full OS per migrateable job is quite heavy.
That's not to say we don't also want to use Xen :) - it has its own
advantages and I'm not intending to denigrate those. We just hope to
get both!
-serge
On Tue, 2005-11-15 at 08:47 -0800, Greg KH wrote:
> Why not just use Xen? It can handle process migration from one virtual
> machine to another just fine.
Xen is relatively slow compared to the approach that we want to use.
It's a pain in the neck to set up, especially if you want a _lot_ of
partitions. We were going to try to compare the relative performance of
the two approaches as the number of vservers and Xen VMs is
increased. We haven't found anyone brave enough to set up 100 Xen
guests on a single system. :)
The overhead of storing the application snapshots that we're envisioning
can be quite tiny compared to Xen. This becomes horribly important if
you want to store the snapshots for a bit, and not simply keep one
around for long enough to restore the image elsewhere.
Xen doesn't share Linux caches between partitions. So, as you increase
the number of Xen partitions, the overhead of storing things like the
'/' dentry goes up pretty linearly. Keeping only one Linux instance
around makes such things much nicer to share.
The laundry-list of advantages is pretty long. This is starting to
sound like a good OLS paper :)
-- Dave
Well ... from your response, Dave, I think you understood what I was
saying. Thanks.
> The main issues I worry about with such a static allocation scheme are
> getting the allocation patterns right, without causing new restrictions
> on the containers.
Yes - the basic problem with pre-allocating static containers is that
you have to pre-allocate them ;).
> This kind of scheme is completely thrown out the
> window if someone wanted to start a process on their disconnected laptop
> and later migrate it to another machine when they connect back up to the
> network.
Transparent relocation without anticipating and preparing in _some_
way for that relocation prior to starting the job has the potential
to be one of those "Just say no to crazy requirements" moments. I am
sure some marketing folks don't agree.
> You're basically concerned about pids "leaking" across containers, and
> confusing applications in the process? That's a pretty valid concern.
Partly that, yes. Pids have been a system-wide notion since forever.
They get buried in lots of places and uses.
There is a natural tendency in these virtualization efforts to put
blinders on, and be encouraged by the potential ease of solving the
first 80% or 90% of the problem.
The kernel needs to be clear and consistent in what notions it
supports, avoiding getting caught between two inconsistent models
without a clear definition of which model applies when.
> we'll end up modifying a _ton_ of stuff in the process of doing this.
That's the kicker. Pid remapping seems useless by itself unless the
rest of this stuff is modified as well, to make a workable solution.
I'm looking for the larger design, before deciding on individual
patches.
My intuition is that somehow or other, jobs that have the potential
to be restarted or relocated have to start their life in a container
that isolates them.
My mind is wandering now to something Xen like, perhaps. Something
Open Source, available to us all, but with key virtualizations in
middleware rather then the kernel. See also the results of a Google
search on "checkpoint restart comparison zap" and the pods of Zap:
http://www.cs.cmu.edu/~sosman/publications/osdi2002/ZAP-OSDI.html.
This Zap paper is a good example of the design overview required to
motivate the individual patches needed to provide such a solution.
Their pod construct is not exactly what I was concocting in my
replies on this thread so far, but is far better thought out and more
persuasive, even to me.
A proposal that integrated a next generation Zap into the kernel
would be most interesting. We don't have to keep it isolated to a
loadable kernel module that hacks the system call table, which may
give us leverage on improving it in other ways.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
I don't think that the checkpoint/restart/relocation design should be
driven by micro-optimizations of getpid(). It needs to be driven by
a design that addresses (for better or worse) the many larger questions
encountered in such an effort.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Paul Jackson wrote:
> I don't think that the checkpoint/restart/relocation design should be
> driven by micro-optimizations of getpid(). It needs to be driven by
> a design that addresses (for better or worse) the many larger questions
> encountered in such an effort.
>
Neither did we suggest one.
My note was simply addressing some of the issues that were thrown out
with the alternative articulated in the many responses.
-- Hubertus
Quoting Paul Jackson ([email protected]):
> An additional per-task attribute, set by a batch manager typically
> when it starts a job on a checkpointable, restartable, movable
> "virtual server" connects the job parent task, and hence all its
> descendents in that job, with a named kernel object that has among its
> attributes a pid range. When fork is handing out new pids, it honors
> that pid range. User level code, in the batch manager or system
> administration layer manages a set of these named virtual servers,
> including assigning pid ranges to them, and puts what is usually the
> same such configuration on each server in the farm.
I guess the one thing I really don't see supported here (apart from the
system/laptop joins the network after spawning a job which has been
mentioned) is restarting multiple simultaneous instances of a single
checkpoint. In the pidspace approach, each restarted instance would
have a different pidspace id, but use the same vpids. In the
preallocation scheme, only one pid has been reserved at checkpoint for
each process instance.
Is there a simple way to solve this? (And how valid a concern is
this? :)
Other than that, I guess your approach is growing on me...
-serge
Serge E. Hallyn wrote:
> Quoting Paul Jackson ([email protected]):
>
>>An additional per-task attribute, set by a batch manager typically
>>when it starts a job on a checkpointable, restartable, movable
>>"virtual server" connects the job parent task, and hence all its
>>descendents in that job, with a named kernel object that has among its
>>attributes a pid range. When fork is handing out new pids, it honors
>>that pid range. User level code, in the batch manager or system
>>administration layer manages a set of these named virtual servers,
>>including assigning pid ranges to them, and puts what is usually the
>>same such configuration on each server in the farm.
>
And that's how we implement this.
The difference is that the pidrange-id is assigned on the fly,
that is when the virtual server is created or recreated after restart.
This, as described in my previous note, is more scalable, because I don't have
to do a global pidrange partitioning.
Global pidrange partitioning has implications: for instance, what if I simply
want to freeze an app only to restart it much later? This would freeze that
range automatically.
On process restart, we force fork to use a particular <pidspace/pid> for its
kernel pid assignment, rather than searching for a free one.
-- Hubertus
On Monday 14 November 2005 15:23, Serge E. Hallyn wrote:
> --
>
> I'm part of a project implementing checkpoint/restart processes.
> After a process or group of processes is checkpointed, killed, and
> restarted, the changing of pids could confuse them. There are many
> other such issues, but we wanted to start with pids.
>
I've read through the rest of this thread, but it seems to me that the real
problems are in the basic assumptions you are making that are driving the
rest of this effort and perhaps we should be examining those assumptions
instead of your patch.
For example, from what I've read (particularly Hubertus's post that the pid
could be in a register), I'm inferring that what you want to do is to be able
to checkpoint/restart an arbitrary process at an arbitrary time and without
any special support for checkpoint/restart in that process.
Also (c. f. Dave Hansen's post on the number of Xen virtual machines
required), you appear to think that the number of processes on the system
for which checkpoint/restart should be enabled is large (more or less the
same as the number of processes on the system).
Am I reading this correctly?
--
Ray Bryant
AMD Performance Labs Austin, Tx
512-602-0038 (o) 512-507-7807 (c)
Quoting Ray Bryant ([email protected]):
> On Monday 14 November 2005 15:23, Serge E. Hallyn wrote:
> > --
> >
> > I'm part of a project implementing checkpoint/restart processes.
> > After a process or group of processes is checkpointed, killed, and
> > restarted, the changing of pids could confuse them. There are many
> > other such issues, but we wanted to start with pids.
> >
>
> I've read through the rest of this thread, but it seems to me that the real
> problems are in the basic assumptions you are making that are driving the
> rest of this effort and perhaps we should be examining those assumptions
> instead of your patch.
Ok.
> For example, from what I've read (particularly Hubertus's post that the pid
> could be in a register), I'm inferring that what you want to do is to be able
> to checkpoint/restart an arbitrary process at an arbitrary time and without
> any special support for checkpoint/restart in that process.
Yes.
> Also (c. f. Dave Hansen's post on the number of Xen virtual machines
> required), you appear to think that the number of processes on the system
> for which checkpoint/restart should be enabled is large (more or less the
> same as the number of processes on the system).
Right.
> Am I reading this correctly?
As far as I can see, yes.
-serge
On Tuesday 15 November 2005 13:41, Serge E. Hallyn wrote:
> Quoting Ray Bryant ([email protected]):
> > On Monday 14 November 2005 15:23, Serge E. Hallyn wrote:
> > > --
> > >
> > > I'm part of a project implementing checkpoint/restart processes.
> > > After a process or group of processes is checkpointed, killed, and
> > > restarted, the changing of pids could confuse them. There are many
> > > other such issues, but we wanted to start with pids.
> >
> > I've read through the rest of this thread, but it seems to me that the
> > real problems are in the basic assumptions you are making that are
> > driving the rest of this effort and perhaps we should be examining those
> > assumptions instead of your patch.
>
> Ok.
>
> > For example, from what I've read (particularly Hubertus's post that the
> > pid could be in a register), I'm inferring that what you want to do is to
> > be able to checkpoint/restart an arbitrary process at an arbitrary time
> > and without any special support for checkpoint/restart in that process.
>
> Yes.
>
> > Also (c. f. Dave Hansen's post on the number of Xen virtual machines
> > required), you appear to think that the number of processes on the
> > system for which checkpoint/restart should be enabled is large (more or
> > less the same as the number of processes on the system).
>
> Right.
>
> > Am I reading this correctly?
>
> As far as I can see, yes.
>
> -serge
Personally, I think that these assumptions are incorrect for a
checkpoint/restart facility. I think that:
(1) It is really only possible to checkpoint/restart a cooperative process.
For this to work with uncooperative processes you have to figure out (for
example) how to save and restore the file system state. (e. g. how do you
get the file position set correctly for an open file in the restored program
instance?) And this doesn't even consider what to do with open network
connections.
Similarly, what does one do about the content of System V shared memory
regions or the contents of System V semaphores? I'm sure there are many
more such problems we can come up with a careful study of the Linux/Unix API.
(Note that "cooperation" in this context can also mean "willing to run inside
of a container of some kind that supports checkpoint/restart".)
So you can probably only checkpoint the process at certain points in its
lifetime, points which the application should be willing to identify in some
way. And I would argue that at such points in time, you can require that
the current register state doesn't include the results of a system call such
as getpid(), couldn't you?
(2) Checkpoint/Restart really only makes sense for a long running, resource
intensive job. (e. g. for a job that is doing a lot of work and hence, for
which, recovery is hard -- perhaps as hard as re-running the entire job).
By their very nature, there are probably only a few such jobs running on the
system. If there are lots of such jobs on the system, then re-running each
one can't be that hard, can it?
So, I guess my question wrt the task_pid API is the following: Given that
there are a lot of other problems to solve before transparent checkpointing
of uncooperative processes is possible, why should this partial solution be
accepted into the main line kernel and "set in stone" so to speak?
Don't get me wrong, I would love for there to be a commonly accepted
checkpoint/restart API. But I don't think that this can be done
transparently at the kernel level and without some cooperation from the
target task.
--
Ray Bryant
AMD Performance Labs Austin, Tx
512-602-0038 (o) 512-507-7807 (c)
Quoting Ray Bryant ([email protected]):
> On Tuesday 15 November 2005 13:41, Serge E. Hallyn wrote:
> > Quoting Ray Bryant ([email protected]):
> > > On Monday 14 November 2005 15:23, Serge E. Hallyn wrote:
> > > > --
> > > >
> > > > I'm part of a project implementing checkpoint/restart processes.
> > > > After a process or group of processes is checkpointed, killed, and
> > > > restarted, the changing of pids could confuse them. There are many
> > > > other such issues, but we wanted to start with pids.
> > >
> > > I've read through the rest of this thread, but it seems to me that the
> > > real problems are in the basic assumptions you are making that are
> > > driving the rest of this effort and perhaps we should be examining those
> > > assumptions instead of your patch.
> >
> > Ok.
> >
> > > For example, from what I've read (particularly Hubertus's post that the
> > > pid could be in a register), I'm inferring that what you want to do is to
> > > be able to checkpoint/restart an arbitrary process at an arbitrary time
> > > and without any special support for checkpoint/restart in that process.
> >
> > Yes.
> >
> > > Also (c. f. Dave Hansen's post on the number of Xen virtual machines
> > > required), you appear to think that the number of processes on the
> > > system for which checkpoint/restart should be enabled is large (more or
> > > less the same as the number of processes on the system).
> >
> > Right.
> >
> > > Am I reading this correctly?
> >
> > As far as I can see, yes.
> >
> > -serge
>
> Personally, I think that these assumptions are incorrect for a
> checkpoint/restart facility. I think that:
>
> (1) It is really only possible to checkpoint/restart a cooperative process.
> For this to work with uncooperative processes you have to figure out (for
> example) how to save and restore the file system state. (e. g. how do you
> get the file position set correctly for an open file in the restored program
> instance?) And this doesn't even consider what to do with open network
> connections.
Many of these problems have been solved before. See for instance Zap
and ckpt (http://www.cs.wisc.edu/~zandy/ckpt), and more examples at
http://www.checkpointing.org/
Certainly there are pieces of state which are harder to correctly
restore than others, but even network connections have been shown to be
migrateable (see zap, and tcpcp for cryopid). In fact IIUC one of the
hardest things to deal with is ttys.
> Similarly, what does one do about the content of System V shared memory
> regions or the contents of System V semaphores? I'm sure there are many
> more such problems we can come up with a careful study of the Linux/Unix API.
Yup, sysv shmem has been handled before... And yes, there are plenty
more :) /proc files, device files, pipes...
> (Note that "cooperation" in this context can also mean "willing to run inside
> of a container of some kind that supports checkpoint/restart".)
>
> So you can probably only checkpoint the process at certain points in its
> lifetime, points which the application should be willing to identify in some
This has been demonstrated to be not true. Again, see ckpt for a simple
example.
Oh, right, well willingness to run inside of a container *is* something we
would require :) Not needed for checkpointing a single process,
however - see ckpt.
> So, I guess my question wrt the task_pid API is the following: Given that
> there are a lot of other problems to solve before transparent checkpointing
> of uncooperative processes is possible, why should this partial solution be
> accepted into the main line kernel and "set in stone" so to speak?
It shouldn't. This was a request for comment, not for inclusion :) As
you say there are lots of pieces to put together, and we simply decided
to start with trying to solve this one.
In fact what I sent out doesn't even do the stuff we've been talking
about lately - that will come later this week. This patchset merely lays
the groundwork for that.
-serge
Serge wrote:
> restarting multiple simulatenous instances of a single
> checkpoint. ... Is there a simple way to solve this?
> (And how valid a concern is this? :)
Offhand, I don't see a way to resolve it in the preallocation scheme.
That's not on my list of stuff to worry about. But, who knows,
someone else might find that a valid concern.
> Other than that, I guess your approach is growing on me...
Oh dear. I'm drifting away from advocating a pid-range preallocation
and toward thinking we need a more systematic approach, design and
architecture. This isn't just pids. Simple range based preallocation
won't help much on some of the other resources that we need to virtualize.
The Zap pods are sounding good to me right now, properly embedded
in the kernel rather than hacking the syscall table via a module.
In any case, I am suspecting that starting the job in some sort
of nice container should be a prerequisite for relocating or
checkpoint/restarting the job.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
> This has been demonstrated to be not true. Again, see ckpt for a simple
> example.
>
> Oh, right, well willingness to run inside of a container *is* something we
> would require :) Not needed for checkpointing a single process,
> however - see ckpt.
Be careful not to assume that some set of requirements on our result
is an appropriate set because, for each requirement, someone else has
demonstrated a solution that meets that requirement.
Sometimes there are tradeoffs. For example, ckpt will checkpoint/restart
a single task without kernel support, but doesn't preserve (from its README
at http://www.cs.wisc.edu/~zandy/ckpt/README):
- File descriptors of open files, special devices,
pipes, and sockets;
- Interprocess communication state (such as shared memory, semaphores,
mutex, messages);
- Kernel-level thread state;
- Process identifiers, including process id, process group
id, user id, or group id.
and doesn't work with statically bound programs (it uses LD_PRELOAD).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Ray Bryant wrote:
> Personally, I think that these assumptions are incorrect for a
> checkpoint/restart facility. I think that:
>
> (1) It is really only possible to checkpoint/restart a cooperative process.
What do you mean by cooperative? That the code should be modified to
cooperate with a checkpoint/restart tool? Do you have something else in mind?
> For this to work with uncooperative processes you have to figure out (for
> example) how to save and restore the file system state.
Files are definitely very difficult to handle efficiently but there are
ways to deal with them. One way is not to deal with them at all and let the
application organize its data in such a way that we don't have to
checkpoint the file system; shared storage is one solution, copying files
from a checkpointed node to another node is another. It can be very
inefficient but it works.
But, I agree with you, we don't want to checkpoint a filesystem.
> (e. g. how do you get the file position set correctly for an open file in
> the restored program instance?)
well, if the file is available, lseek() should do the job. Pipes are more
difficult to handle than regular files in fact.
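For a regular file, a user-space restarter only needs something like
this (the saved_* values being whatever the checkpoint recorded):

	#include <fcntl.h>
	#include <unistd.h>

	static int restore_file(const char *saved_path, int saved_flags,
				off_t saved_offset, int saved_fd)
	{
		int fd = open(saved_path, saved_flags);

		if (fd < 0)
			return -1;
		lseek(fd, saved_offset, SEEK_SET);
		if (fd != saved_fd) {
			dup2(fd, saved_fd);	/* back on its original descriptor */
			close(fd);
		}
		return 0;
	}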
> And this doesn't even consider what to do with open network connections.
network connections are indeed very tricky. The network code is very
complex, very large, plenty of protocols to handle but it can be done for
TCP/IP by blocking the traffic, checkpointing the data and checkpointing
the PCBs. But tell me more, what are the main issues for you?
Private interconnects are a challenge.
> Similarly, what does one do about the content of System V shared memory
> regions or the contents of System V semaphores?
Well, they have to be constrained in a known set of processes, or a
container, to make sure we are not checkpointing a moving target.
> I'm sure there are many more such problems we can come up with a careful
> study of the Linux/Unix API.
Oh yes, the UNIX API is very large but in checkpoint/restart we care more
about implementation. This can be tricky.
> (Note that "cooperation" in this context can also mean "willing to run inside
> of a container of some kind that supports checkpoint/restart".)
Indeed !
We need an isolation mechanism to make sure we control the boundaries of an
application. We don't want any leaks when we initiate a checkpoint.
> So you can probably only checkpoint the process at certain points in its
> lifetime, points which the application should be willing to identify in some
> way.
We do need to reach a quiescence point. A SIGSTOP is enough, or a
container-wide schedule(), a la software suspend. But no more.
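For a plain process group, even user space can reach such a point, e.g.:

	kill(-pgid, SIGSTOP);	/* stop every task in the group */
	/* ... take the checkpoint ... */
	kill(-pgid, SIGCONT);	/* let it run again */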
> And I would argue that at such points in time, you can require that
> the current register state doesn't include the results of a system call such
> as getpid(), couldn't you?
Well, if that register holds a virtualized pid, this is no longer an
issue, no?
> (2) Checkpoint/Restart really only makes sense for a long running, resource
> intensive job. (e. g. for a job that is doing a lot of work and hence, for
> which, recovery is hard -- perhaps as hard as re-running the entire job).
HPC industry is indeed an obvious target.
However, we have successfully checkpointed desktop applications like
openoffice, thunderbird, mozilla, emacs, etc. We are still working on pty
in order to migrate terminals. We think it can also be useful in that area
and others.
> By their very nature, there are probably only a few such jobs running on the
> system. If there are lots of such jobs on the system, then re-running each
> one can't be that hard, can it?
Hmm, I didn't get your point here. Can you elaborate?
> So, I guess my question wrt the task_pid API is the following: Given that
> there are a lot of other problems to solve before transparent checkpointing
> of uncooperative processes is possible, why should this partial solution be
> accepted into the main line kernel and "set in stone" so to speak?
Well, let's say that we want to present this one step after the other and
try to have each step brings some interesting value to the linux kernel.
Process aggregation is the first big step, other projects have shown
interest in this area, PAGG for instance. Isolation is another. The
virtualization step could be thought as dedicated to checkpoint/restart but
we're pretty sure it should help some projects like vserver that need to
virtualize some ancestor pid. On that subject, having a way to manage
cluster wide pids could be useful to HPC batch managers.
> Don't get me wrong, I would love for there to be a commonly accepted
> checkpoint/restart API. But I don't think that this can be done
> transparently at the kernel level and without some cooperation from the
> target task.
Well, we've already migrated some pretty ugly applications, database
engines, without modifying them :)
C.
Paul Jackson wrote:
> Oh dear. I'm drifting away from advocating a pid-range preallocation
> and toward thinking we need a more systematic approach, design and
> architecture. This isn't just pids. Simple range based preallocation
> won't help much on some of the other resources that we need to virtualize.
Ah ! you said the word: "virtualize".
> The Zap pods are sounding good to me right now, properly embedded
> in the kernel rather than hacking the syscall table via a module.
Hacking the syscall table via a module is evil and does not work. You can't
fix up pids in a signal's siginfo that way, you won't support NPTL, etc.
> In any case, I am suspecting that starting the job in some sort
> of nice container should be a prerequisite for relocating or
> checkpoint/restarting the job.
Indeed. Did you ever think about using PAGG as a foundation for a
checkpoint/restart container ?
Aggregation and isolation are key requirements for checkpoint/restart. And
then the next one on the list is private namespaces, or
virtualization, depending on what you call it :)
C.
Cedric wrote:
> Indeed. Did you ever think about using PAGG as a foundation for a
> checkpoint/restart container ?
Other than the name, I didn't realize that PAGG provided a
good foundation for this work. However, I should let my
PAGG colleagues address that further if worthwhile.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Cedric wrote:
> Well, we've already migrated some pretty ugly applications, database
> engines, without modifying them :)
You will have to teach us how it is done.
As you likely know by now, Linux doesn't incorporate new technology
en mass. We sniff and poke at it, break it down into its constituent
elements and reconstitute it in ways that seem to fit Linux better.
It's all part of how we keep the body Linux healthy.
Especially for things like this that touch many of the more interesting
kernel constructs, the end result seldom resembles the initial input.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
I did some checkpoint/restart work on Linux about 5 years ago (you may
still be able to google "CRAK"), so I'm jumping in with my 2 cents.
> Personally, I think that these assumptions are incorrect for a
> checkpoint/restart facility. I think that:
>
> (1) It is really only possible to checkpoint/restart a
> cooperative process.
It's hard, but not impossible, at least theoretically.
> For this to work with uncooperative processes you have to
> figure out (for example) how to save and restore the file
> system state. (e.g. how do you
> get the file position set correctly for an open file in the
> restored program instance?)
This is actually one of the simplest problems in checkpoint/restart.
You'd need kernel support to save the state, and restart could be done
entirely in user space to restore the file descriptors.
> And this doesn't even consider what to do with open network
> connections.
Right, this is really hard. I played with it 5 years ago and had
semi-success restoring network connections (with my limited understanding
of Linux networking and some really ugly hacks). I could restart a
killed remote Emacs X session with about a 50% success rate.
> Similarly, what does one do about the content of System V
> shared memory regions or the contents of System V semaphores? I'm sure
> there are many more such problems we can come up with a careful study
> of the Linux/Unix API.
>
> (Note that "cooperation" in this context can also mean
> "willing to run inside of a container of some kind that supports
> checkpoint/restart".)
>
> So you can probably only checkpoint the process at certain
> points in its lifetime, points which the application should be willing
> to identify in some way. And I would argue that at such points in
> time, you can require that the current register state doesn't include
> the results of a system call such as getpid(), couldn't you?
Again, it IS very hard, but I don't think it's impossible to have
transparent checkpoint/restart. I mean, it can't be more difficult than
writing Linux from scratch, can it? :-)
> So, I guess my question wrt the task_pid API is the
> following: Given that there are a lot of other problems to solve
> before transparent checkpointing of uncooperative processes is possible,
> why should this partial solution be accepted into the main line kernel
> and "set in stone" so to speak?
I agree with this. Until we see a mature checkpoint/restart solution
actually implemented, there is no point in doing the vpid thing.
On Tue, Nov 15, 2005 at 06:24:44PM -0800, Hua Zhong (hzhong) wrote:
> I did some checkpoint/restart work on Linux about 5 years ago (you may
> still be able to google "CRAK"), so I'm jumping in with my 2 cents.
Ditto - CryoPID.
> > Personally, I think that these assumptions are incorrect for a
> > checkpoint/restart facility. I think that:
> >
> > (1) It is really only possible to checkpoint/restart a
> > cooperative process.
I agree that some processes are, but the majority are not.
> It's hard, but not impossible, at least theoretically.
Not that hard. Most of the information, if not exported by the
kernel through other means, can be ascertained from within the
process itself. For example, FD offsets can be obtained with lseek,
network connection endpoints with get{sock,peer}name, etc. With a
little help from ptrace, it's trivial.
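For example (an illustrative sketch, not actual CryoPID code):

#include <unistd.h>
#include <sys/socket.h>

/* Current offset of an fd; returns -1 for pipes and sockets,
 * which are not seekable. */
off_t fd_offset(int fd)
{
        return lseek(fd, 0, SEEK_CUR);
}

/* Remote endpoint of a connected socket fd. */
int fd_peer(int fd, struct sockaddr_storage *ss)
{
        socklen_t len = sizeof(*ss);

        return getpeername(fd, (struct sockaddr *)ss, &len);
}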
> > For this to work with uncooperative processes you have to
> > figure out (for example) how to save and restore the file
> > system state. (e.g. how do you get the file position set
> > correctly for an open file in the restored program instance?)
>
> This is actually one of the simplest problems in checkpoint/restart.
Remote syscalls are how CryoPID does it: inject some code into the
target and execute it. CryoPID can also scrape out the contents of
unlinked (eg, temporary) files (in the svn version). You can establish
which FIFOs joined which FDs of which processes through /proc, and
capture the in-flight buffers with some more ptracing.
> You'd need kernel support to save the state, and restart could be
> done entirely in user space to restore the file descriptors.
Just for file offsets? I disagree =)
> > And this doesn't even consider what to do with open network
> > connections.
>
> Right, this is really hard. I played with it 5 years ago and I had semi
> success on restoring network connections (with my limited understanding
> on Linux networking and some really ugly hacks). I could restart a
> killed remote Emacs X session with about 50% success rate.
TCP connections can be done with tcpcp (tcpcp.sf.net) and CryoPID
already supports it (although the patch hasn't been ported past
2.6.11).
UDP connections are not a hassle, being stateless (though CryoPID
doesn't support them yet, because it's too easy :)
Unix sockets can be reconnected, but of course protocols might get
hopelessly confused. However, I am working with the Gtk+ display
migration code to freeze Gtk+ applications. gtk-demo freezes quite
happily, as does gvim. The Gtk+ guys are working hard to squash some
remaining bugs so that more apps are supported.
However, with some forethought, you could hook your X app up to
something like Xmove and migrate any X application that way.
> > Similarly, what does one do about the content of System V shared
> > memory regions or the contents of System V semaphores?
The contents are not so much an issue as the ids themselves. They
face much the same problem as PIDs - you attach/detach to a SHM
segment by its shmid. These are allocated by the kernel, but cached
in the process. Some method of requesting a particular shmid would
make life easier for checkpointing. Ditto semaphores and message
queues.
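The problem in miniature (illustrative only):

#include <sys/ipc.h>
#include <sys/shm.h>

/* The kernel picks the id and the process caches it; on restart
 * there is no way to ask for the old id back. */
int attach_segment(void **addr)
{
        int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);

        if (id < 0)
                return -1;
        *addr = shmat(id, NULL, 0);
        if (*addr == (void *)-1)
                return -1;
        return id;
}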
> > I'm sure there are many more such problems we can come up with a
> > careful study of the Linux/Unix API.
For many processes, there isn't all that much that can't be saved
from userspace. Help from kernel space would certainly make things
easier/faster/more reliable. This task_pid API being one of them.
> > So, I guess my question wrt the task_pid API is the
> > following: Given that there are a lot of other problems to
> > solve before transparent checkpointing of uncooperative
> > processes is possible, why should this partial solution be
> > accepted into the main line kernel and "set in stone" so to
> > speak?
>
> I agree with this. Before we see a mature checkpoint/restart solution
> already implemented, there is no point in doing the vpid thing.
Fair enough. I'm actually implementing multithreading support for
CryoPID at the moment. It currently resumes with the original PIDs
by editing last_pid in /dev/kmem and forking - a temporary racy
hack until a better solution (such as this) appears. (It fails if
the PID is in use, or if anybody else on the system fork()s before
you do). I'll give the task_pid patches a try and see how much
easier life is.
I'm delighted to see Serge and the vserver guys putting the time
into this! Thanks!
Regards,
Bernard.
> > Have you crosstool'd built this for most arch's? I could imagine
> > some piece of code having a local or other struct variable named 'pid'
> > that would be broken by a mistake in this change. This could be so
> > whether the change was done by a script, or by hand. Probably need
> > to test 'allyesconfig' too.
>
> Argh - in fact it appears I compiled and booted my 2.6.14 version,
> not this 2.6.15-rc1 version. Another patch is needed for this to
> compile and boot (on a power5 system, in addition to a patch pending
> for -mm to make rpaphp_pci compile). Sorry.
> @@ -2925,7 +2925,7 @@ void submit_bio(int rw, struct bio *bio)
> if (unlikely(block_dump)) {
> char b[BDEVNAME_SIZE];
> printk(KERN_DEBUG "%s(%d): %s block %Lu on %s\n",
> - current->comm, current->pid,
> + current->comm, task_pid(current),
> (rw & WRITE) ? "WRITE" : "READ",
> (unsigned long long)bio->bi_sector,
> bdevname(bio->bi_bdev,b));
...and now printk is close to useless, because the user can't know to which
pidspace that pid belongs. Oops.
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms
On Nov 13, 2005, at 10:22, Pavel Machek wrote:
>> @@ -2925,7 +2925,7 @@ void submit_bio(int rw, struct bio *bio)
>> if (unlikely(block_dump)) {
>> char b[BDEVNAME_SIZE];
>> printk(KERN_DEBUG "%s(%d): %s block %Lu on %s\n",
>> - current->comm, current->pid,
>> + current->comm, task_pid(current),
>> (rw & WRITE) ? "WRITE" : "READ",
>> (unsigned long long)bio->bi_sector,
>> bdevname(bio->bi_bdev,b));
>
> ...and now printk is close to useless, because the user can't know to
> which pidspace that pid belongs. Oops.
Uhh, this patch doesn't introduce any kind of virtualization yet.
When that happens, _this_ code will remain the same (it wants the
real pid), but *other* code will switch to use task_vpid(current)
instead. This is an extremely literal translation of current->pid to
task_pid(current), both of which do exactly the same thing.
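i.e., roughly (per the patch description, with ->pid renamed to ->__pid):

static inline pid_t task_pid(struct task_struct *tsk)
{
        return tsk->__pid;      /* the renamed field; no translation yet */
}

No translation, no magic.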
Cheers,
Kyle Moffett
--
There is no way to make Linux robust with unreliable memory
subsystems, sorry. It would be like trying to make a human more
robust with an unreliable O2 supply. Memory just has to work.
-- Andi Kleen
On Sun, 2005-11-13 at 15:22 +0000, Pavel Machek wrote:
> > @@ -2925,7 +2925,7 @@ void submit_bio(int rw, struct bio *bio)
> > if (unlikely(block_dump)) {
> > char b[BDEVNAME_SIZE];
> > printk(KERN_DEBUG "%s(%d): %s block %Lu on %s\n",
> > - current->comm, current->pid,
> > + current->comm, task_pid(current),
> > (rw & WRITE) ? "WRITE" : "READ",
> > (unsigned long long)bio->bi_sector,
> > bdevname(bio->bi_bdev,b));
>
> ...and now printk is close to useless, because the user can't know to which
> pidspace that pid belongs. Oops.
That is true, but only if we print the virtualized pid. Before we go
and actually implement the pid virtualization, we probably need a
thorough audit of this kind of stuff to see what we really want.
There will always be a "real pid" (the real thing in tsk->__pid) backing
whatever is virtualized and presented to getpid(). I would imagine that
this is the same one that needs to go into dmesg.
-- Dave
Hi!
> >>@@ -2925,7 +2925,7 @@ void submit_bio(int rw, struct bio *bio)
> >> if (unlikely(block_dump)) {
> >> char b[BDEVNAME_SIZE];
> >> printk(KERN_DEBUG "%s(%d): %s block %Lu on %s\n",
> >>- current->comm, current->pid,
> >>+ current->comm, task_pid(current),
> >> (rw & WRITE) ? "WRITE" : "READ",
> >> (unsigned long long)bio->bi_sector,
> >> bdevname(bio->bi_bdev,b));
> >
> >...and now printk is close to useless, because the user can't know to
> >which pidspace that pid belongs. Oops.
>
> Uhh, this patch doesn't introduce any kind of virtualization yet.
> When that happens, _this_ code will remain the same (it wants the
> real pid), but *other* code will switch to use task_vpid(current)
> instead. This is an extremely literal translation of current->pid to
> task_pid(current), both of which do exactly the same thing.
Hmm... it is hard to judge a patch without context. Anyway, can't we
get process snapshot/resume without virtualizing pids? Could we switch
to 128 bits so that pids are never reused, or something like that?
Pavel
--
Thanks, Sharp!
On Wed, 2005-11-16 at 21:36 +0100, Pavel Machek wrote:
> Hmm... it is hard to judge a patch without context. Anyway, can't we
> get process snapshot/resume without virtualizing pids? Could we switch
> to 128 bits so that pids are never reused, or something like that?
That might work fine for a managed cluster, but it wouldn't be a good
fit if you ever wanted to support something like a laptop in
disconnected operation, or if you ever want to restore the same snapshot
more than once. There may also be some practical userspace issues
making pids that large.
I also hate bloating types and making them sparse just for the hell of
it. It is seriously demoralizing to do a ps and see
7011827128432950176177290 staring back at you. :)
-- Dave
> Could we switch
> to 128 bits so that pids are never reused, or something like that?
Not easily. We've got a very cool pid-dispenser at this point that
has excellent performance and scalability, but requires a bit map,
one bit per potential pid. That bitmap can't exceed a small percentage
of main memory on most any configuration, constraining us to perhaps
20 to 30 bits. The code currently has a 22 bit arbitrary limit.
Something like 30 bits would usually only make sense on the terabyte
NUMA monster boxes.
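(For scale, by my arithmetic: a 22-bit pid space needs 2^22 bits = 512 KB
of bitmap, while 30 bits would need 128 MB.)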
128-bit UUID technology scales fine, but adds quite a few compute
cycles per allocation, and would blow out a whole lot of user code
expecting to put a pid in a machine word.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Hi!
> > Hmm... it is hard to judge a patch without context. Anyway, can't we
> > get process snapshot/resume without virtualizing pids? Could we switch
> > to 128 bits so that pids are never reused, or something like that?
>
> That might work fine for a managed cluster, but it wouldn't be a good
> fit if you ever wanted to support something like a laptop in
> disconnected operation, or if you ever want to restore the same snapshot
> more than once. There may also be some practical userspace issues
> making pids that large.
>
> I also hate bloating types and making them sparse just for the hell of
> it. It is seriously demoralizing to do a ps and see
> 7011827128432950176177290 staring back at you. :)
Well, doing cat /var/something/foo.pid and seeing the pid of an unrelated
process is wrong, too... especially if you try to kill it....
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms
Quoting Pavel Machek ([email protected]):
> Hi!
>
> > > Hmm... it is hard to judge a patch without context. Anyway, can't we
> > > get process snapshot/resume without virtualizing pids? Could we switch
> > > to 128 bits so that pids are never reused, or something like that?
> >
> > That might work fine for a managed cluster, but it wouldn't be a good
> > fit if you ever wanted to support something like a laptop in
> > disconnected operation, or if you ever want to restore the same snapshot
> > more than once. There may also be some practical userspace issues
> > making pids that large.
> >
> > I also hate bloating types and making them sparse just for the hell of
> > it. It is seriously demoralizing to do a ps and see
> > 7011827128432950176177290 staring back at you. :)
>
> Well, doing cat /var/something/foo.pid and seeing the pid of an unrelated
> process is wrong, too... especially if you try to kill it....
Good point. However, the foo.pid scheme is incompatible with
checkpoint/restart and migration regardless.
a. what good is trying to kill something using such a file if
the process is checkpointed+killed, to be restarted later?
b. it is expected that any files used by checkpointable
processes exist on a network fs, so that the fd can be moved.
What good is foo.pid if it's on a network filesystem?
So if you wanted to checkpoint and restart/migrate a process with a
foo.pid type of file, you might need to start it with a private
tmpfs in a private namespace. That part is trivial to do as part
of the management tools, though checkpointing a whole tmpfs per process
could be unfortunate.
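Roughly, in the management tool (a sketch assuming Janak's unshare()
patch is available; clone(CLONE_NEWNS) would do without it):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/mount.h>

/* Give the job a private namespace with its own tmpfs on /var/run
 * before exec'ing it, so foo.pid files stay local to the pidspace. */
int private_var_run(void)
{
        if (unshare(CLONE_NEWNS) < 0)
                return -1;
        return mount("none", "/var/run", "tmpfs", 0, NULL);
}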
-serge
On 20 Nov 2005, Pavel Machek stated:
> Hi!
>> I also hate bloating types and making them sparse just for the hell of
>> it. It is seriously demoralizing to do a ps and see
>> 7011827128432950176177290 staring back at you. :)
>
> Well, doing cat /var/something/foo.pid and seeing the pid of an unrelated
> process is wrong, too... especially if you try to kill it....
I'd venture to say that anything using multiple pid namespaces should be
mounting its own private /var/run subtree (and similar for other such
directories). Consider it as similar to the problem of colliding PIDs on
multiple physical hosts, or on a host and a pile of UMLs it's running:
the solution there, too, is to unshare /var/run (by not NFS-exporting it
in the latter two cases, by private mounts in the former).
(Isn't this PID namespace stuff meant for chroots anyway?)
--
`Y'know, London's nice at this time of year. If you like your cities
freezing cold and full of surly gits.' --- David Damerell
"Serge E. Hallyn" <[email protected]> writes:
> --
>
> I'm part of a project implementing checkpoint/restart processes.
> After a process or group of processes is checkpointed, killed, and
> restarted, the changing of pids could confuse them. There are many
> other such issues, but we wanted to start with pids.
>
> Does something like this, presumably after much working over, seem
> mergeable?
This set of patches looks like a global s/current->pid/task_pid(current)/,
which may be an interesting exercise, but I don't see how this
helps your problem. And as has been shown by a few comments,
the process of making all of these changes is subject to human error.
Many of the interesting places that deal with pids and where you
want translation are not where the values are read from current->pid,
but where the values are passed between functions. Think about
the return value of do_fork.
There are also a lot of cases you haven't even tried to address.
You haven't touched process groups, or sessions.
At the current time the patch definitely fails the no-in-kernel-users
test, because it doesn't go as far as taking advantage
of the abstraction it attempts to introduce. Which means
other people can't read through the code and make sense
of what you are trying to do, or see if there is a better way.
I will also contend that walking down a path where getting the subtle
things wrong, like which flavor of pid you want to see, does not cause
compilation to fail is a problem.
Another question is how do your pid spaces nest. Currently
it sounds like you are taking the vserver model and allowing
everyone outside your pid space to see all of your internal
pids. Is this really what you want? Who do you report as
the source of your signal.
What pid does waitpid return when the parent of your pidspace exits?
What pid does waitpid return when both processes are in the same pidspace?
How does /proc handle multiple pid spaces?
While something allowing multiple pidspaces may be mergeable,
unnecessary and incomplete changes rarely are. This is a fundamental
change to the unix API so it will take a lot of scrutiny to get
merged.
Eric
"Serge E. Hallyn" <[email protected]> writes:
> Quoting Pavel Machek ([email protected]):
>> Hi!
>>
>> > > Hmm... it is hard to judge a patch without context. Anyway, can't we
>> > > get process snapshot/resume without virtualizing pids? Could we switch
>> > > to 128 bits so that pids are never reused, or something like that?
>> >
>> > That might work fine for a managed cluster, but it wouldn't be a good
>> > fit if you ever wanted to support something like a laptop in
>> > disconnected operation, or if you ever want to restore the same snapshot
>> > more than once. There may also be some practical userspace issues
>> > making pids that large.
>> >
>> > I also hate bloating types and making them sparse just for the hell of
>> > it. It is seriously demoralizing to do a ps and see
>> > 7011827128432950176177290 staring back at you. :)
>>
>> Well, doing cat /var/something/foo.pid and seeing the pid of an unrelated
>> process is wrong, too... especially if you try to kill it....
>
> Good point. However the foo.pid scheme is incompatible with
> checkpoint/restart and migration regardless.
Well, if you look at the other uses, vserver and bsd-style jail
mechanisms, the concept is not nearly so ridiculous.
The funny thing though is that this is a trivial thing to keep
sensible: run the process in a chroot or a private
namespace and the /var/something/foo.pid works fine.
>
> So if you wanted to checkpoint and restart/migrate a process with a
> foo.pid type of file, you might need to start it with a private
> tmpfs in a private namespace. That part is trivial to do as part
> of the management tools, though checkpointing a whole tmpfs per process
> could be unfortunate.
The way you describe it, I bet that tmpfs will likely be a small
fraction of the size of the processes that you are thinking
about checkpointing.
Eric
On Wed, 2005-12-07 at 07:46 -0700, Eric W. Biederman wrote:
> This set of patches looks like a global s/current->pid/task_pid(current)/
> Which may be an interesting exercise but I don't see how this
> helps your problem. And as has been shown by a few comments
> this process making all of these changes is subject to human error.
As with any good set of kernel changes, this is step one. Step two will
include calling something _other_ than task_pid(). But, in the
interests of small, incremental changes, this is what we decided to do
first.
> Many of the interesting places that deal with pids and where you
> want translation are not where the values are read from current->pid,
> but where the values are passed between functions. Think about
> the return value of do_fork.
Exactly. The next phase will focus on such places. Hubertus has some
stuff working that's probably not ready for LKML, but could certainly be
shared.
> There are also a lot of cases you haven't even tried to address.
> You haven't touched process groups, or sessions.
I preferred to keep the number of patches at 13, rather than 130. Those
are in the pipeline, but pids are the most important first step which
gets the most functionality.
> At the current time the patch definitely fails the no in kernel
> users test because it doesn't go as far as taking advantage
> of the abstraction it attempts to introduce. Which means
> other people can't read through the code and make sense
> of what you are trying to do or to see if there is a better way.
This isn't exactly a new feature, nor does it add any appreciable code
or complexity. I'm not sure that test even applies.
> I will also contend that walking down a path that does not cause
> compilation to fail when the subtle things like which flavor of
> pid you want to see is a problem.
I agree. I'm trying to figure out which way is best to go about this.
I have the feeling that using sparse tags like __user and __kernel is
the way to go, but we might also want to take the embedded-struct
approach, like atomic_t.
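The atomic_t-style version would be something like (names made up):

/* Distinct wrapper types, a la atomic_t, so that mixing kernel
 * and virtual pids fails to compile. */
typedef struct { pid_t pid; } kpid_t;   /* kernel pid */
typedef struct { pid_t pid; } vpid_t;   /* virtualized pid */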
> Another question is how do your pid spaces nest.
They don't, and thankfully nobody is asking for it. It adds
loads of complexity, and nobody apparently needs it.
> Currently
> it sounds like you are taking the vserver model and allowing
> everyone outside your pid space to see all of your internal
> pids. Is this really what you want?
For our application, yes. For vserver, maybe not. We'd like things
like 'top' to still work like normal, even though there are processes in
their own pidspace around.
> Who do you report as the source of your signal.
I've never dealt with signal enough from userspace to give you a good
answer. Can you explain the mechanics of how you would go about doing
this?
> What pid does waitpid return when the parent of your pidspace exits?
> What pid does waitpid return when both processes are in the same pidspace?
The pids coming out of system calls are always in the context of the
process doing the call.
> How does /proc handle multiple pid spaces?
I'm working on it :)
Right now, there's basically a hack in d_hash() to get new dentries for
each pidspace. It is horrible and causes a 50x decrease in performance
on some benchmarks like dbench.
I think the long-term solution is to make multiple, independent proc
mounts, and give each pidspace a separate filesystem view. That
requires some of the nifty new bind mount functionality and a chroot
when a new pidspace is created, but I think it works.
> While something allowing multiple pidspaces may be mergeable,
> unnecessary and incomplete changes rarely are. This is a fundamental
> change to the unix API so it will take a lot of scrutiny to get
> merged.
Lots of good questions. I think we need to take some of our initial,
private discussions and get them out on an open list somewhere. Any
suggestions? I hate creating new sourceforge projects :)
-- Dave
> > Many of the interesting places that deal with pids and where you
> > want translation are not where the values are read from current->pid,
> > but where the values are passed between functions. Think about
> > the return value of do_fork.
>
> Exactly. The next phase will focus on such places. Hubertus has some
> stuff working that's probably not ready for LKML, but could certainly be
> shared.
>
hmm wonder if it's not just a lot simpler to introduce a split in
"kernel pid" and "userspace pid", and have current->pid and
current->user_pid for that.
Using accessor macros doesn't sound like it gains much here.. (but then
I've not seen the full picture and you have)
On Wed, 2005-12-07 at 18:55 +0100, Arjan van de Ven wrote:
> > > Many of the interesting places that deal with pids and where you
> > > want translation are not where the values are read from current->pid,
> > > but where the values are passed between functions. Think about
> > > the return value of do_fork.
> >
> > Exactly. The next phase will focus on such places. Hubertus has some
> > stuff working that's probably not ready for LKML, but could certainly be
> > shared.
>
> hmm wonder if it's not just a lot simpler to introduce a split in
> "kernel pid" and "userspace pid", and have current->pid and
> current->user_pid for that.
>
> Using accessor macros doesn't sound like it gains much here.. (but then
> I've not seen the full picture and you have)
My first instinct was to introduce functions like get_user_pid() and
get_kernel_pid() which would effectively introduce the same split.
Doing that, we could keep from even referencing ->user_pid in normal
code, and keep things small and simpler for people like the embedded
folks.
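i.e., something like this (the ->user_pid field name is hypothetical):

static inline pid_t get_kernel_pid(struct task_struct *tsk)
{
        return tsk->pid;        /* the real, globally unique pid */
}

static inline pid_t get_user_pid(struct task_struct *tsk)
{
        return tsk->user_pid;   /* what this task's pidspace sees */
}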
For the particular application that we're thinking of, we really don't
want "user pid" and "kernel pid"; we want "virtualized" and
"unvirtualized", or "regular old pid" and "fancy new virtualized pid".
So, like in the global pidspace (which can see all pids and appears to
applications to be just like normal) you end up returning "kernel" pids
to userspace. That didn't seem to make sense.
-- Dave
> > hmm wonder if it's not just a lot simpler to introduce a split in
> > "kernel pid" and "userspace pid", and have current->pid and
> > current->user_pid for that.
> >
> > Using accessor macros doesn't sound like it gains much here.. (but then
> > I've not seen the full picture and you have)
>
> My first instinct was to introduce functions like get_user_pid() and
> get_kernel_pid() which would effectively introduce the same split.
> Doing that, we could keep from even referencing ->user_pid in normal
> code, and keep things small and simpler for people like the embedded
> folks.
well I don't see the point for the abstraction... get_kernel_pid() is no
better or worse than using current->pid directly, unless you want to do
"deep magic".
> For the particular application that we're thinking of, we really don't
> want "user pid" and "kernel pid"; we want "virtualized" and
> "unvirtualized", or "regular old pid" and "fancy new virtualized pid".
same thing, different name :)
> So, like in the global pidspace (which can see all pids and appears to
> applications to be just like normal) you end up returning "kernel" pids
> to userspace. That didn't seem to make sense.
hmm this is scary. If you don't have "unique" pids inside the kernel a
lot of stuff will subtly break. DRM for example (which has the pid
inside locking to track ownership and recursion), but I'm sure there's
many many cases like that. I guess the address of the task struct is the
ultimate unique pid in this sense.... but I suspect the way to get there
is first make a ->user_pid field, and switch all userspace visible stuff
to that, and then try to get rid of ->pid users one by one by
eliminating their uses...
but I'm really afraid that if you make the "fake" pid visible to normal
kernel code, too much stuff will go bonkers and end up with an eternal
stream of security hazards. "Magic" hurts here, and if you don't do
magic I don't see a reason to add an abstraction which in itself doesn't
mean anything or doesn't abstract anything....
Dave Hansen <[email protected]> writes:
> On Wed, 2005-12-07 at 07:46 -0700, Eric W. Biederman wrote:
>> There are also a lot of cases you haven't even tried to address.
>> You haven't touched process groups, or sessions.
>
> I preferred to keep the number of patches at 13, rather than 130. Those
> are in the pipeline, but pids are the most important first step which
> gets the most functionality.
Process groups are also pids, and there are direct relationships
between pids and process group ids and session ids. I agree keeping
the focus tight makes sense, but not so tight that you miss pieces.
>> At the current time the patch definitely fails the no in kernel
>> users test because it doesn't go as far as taking advantage
>> of the abstraction it attempts to introduce. Which means
>> other people can't read through the code and make sense
>> of what you are trying to do or to see if there is a better way.
>
> This isn't exactly a new feature, nor does it add any appreciable code
> or complexity. I'm not sure that test even applies.
A very common comment on the thread up to now is that people haven't
seen the big picture so they can't comment.
>> I will also contend that walking down a path that does not cause
>> compilation to fail when the subtle things like which flavor of
>> pid you want to see is a problem.
>
> I agree. I'm trying to figure out which way is best to go about this.
> I have the feeling that using sparse tags like __user and __kernel is
> the way to go, but we might also want to take the embedded struct
> approach like atomic_t.
You can also make the kernel functions take a pidspace argument,
and you will have instant compile failures :)
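e.g. (hypothetical signature):

pid_t task_pid(struct pidspace *ps, struct task_struct *tsk);

Every caller then has to say which pidspace it means, and a missed
conversion is a compile error rather than a silent bug.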
>> Another question is how do your pid spaces nest.
>
> They don't, and thankfully nobody is asking for it. It adds
> loads of complexity, and nobody apparently needs it.
So only very special pids can generate a pidspace. That will
tend to reduce the generality of the solution. What do you do
if I am running your code in a vserver?
There are enough possibilities in the solution space that a few extra
constraints would, I think, help in this case.
>> Currently
>> it sounds like you are taking the vserver model and allowing
>> everyone outside your pid space to see all of your internal
>> pids. Is this really what you want?
>
> For our application, yes. For vserver, maybe not. We'd like things
> like 'top' to still work like normal, even though there are processes in
> their own pidspace around.
I can see the desire. But top is already strongly challenged if
you are talking HPC computing. It only works on one node. Anything
that teaches proc about multiple node mounts ought to work just
as well for an internal checkpoint/restart implementation.
>> Who do you report as the source of your signal.
>
> I've never dealt with signal enough from userspace to give you a good
> answer. Can you explain the mechanics of how you would go about doing
> this?
Look at siginfo_t si_pid....
>> What pid does waitpid return when the parent of your pidspace exits?
>> What pid does waitpid return when both processes are in the same pidspace?
>
> The pids coming out of system calls are always in the context of the
> process doing the call.
This is of course the definition. But how you implement those cases
is interesting.
>> How does /proc handle multiple pid spaces?
>
> I'm working on it :)
>
> Right now, there's basically a hack in d_hash() to get new dentries for
> each pidspace. It is horrible and causes a 50x decrease in performance
> on some benchmarks like dbench.
>
> I think the long-term solution is to make multiple, independent proc
> mounts, and give each pidspace a separate filesystem view. That
> requires some of the nifty new bind mount functionality and a chroot
> when a new pidspace is created, but I think it works.
I think you will ultimately want a new filesystem namespace
not just a chroot, so you can ``virtualize'' your filesystem namespace
as well.
>> While something allowing multiple pidspaces may be mergeable,
>> unnecessary and incomplete changes rarely are. This is a fundamental
>> change to the unix API so it will take a lot of scrutiny to get
>> merged.
>
> Lots of good questions. I think we need to take some of our initial,
> private discussions and get them out on an open list somewhere. Any
> suggestions? I hate creating new sourceforge projects :)
I wonder if you could hook up with the linux vserver project. The
requirements are strongly similar, and making a solution that
would work for everyone has a better chance of getting in.
Eric
Arjan van de Ven <[email protected]> writes:
>> So, like in the global pidspace (which can see all pids and appears to
>> applications to be just like normal) you end up returning "kernel" pids
>> to userspace. That didn't seem to make sense.
>
> hmm this is scary. If you don't have "unique" pids inside the kernel a
> lot of stuff will subtly break. DRM for example (which has the pid
> inside locking to track ownership and recursion), but I'm sure there's
> many many cases like that. I guess the address of the task struct is the
> ultimate unique pid in this sense.... but I suspect the way to get there
> is first make a ->user_pid field, and switch all userspace visible stuff
> to that, and then try to get rid of ->pid users one by one by
> eliminating their uses...
>
> but I'm really afraid that if you make the "fake" pid visible to normal
> kernel code, too much stuff will go bonkers and end up with an eternal
> stream of security hazards. "Magic" hurts here, and if you don't do
> magic I don't see a reason to add an abstraction which in itself doesn't
> mean anything or doesn't abstract anything....
Thanks, you said that better than I did :)
Eric
On Wed, 2005-12-07 at 12:19 -0700, Eric W. Biederman wrote:
> Process groups are also pids, and there are direct relationships
> between pids and process group ids and session ids. I agree keeping
> the focus tight makes sense, but not so tight that you miss pieces.
There's a "reference implementation" that the kernel community hasn't
seen which is certainly not mergeable, but shows all of the pieces.
Personally, I really want to share it, and I hope that we can soon.
> >> At the current time the patch definitely fails the no in kernel
> >> users test because it doesn't go as far as taking advantage
> >> of the abstraction it attempts to introduce. Which means
> >> other people can't read through the code and make sense
> >> of what you are trying to do or to see if there is a better way.
> >
> > This isn't exactly a new feature, nor does it add any appreciable code
> > or complexity. I'm not sure that test even applies.
>
> A very common comment on the thread up to now is that people haven't
> seen the big picture so they can't comment.
Yup, this is a big issue. I think getting that "other code" out there
is part of filling you guys in. The other part is discussions like
this. :)
From your comments, I can see that you have a much bigger piece of the
picture in your head than you think.
> >> Another question is how do your pid spaces nest.
> >
> > They don't, and thankfully nobody is asking for it. It adds
> > loads of complexity, and nobody apparently needs it.
>
> So only very special pids can generate a pidspace. That will
> tend to reduce the generality of the solution. What do you do
> if I am running your code in a vserver?
I don't think it would be a good idea to stack these containers within
vservers, either. vserver uses different pidspaces and will use the
same infrastructure. The only difference is that they need only a very
small change to the pidspaces to handle init.
> >> Who do you report as the source of your signal.
> >
> > I've never dealt with signal enough from userspace to give you a good
> > answer. Can you explain the mechanics of how you would go about doing
> > this?
>
> Look at siginfo_t si_pid....
Are those things that are exported outside of the kernel? It's not
immediately obvious.
> >> While something allowing multiple pidspaces may be mergeable,
> >> unnecessary and incomplete changes rarely are. This is a fundamental
> >> change to the unix API so it will take a lot of scrutiny to get
> >> merged.
> >
> > Lots of good questions. I think we need to take some of our initial,
> > private discussions and get them out on an open list somewhere. Any
> > suggestions? I hate creating new sourceforge projects :)
>
> I wonder if you could hook up with the linux vserver project. The
> requirements are strongly similar, and making a solution that
> would work for everyone has a better chance of getting in.
Already hooked up. They need the same stuff we want, just in smaller
quantities. They can easily stack on top of whatever we do.
-- Dave
On Wed, 2005-12-07 at 20:00 +0100, Arjan van de Ven wrote:
> > So, like in the global pidspace (which can see all pids and appears to
> > applications to be just like normal) you end up returning "kernel" pids
> > to userspace. That didn't seem to make sense.
>
> hmm this is scary. If you don't have "unique" pids inside the kernel a
> lot of stuff will subtly break. DRM for example (which has the pid
> inside locking to track ownership and recursion), but I'm sure there's
> many many cases like that. I guess the address of the task struct is the
> ultimate unique pid in this sense.... but I suspect the way to get there
> is first make a ->user_pid field, and switch all userspace visible stuff
> to that, and then try to get rid of ->pid users one by one by
> eliminating their uses...
OK, what I'm talking about here is the way that it is done now with
existing code. It seems to work and make people happy, but it certainly
isn't the only possible way to do it. I'm very open to suggestions. :)
There really are two distinct pid spaces. Instead of vservers, we tend
to call the different partitioned areas containers.
Each container can only see processes in its own container. The
exception is the "global container", which has a view of all of the
system processes. Having the global container allows you to do things
like see all of the processes on the whole system with top.
So, the current tsk->pid is still unique. However, there is also a
tsk->virtual_pid (or some name) that is unique _inside_ of a container.
These two pids are completely unrelated. Having this virtualized pid
allows you to have the real tsk->pid change without userspace ever
knowing.
For example, that tsk->pid might change if you checkpointed a process,
it crashed, and you restarted it later from the checkpoint.
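In other words (the virtual_pid name is illustrative):

struct task_struct {
        /* ... */
        pid_t pid;              /* unique across the whole system */
        pid_t virtual_pid;      /* unique only within its container */
        /* ... */
};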
> but I'm really afraid that if you make the "fake" pid visible to normal
> kernel code, too much stuff will go bonkers and end up with an eternal
> stream of security hazards. "Magic" hurts here, and if you don't do
> magic I don't see a reason to add an abstraction which in itself doesn't
> mean anything or doesn't abstract anything....
99% of the time, the kernel can deal with the same old tsk->pid that
it's always dealt with. Generally, the only times the kernel has to
worry about the virtualized one is where (as Eric noted) it crosses the
user<->kernel boundary.
-- Dave
Eric W. Biederman wrote:
>>>Who do you report as the source of your signal.
>>
>>I've never dealt with signal enough from userspace to give you a good
>>answer. Can you explain the mechanics of how you would go about doing
>>this?
>
> Look at siginfo_t si_pid....
The siginfo is queued when a process is killed, and si_pid is filled in
using the pidspace of the killing process. Processes that parent a
pidspace are of a special kind: the init kind.
>>>What pid does waitpid return when the parent of your pidspace exits?
Well, a process doing waitpid on a parent of a pidspace is not part
of that pidspace, so waitpid would return the 'real pid'.
Am I getting your point correctly?
>>>What pid does waitpid return when both processes are in the same pidspace?
hmm, please elaborate.
There are indeed issues when a process is the parent of different
namespaces. This is a case that should be avoided.
>>>How does /proc handle multiple pid spaces?
>>
>>I'm working on it :)
>>
>>Right now, there's basically a hack in d_hash() to get new dentries for
>>each pidspace. It is horrible and causes a 50x decrease in performance
>>on some benchmarks like dbench.
>>
>>I think the long-term solution is to make multiple, independent proc
>>mounts, and give each pidspace a separate filesystem view. That
>>requires some of the nifty new bind mount functionality and a chroot
>>when a new pidspace is created, but I think it works.
>
> I think you will ultimately want a new filesystem namespace
> not just a chroot, so you can ``virtualize'' your filesystem namespace
> as well.
"virtualize" the mount points but not necessarily the whole filesystem.
> I wonder if you could hook up with the linux vserver project. The
> requirements are strongly similar, and making a solution that
> would work for everyone has a better chance of getting in.
We feel the same.
C.
Dave Hansen <[email protected]> writes:
> On Wed, 2005-12-07 at 12:19 -0700, Eric W. Biederman wrote:
>> >> Another question is how do your pid spaces nest.
>> >
>> > They don't, and thankfully nobody is asking for it. It adds
>> > loads of complexity, and nobody apparently needs it.
>>
>> So only very special pids can generate a pidspace. That will
>> tend to reduce the generality of the solution. What do you do
>> if I am running your code in a vserver?
>
> I don't think it would be a good idea to stack these containers within
> vservers, either. vserver uses different pidspaces, and will use the
> same infrastructure. The only difference is that they only have a very
> small change to the different pidspaces for init.
Well that depends on the implementation. The first concern with
the implementation is of course maintainability.
But beyond that a general test to see if you have done a good
job of virtualizing something is to see if you can recurse.
One of my wish list items would be to run things like my
web browser in a container with only access to a subset of
the things I can normally access. That way I could be less
concerned about the latest browser security bug.
Although I do expect that, just like private namespaces, it will
take a while to figure out how to allow non-privileged access
to these kinds of powerful concepts.
In the bsdjail paper the point is made that as systems grow
more complex, creating minimal-privilege containers is easy
and simple compared to what it takes to get a complex system
up and going today. (I expressed that badly.)
And then of course there is the other pipe dream: if you can
put the whole system in a container, then you can implement
the equivalent of swsuspend by checkpointing the top-level
container.
At least this should solve the classic complaint about job
control: that it wasn't transparent to processes.
>> >> Who do you report as the source of your signal.
>> >
>> > I've never dealt with signal enough from userspace to give you a good
>> > answer. Can you explain the mechanics of how you would go about doing
>> > this?
>>
>> Look at siginfo_t si_pid....
>
> Are those things that are exported outside of the kernel? It's not
> immediately obvious.
Sorry, do a man sigaction. Basically the signal handler
needs to be configured with SA_SIGINFO, but then it should get
that information; I believe you have to request it explicitly.
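A toy receiver (printf in a handler is not async-signal-safe, so this
is illustration only):

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void handler(int sig, siginfo_t *info, void *ctx)
{
        (void)ctx;
        /* si_pid identifies the sender -- the value a pidspace
         * implementation would have to translate. */
        printf("signal %d from pid %d\n", sig, (int)info->si_pid);
}

int main(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGUSR1, &sa, NULL);
        pause();
        return 0;
}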
Eric
> 99% of the time, the kernel can deal with the same old tsk->pid that
> it's always dealt with. Generally, the only times the kernel has to
> worry about the virtualized one is where (as Eric noted) it cross the
> user<->kernel boundary.
That's fair enough. I don't see the need for the macro abstractions
though; a current->pid and current->user_pid (or visible_pid, or any
other good name) split makes sense. No need for macro abstractions at
all: just add ->user_pid in patch 1, have patch 2 assign it by default
from ->pid, and have patch 3 convert the places where ->pid is now given
to userspace ;)
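Patch 2 would conceptually be a one-liner wherever the pid gets
assigned at fork (exact spot hypothetical):

        p->user_pid = p->pid;   /* identical until something virtualizes it */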
Again the DRM layer needs an audit; I'm not entirely sure that it doesn't
get pids from userspace. The rest of the kernel mostly ought to cope
just fine.
On Wed, 2005-12-07 at 15:17 -0700, Eric W. Biederman wrote:
> But beyond that a general test to see if you have done a good
> job of virtualizing something is to see if you can recurse.
I admit it would be interesting at the very least. But, using that
definition, we haven't done any good virtualization in Linux that I can
think of. Besides some vague ranting I heard about zSeries (the real
IBM mainframes) I can't think of anything that does this today.
I don't think any of Solaris containers, ppc64 LPARs, Xen, UML, or
vservers can recurse.
Can you think of any?
-- Dave
Dave Hansen <[email protected]> writes:
> On Wed, 2005-12-07 at 15:17 -0700, Eric W. Biederman wrote:
>> But beyond that a general test to see if you have done a good
>> job of virtualizing something is to see if you can recurse.
>
> I admit it would be interesting at the very least. But, using that
> definition, we haven't done any good virtualization in Linux that I can
> think of. Besides some vague ranting I heard about zSeries (the real
> IBM mainframes) I can't think of anything that does this today.
>
> I don't think any of Solaris containers, ppc64 LPARs, Xen, UML, or
> vservers can recurse.
>
> Can you think of any?
There is Xnest that allows X to run on X.
There are process groups and sessions, which, while they may
not strictly nest, don't make you lose the ability to create new
ones.
There is the CLONE_NEWNS and just about any of the other
clone flags in linux.
There is bochs that emulates the whole machine.
I am actually a little surprised that UML can't run UML. I
suspect it is an address space conflict and not something fundamental.
With pidspaces, as long as the parent isn't required to send
signals to arbitrary children, I don't think nesting pidspaces
is hard. Or, more properly, having a process in one pidspace be
the parent of a process in another. Although I grant there
are a few boundary issues that have to be handled carefully.
Eric
On Wed, Dec 07, 2005 at 02:31:25PM -0800, Dave Hansen wrote:
> I don think any of Solaris containers, ppc64 LPARs, Xen, UML, or
> vservers can recurse.
UML can, but it's not a heavily exercised option, and it needs some fixes in
order to recurse in the currently favored mode of operation.
Jeff
Dave Hansen <[email protected]> writes:
>
> Can you think of any?
qemu can afaik. I've also heard about simnow in qemu and
Xen in qemu, although that's not true recursion. And VMware/qemu/
simnow/UML/... will all probably run fine in Xen native guests.
I wouldn't be surprised if UML supported true recursion too.
But then for what do you really need recursion? It might be nice
theory, but in practice it's probably not too relevant. I guess it
was useful long ago for debugging VM itself when mainframes were
really expensive so you couldn't just buy a development machine and
test your VM on raw iron. But that's not really true today anymore.
Ok one weak reason to still use it might be if your test machine takes
too long to reboot. But then Hypervisor hackers are a pretty narrow
target group for features like this.
-Andi
>
> Again the DRM layer needs an audit; I'm not entirely sure that it doesn't
> get pids from userspace. The rest of the kernel mostly ought to cope
> just fine.
>
Yes yet again, if you can think of it, the drm will have found a way
to do it :-)
the drmGetClient ioctl passes pids across the user/kernel boundary;
it's the only place I can see in a quick look at the interfaces... but
it isn't used for anything as far as I can see, except for the dristat
testing utility.
Dave.
Hi!
> One of my wish list items would be to run things like my
> web browser in a container with only access to a subset of
> the things I can normally access. That way I could be less
> concerned about the latest browser security bug.
subterfugue.sf.net (using ptrace), but yes, nicer solution
would be welcome.
--
Thanks, Sharp!
On Tue, 2004-12-14 at 15:23 +0000, Pavel Machek wrote:
> Hi!
>
> > One of my wish list items would be to run things like my
> > web browser in a container with only access to a subset of
> > the things I can normally access. That way I could be less
> > concerned about the latest browser security bug.
>
> subterfugue.sf.net (using ptrace), but yes, nicer solution
> would be welcome.
selinux too, as well as andrea's syscall filter thing and many others.
the hardest is the balance between security and usability. You don't
want your browser to be able to read files in your home dir (except
maybe a few selected ones in the browser's own dir)... until you want to
upload a file via a webform.
Quoting Arjan van de Ven ([email protected]):
> On Tue, 2004-12-14 at 15:23 +0000, Pavel Machek wrote:
> > Hi!
> >
> > > One of my wish list items would be to run things like my
> > > web browser in a container with only access to a subset of
> > > the things I can normally access. That way I could be less
> > > concerned about the latest browser security bug.
> >
> > subterfugue.sf.net (using ptrace), but yes, nicer solution
> > would be welcome.
>
> selinux too, as well as andrea's syscall filter thing and many others.
>
> the hardest is the balance between security and usability. You don't
> want your browser to be able to read files in your home dir (except
> maybe a few selected ones in the browser's own dir)... until you want to
> upload a file via a webform.
Yup, right now I use a separate account (not in wheel) for web browsing
which, using Janak's unshare() patch and a small PAM library, gets its own
namespace that can't see my dm-crypted home partition and has a private
/tmp. File sharing is done through a non-standard tmp, just to prevent
scripts from using it.
Pretty convenient, but it really wants some stronger isolation. You'd
think I'd at least use my bsdjail to keep unix sockets and such safe...
Anyway, real containers would indeed be far more convenient, or at least
prettier.
-serge
Dave Airlie <[email protected]> writes:
>>
>> again the DRM layer needs an audit, I'm not entirely sure if it doesn't
>> get pids from userspace. THe rest of the kernel mostly ought to cope
>> just fine.
>>
>
> Yes yet again, if you can think of it, the drm will have found a way
> to do it :-)
>
> the drmGetClient ioctl passes pids across the user/kernel boundary,
> its the only place I can see in a quick look at the interfaces.... but
> it isn't used for anything as far as I can see except for the dristat
> testing utility..
There are crazier cases in the kernel. Netlink is my favorite example:
its default port is the process pid, and some locations in the kernel
even assume you are using the default port.
Eric