(cc'ing lkml too)
Hello,
On 11/02/2010 08:30 PM, Oren Laadan wrote:
> Following the discussion yesterday, here is a linux-cr diff that
> is limited to changes to existing code.
>
> The diff doesn't include the eclone() patches. I also tried to strip
> off the new c/r code (either code in new files, or new code within
> #ifdef CONFIG_CHECKPOINT in existing files).
>
> I left a few such snippets in, e.g. the c/r syscall templates and
> declarations of c/r-specific methods in, e.g., file_operations.
>
> The remaining changes in this patch include a new freezer state
> ("CHECKPOINTING"), mostly refactoring of existing code, and a bit
> of new helpers.
>
> Disclaimer: don't try to compile (or apply) - this is only intended
> to give a ballpark of how the c/r patches change existing code.
The patch size itself isn't too big, but I still think it's one scary
patch, mostly because of the breadth of code that checkpointing needs
to modify, and I suspect that is probably the biggest concern regarding
checkpoint-restart from an implementation point of view.
FWIW, I'm not quite convinced checkpoint-restart can be generally
useful. In controlled environments where the target application's
behavior can be relatively well defined and contained (including the
actions necessary to roll back in case something goes bonkers), it
would work and can be quite useful, but I'm afraid the states which
need to be saved and restored aren't defined well enough to be
generally applicable. Not only is it a difficult problem, it is
actually impossible to define a common set of states to be saved and
restored - it depends on each application.
As such, I have a difficult time believing it can be something
generally useful. IOW, I think talking about its usage in complex
environments like common desktops is mostly handwaving. What about X
sessions, network connections, states established in other
applications via dbus or whatnot? Which files need to be snapshotted
together? What about shared mmaps? These questions are not difficult
to answer in a generic way; they are impossible.
There is a very distinct difference between system-wide
suspend/hibernation and process checkpointing. Most programs are
already written with the conditions that can be caused by system-level
suspend/hibernation in mind. Most programs don't expect to be
scheduled and run in any definite amount of time. There usually are
provisions for loss or failure of resources which are outside the
local system. There are corner cases which are affected, and those
programs contain code to respond to suspend/hibernation. Please note
that this is about userland application behavior, not an
implementation detail in the kernel. It is a much more fundamental
property.
So, although checkpoint-restart can be very useful in certain
circumstances, I don't believe there can be a general implementation.
It inevitably needs to put somewhat strict restrictions on what the
applications being checkpointed are allowed to do. And after my
train of thought reaches there, I fail to see what the advantages of
an in-kernel implementation would be compared to something like the
following.
http://dmtcp.sourceforge.net/
Sure, an in-kernel implementation would be able to fake it better,
but I don't think it's anything major. The coverage would be slightly
better, but breaking the illusion wouldn't take much. Just push it a
bit further and it will break all the same. In addition, to be
useful, it would need a userland framework or set of workarounds which
are aware of and can manipulate userland states anyway. For workloads
for which checkpointing would be most beneficial (HPC for example), I
think something like the above would do just fine, and it would make
much more sense to add small features to make userland checkpointing
work better than to do the whole thing in the kernel.
I think in-kernel checkpointing is in an awkward place in terms of
the tradeoff between its benefits and the complexity of implementing
it. If you give up coverage slightly, userland checkpointing is
there. If you need reliable coverage, proper virtualization isn't too
far away. As such, FWIW, I fail to see enough justification for the
added complexity. I'll be happy to be proven wrong tho. :-)
Thank you.
--
tejun
Thanks Tejun,
your writeup brought up a lot of the same issues that I see with
the in-kernel C/R. Various C/R implementations that are entirely
in userspace or with limited kernel assistance have been in production
in HPC environments for years. I think especially for these workloads
C/R is an extremely useful feature, and a standard implementation would
serve Linux well.
But I think the "transparent" in-kernel one is the wrong approach. It
tries to give the illusion that C/R will just work, while a lot of
things are simply not supported. In this case whitelisting the allowed
state by requiring special APIs for all I/O (or even just standard
APIs, as long as they are supported by the C/R lib you're linked
against) is the more pragmatic and, I think, more faithful approach.
In addition to the amount of state not supported despite looking
transparent, the other big problem with the patchset is that it saves
kernel-internal state which changes all the time from one release to
another. The handwaving is that a userspace tool will solve it. I'm
pretty sure that's not the case; it might solve a few cases, but the
general version n to version m conversion is impossible to maintain.
Just look at the problems qemu has migrating between just a handful of
versions of the relatively well defined (compared to random kernel
state) vmstate format.
On Tue, 2010-11-02 at 22:47 +0100, Christoph Hellwig wrote:
> Thanks Tejun,
>
> your writeup brought up a lot of the same issues that I see with
> the in-kernel C/R. Various C/R implementations that are entirely
> in userspace or with limited kernel assistance have been in production
> in HPC environments for years.
FWIW there are a couple of kernel-based C/R implementations (BLCR,
OpenVZ) in use in various contexts (not just HPC).
> I think especially for these workloads
> C/R is an extremely useful feature, and a standard implementation would
> serve Linux well.
>
> But I think the "transparent" in-kernel one is the wrong approach. It
> tries to give the illusion that C/R will just work, while a lot of
> things are simply not supported.
I think this is somewhat true of the implementation under consideration
here (although generally it should fail checkpoints that it can't
restart), but it needn't be true of all possible kernel-based
implementations.
> In this case whitelisting the allowed
> state by requiring special APIs for all I/O (or even just standard
> APIs, as long as they are supported by the C/R lib you're linked
> against) is the more pragmatic and, I think, more faithful approach.
I don't think users will go for it. They'll continue to use dodgy
out-of-tree kernel modules and/or LD_PRELOAD hacks instead of porting
their applications to a new library. I think a C/R library is an
"ideal" solution, but it's one that nobody would use - especially in
HPC, unless the library somehow provides better performance.
The namespace/isolation features of Linux (CLONE_NEWPID et al) already
provide a pretty workable basis for creating tractably checkpoint-
and-restartable jobs, with a minimum of performance overhead and
application modification.
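To make that concrete, here is a minimal sketch (illustrative only, not
code from the patchset) of boxing a job into its own pid namespace with
clone(). Inside the namespace the job is pid 1, so the pid numbering is
self-contained and can be reproduced at restart time:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static char stack[1024 * 1024];

static int job(void *arg)
{
        /* Inside the new namespace this process is pid 1, so any pids
         * the job records stay meaningful when the whole tree is
         * restarted into a fresh namespace. */
        printf("job sees itself as pid %d\n", (int)getpid());
        execlp("sleep", "sleep", "60", (char *)NULL);
        return 1;
}

int main(void)
{
        pid_t pid = clone(job, stack + sizeof(stack),
                          CLONE_NEWPID | SIGCHLD, NULL);

        if (pid < 0) {
                perror("clone");        /* needs CAP_SYS_ADMIN */
                return 1;
        }
        printf("job is pid %d in the parent namespace\n", (int)pid);
        waitpid(pid, NULL, 0);
        return 0;
}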
> In addition to
> the amount of state not supported despite looking transparent, the
> other big problem with the patchset is that it saves kernel-internal
> state which changes all the time from one release to another.
Most of the objects that the patchset saves and restores are right at
the "border" of the user/kernel interface, and they're not apt to
change quickly (e.g. vma start and end, task sigaltstack info). The
patchset certainly isn't serializing deep internal state such as wait
queues, locks, or reference counts.
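For illustration only (this is a made-up layout, not the patchset's
actual record format), a saved memory mapping might look roughly like
the struct below; note that everything in it is information userspace
can already see via /proc/<pid>/maps:

#include <stdint.h>

/* Hypothetical on-disk record for one vma. */
struct ckpt_vma_rec {
        uint64_t vm_start;      /* start address of the mapping    */
        uint64_t vm_end;        /* end address                     */
        uint64_t pgoff;         /* file offset for file mappings   */
        uint32_t prot;          /* PROT_READ/WRITE/EXEC bits       */
        uint32_t flags;         /* MAP_SHARED/MAP_PRIVATE etc.     */
};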
> The handwaving is that a userspace tool will solve it. I'm pretty sure
> that's not the case; it might solve a few cases, but the general
> version n to version m conversion is impossible to maintain.
With this I agree, though. But if a change in kernel implementation
details forces an incompatible change in the checkpoint image format, is
that really a big deal? Would it be so bad to say that a checkpoint
image may be restarted only on the same kernel version that created it?
With -stable or enterprise kernels I suspect the issue is unlikely to
come up.
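A hypothetical sketch of that policy (not the patchset's actual header
format): stamp the image with the creating kernel's release string and
have restart refuse anything else:

#include <stdint.h>
#include <string.h>
#include <sys/utsname.h>

struct ckpt_img_hdr {
        uint32_t magic;                 /* identifies a checkpoint image  */
        uint32_t format_version;        /* bumped on incompatible changes */
        char     kernel_release[65];    /* uname -r of the creating kernel */
};

/* Restart-side check: accept images from this very kernel only. */
int image_usable(const struct ckpt_img_hdr *h)
{
        struct utsname u;

        if (uname(&u) < 0)
                return 0;
        return h->magic == 0x43504b54 /* "CPKT" */ &&
               h->format_version == 1 &&
               strncmp(h->kernel_release, u.release,
                       sizeof(h->kernel_release)) == 0;
}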
(Sorry for resending the message; the last message contained some html
tags and was rejected by the server.)
We would like to thank the previous posters for bringing up the topic
of kernel C/R versus userland C/R. We are two of the developers of DMTCP
(userland checkpointing): Distributed MultiThreaded CheckPointing.
http://dmtcp.sourceforge.net
We had waited to write to this list because we wanted to ensure that
DMTCP was sufficiently robust before taking up the kernel developers'
time. This thread seems like a good opportunity to begin a dialogue.
In fact, we only became aware of Linux kernel C/R this September.
Of course, we were aware of Oren Laadan's fine earlier work on ZapC
for distributed checkpointing using the Linux kernel (CLUSTER-2005).
We have great respect for Oren Laadan and the other Linux C/R developers,
as well as for the developers of BLCR (a C/R kernel module with a userland
component that is widely used in HPC batch facilities).
By coincidence, when we became aware of Linux C/R, we were already in
the middle of development for a major new release of DMTCP (from version
1.1.x to 1.2.0). We just finished that release. Among other features,
this release supports checkpointing of GNU 'screen', and we have tested
screen in some common use cases (with vim, with emacs, etc.). While it
supports ssh (e.g. checkpointing OpenMPI, which uses ssh), it doesn't yet
support _interactive_ ssh sessions. That will come in the next release.
We believe that both Linux C/R and DMTCP are becoming quite mature, and
that in general, one can achieve good application coverage with either.
In our personal view, a key difference between in-kernel and userland
approaches is the issue of security. The Linux C/R developers state
the issue very well in their FAQ (question number 7):
> https://ckpt.wiki.kernel.org/index.php/Faq :
> 7. Can non-root users checkpoint/restart an application?
>
> For now, only users with CAP_SYS_ADMIN privileges can C/R an
> application. This is to ensure that the checkpoint image has not been
> tampered with, and it will be treated like a loadable kernel module.
The previous posts also brought up the issue of external connections.
While DMTCP has been developed over six years, in the last year we
have concentrated especially on this issue.
While we've accumulated many war stories, one will illustrate the point.
Most Linux distros link vi to vim. Vim supports mouse and other operations
via the X11 server. When vim starts up, it connects to the X11
server (which may be local, or remote if ssh uses X11 forwarding).
On transparent checkpoint and restart, vim expects to continue
talking to the X11 server. Currently, DMTCP recognizes such
X11 server connections and refuses them. Vim still survives without
its mouse and other X11 services. For the future, we are considering
a more flexible approach that will take account of the X11 protocol.
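As a simplified illustration of the kind of check involved (a sketch,
not our actual code), a checkpointer can classify a descriptor as an
X11 connection by looking at the unix-domain peer path that local X
servers listen on:

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Local X servers listen on /tmp/.X11-unix/X<display>; a connected
 * descriptor whose peer lives there is talking to the X server. */
int is_x11_connection(int fd)
{
        struct sockaddr_un peer;
        socklen_t len = sizeof(peer);

        memset(&peer, 0, sizeof(peer));
        if (getpeername(fd, (struct sockaddr *)&peer, &len) < 0)
                return 0;
        return peer.sun_family == AF_UNIX &&
               strncmp(peer.sun_path, "/tmp/.X11-unix/X", 16) == 0;
}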
Strategies like these are easily handled in userspace. We suspect
that while one may begin with a pure kernel approach, eventually,
one will still want to add a userland component to achieve this kind
of flexibility, just as BLCR has already done.
Best wishes,
- Gene Cooperman and Kapil Arya
from the DMTCP team
Hi,
(disclaimer: you may want to grab a cup of your favorite coffee)
On 11/02/2010 05:35 PM, Tejun Heo wrote:
> (cc'ing lkml too)
> Hello,
>
> On 11/02/2010 08:30 PM, Oren Laadan wrote:
>> Following the discussion yesterday, here is a linux-cr diff that
>> is limited to changes to existing code.
>> The diff doesn't include the eclone() patches. I also tried to strip
>> off the new c/r code (either code in new files, or new code within
>> #ifdef CONFIG_CHECKPOINT in existing files).
>>
>> I left a few such snippets in, e.g. the c/r syscall templates and
>> declarations of c/r-specific methods in, e.g., file_operations.
>>
>> The remaining changes in this patch include a new freezer state
>> ("CHECKPOINTING"), mostly refactoring of existing code, and a bit
>> of new helpers.
>>
>> Disclaimer: don't try to compile (or apply) - this is only intended
>> to give a ballpark of how the c/r patches change existing code.
>
> The patch size itself isn't too big, but I still think it's one scary
> patch, mostly because of the breadth of code that checkpointing needs
> to modify, and I suspect that is probably the biggest concern regarding
> checkpoint-restart from an implementation point of view.
I agree, it *looks* scary. But that's mostly because it's a dumb
diff out of context, rather than a standard "patch" as a set of
logical incremental changes. So posting this diff is probably the
worst way to present the impact on existing code. It merely gives
a ballpark sense of it.
However, please keep in mind that this diff is really an aggregate
of multiple unrelated, structured, small changes, including:
- cleanups (e.g. x86 ptrace)
- refactoring (e.g. ipc, eventpoll, user-ns)
- new features/enhancements (e.g. splice, freezer, mm)
I'm confident that each of these will make more sense when presented
in the proper context.
>
> FWIW, I'm not quite convinced checkpoint-restart can be generally
In the ksummit presentation I gave an extensive list of real
use-cases (existing and future). The slides are here:
http://www.cs.columbia.edu/~orenl/talks/ksummit-2010.pdf
For more technical details there is also the OLS-2010 paper here:
http://www.cs.columbia.edu/~orenl/papers/ols2010-linuxcr.pdf
the presentation slides from there are here:
http://www.cs.columbia.edu/~orenl/talks/ols2010-linuxcr.pdf
> useful. In controlled environments where the target application's
> behavior can be relatively well defined and contained (including the
> actions necessary to roll back in case something goes bonkers), it
> would work and can be quite useful, but I'm afraid the states which
> need to be saved and restored aren't defined well enough to be
> generally applicable. Not only is it a difficult problem, it is
> actually impossible to define a common set of states to be saved and
> restored - it depends on each application.
I'm unsure which states you have in mind that will not be well defined.
It is a difficult problem, and C/R has limitations, but I think we've
got it pretty right this time :)
* we save and restore *all* *execution* state of the applications
(except for well-defined unsupported features; hardware devices
are one such example).
* we don't save FS state (use filesystem snapshots for that); but
we do save runtime FS state (e.g. open files).
* we don't save the state of peers (applications/systems) over the
network; but we do save network connections for proper live-migration.
(Of course, there is a supporting userspace ecosystem, like utilities
to do the checkpoint/restart, to freeze/thaw the application, to
snapshot the filesystem, etc.)
So unless an application uses an unsupported resource, it will be
possible to checkpoint that application and restart it successfully.
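In rough strokes, usage looks like the sketch below (simplified: the
syscall numbers are placeholders, and the exact signatures are the
ones defined by the patchset):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_checkpoint 333     /* placeholder number */
#define __NR_restart    334     /* placeholder number */

int main(void)
{
        pid_t pid = 1234;       /* root of an already-frozen task tree */
        int fd = open("job.img", O_CREAT | O_TRUNC | O_WRONLY, 0600);

        if (fd < 0 || syscall(__NR_checkpoint, pid, fd, 0UL) < 0) {
                perror("checkpoint");
                return 1;
        }
        close(fd);
        /* Later, possibly on another host (after the filesystem
         * snapshot has been restored):
         *     fd = open("job.img", O_RDONLY);
         *     syscall(__NR_restart, pid, fd, 0UL);
         */
        return 0;
}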
>
> As such, I have a difficult time believing it can be something
> generally useful. IOW, I think talking about its usage in complex
> environments like common desktops is mostly handwaving. What about X
> sessions, network connections, states established in other
> applications via dbus or whatnot? Which files need to be snapshotted
> together? What about shared mmaps? These questions are not difficult
> to answer in a generic way; they are impossible.
I have a cool demo (and I gave one today!) that shows how I run one
desktop session and restart an older desktop session that then runs
in parallel to my existing session, in another window -> so I have
both the current and the older session running side by side. (It's a
version of C/R as a kernel module for an older kernel; we're not yet
there with linux-cr.) Hand-waving? Maybe, but a pretty convincing
one ;)
To be clear, C/R is more generic than saving/restoring a single
process: rather, it works on process hierarchies (and complete
containers). So a checkpoint will typically capture the state of e.g.
a VNC server (X session) and the applications (xterm, window manager,
etc.), and the dbus daemon, and all their open files, sockets, etc.
(BTW, if you were to live-migrate that X session to another host,
we'd save the TCP state as well; otherwise, we save the sockets in
CLOSED state - analogous to what happens when your applications run
again after the laptop was suspended for a long time.)
Likewise, in my demo, files are not snapshotted independently. Instead,
the entire file system is snapshotted at once.
Bottom line - it's simpler than it sounds. Let's compare this to
the save/restore of an entire VM: in a VM you bundle all the state
inside as a single big package (and this makes life much easier).
Likewise, in C/R, we bundle all the necessary processes, e.g. an
entire container, in a single big package - we pack all the data
necessary to make the checkpoint self-sufficient.
>
> There is a very distinct difference between system-wide
> suspend/hibernation and process checkpointing. Most programs are
> already written with the conditions that can be caused by system-level
> suspend/hibernation in mind. Most programs don't expect to be
> scheduled and run in any definite amount of time. There usually are
> provisions for loss or failure of resources which are outside the
> local system. There are corner cases which are affected, and those
> programs contain code to respond to suspend/hibernation. Please note
> that this is about userland application behavior, not an
> implementation detail in the kernel. It is a much more fundamental
> property.
Exactly. This means that the same applications would not be upset
after they are checkpointed/restarted, for the exact same reason -
they know how to "recover" from that. For instance, firefox will
re-establish its network connection to the web server.
C/R is as *transparent* as suspend/hibernation. Applications will
normally not be able to tell the difference between having just
experienced a suspend/hibernation and a checkpoint/restart.
> So, although checkpoint-restart can be very useful in certain
> circumstances, I don't believe there can be a general implementation.
> It inevitably needs to put somewhat strict restrictions on what the
> applications being checkpointed are allowed to do. And after my
Let me try to rephrase: there are restrictions on what applications
can do if they are to be successfully checkpointed. Examples:
* tasks that use hardware devices (e.g. a sound card),
* tasks that use unsupported sockets (e.g. netlink),
* tasks that use yet-unsupported features (e.g. ptraced tasks).
That said, I'm quite confident that the set of features we support
(now or within easy reach) already covers a wide range of real
applications and use-cases.
> train of thought reaches there, I fail to see what the advantages of
> an in-kernel implementation would be compared to something like the
> following.
>
> http://dmtcp.sourceforge.net/
>
> Sure, an in-kernel implementation would be able to fake it better, but I
> don't think it's anything major. The coverage would be slightly
> better, but breaking the illusion wouldn't take much. Just push it a
> bit further and it will break all the same. In addition, to be
I beg to differ.
DMTCP is indeed a very cool project. It's based on MTCP, a userspace
C/R tool, and as such, is restricted like all userspace implementations.
That is not to say that it isn't useful, but it is limited in what it
can do.
It is not my intention to bash their great work, but it's important to
understand its limitations, so just a few examples:
* Transparency: their paper says that it's required to link against
their library, or to modify the binary; they overload some signals
(so the application can't use them).
* Completeness: many real resources are not supported, e.g. eventpoll,
ipc, pending signals, etc.
* Complexity: they technically implement a virtual pid-namespace in
userspace by intercepting calls to clone(). I wonder if they consider
e.g. pids saved in file owners or in af_unix creds? I'll just say
it's nearly impossible with their 20K lines of code - I know because
I did it in a kernel module ...
* Efficiency: from userspace it can't tell which mapped pages are dirty
and which aren't, not to mention doing incremental checkpoints.
* Usefulness: can they live-migrate a mysql server between two hosts
prior to a kernel upgrade? Can they checkpoint stopped processes
which cannot cooperate? Can they checkpoint/restart postgresql?
In contrast, the kernel C/R:
* is much more complete and feature-rich,
* is entirely transparent to applications (it does not need their
cooperation, and can even handle debugged tasks),
* can be highly optimized and do incremental c/r,
* can do live migration,
* is easier to maintain in the long run (because you don't need to cheat
applications by intercepting their kernel calls from userspace!),
* is flexible enough to allow smart userspace to also be c/r aware, if
it so wishes,
* can provide a guarantee that a checkpoint is self-contained and can
be later restarted.
In fact, DMTCP would be much more useful if it built on linux-cr
as its checkpoint-restart engine ;)
> useful, it would need a userland framework or set of workarounds which
> are aware of and can manipulate userland states anyway. For workloads
What user space "state" needs to be worked around and manipulated?
If you are referring to the file system - then a snapshot is necessary
with either method, userspace or kernel. If something else, please
elaborate.
> for which checkpointing would be most beneficial (HPC for example), I
> think something like the above would do just fine, and it would make
> much more sense to add small features to make userland checkpointing
> work better than to do the whole thing in the kernel.
Actually, because of the huge optimization potential that exists only
in kernel-based C/R, HPC applications are likely to benefit
tremendously from it too. Think about things like incremental
checkpoints, pre-copy to minimize downtime (as in live-migration), or
using COW to defer disk IO until after the application has resumed
execution, and more. None of these is possible with userspace C/R.
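The pre-copy control loop, for example, is easy to sketch; the two
helpers below are hypothetical stand-ins for page-table-backed kernel
functionality that userspace simply does not have:

#define SMALL_ENOUGH 64 /* pages we accept copying while frozen */

/* Hypothetical stand-ins: in-kernel C/R can implement these from the
 * page tables; these stubs only make the sketch compile. */
static size_t copy_and_clear_dirty(int img_fd) { (void)img_fd; return 0; }
static void freeze_job(int frozen) { (void)frozen; }

void precopy_checkpoint(int img_fd)
{
        /* The first pass copies everything; each later pass copies only
         * what the still-running job re-dirtied, so the passes shrink. */
        while (copy_and_clear_dirty(img_fd) > SMALL_ENOUGH)
                ;
        freeze_job(1);                  /* downtime starts here...      */
        copy_and_clear_dirty(img_fd);   /* ...and covers only a small
                                         * residue of dirty pages       */
        freeze_job(0);
}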
I know of several places that do not use C/R because they can't
stop their long-running processes for longer than a few milliseconds.
I know how to solve their problems with linux-cr. I doubt any
userspace mechanism can get there.
> I think in-kernel checkpointing is in an awkward place in terms of
> the tradeoff between its benefits and the complexity of implementing
> it. If you give up coverage slightly, userland checkpointing is
> there. If you need reliable coverage, proper virtualization isn't too
> far away. As such, FWIW, I fail to see enough justification for the
> added complexity. I'll be happy to be proven wrong tho. :-)
There is a huge gap between what you can (and want to) do with
a userspace checkpoint-restart implementation and with a kernel one.
Linux can profit from this feature along multiple axes: the HPC
market, VPS solutions, desktop mobility, and much more.
I think the added complexity is more than manageable. If you take
a look at the patch-set (http://www.linux-cr.org/git) you'll see
that most of the code is straightforward, just full of details,
and largely tangential to the existing kernel code. The changes
seen in this "naked" diff make more sense when they appear orderly
in the context of that logic.
We have shown that the mission is within reach and that C/R can be
more than a toy implementation. To reduce the complexity of
*reviewing*, it's time to post the patch-set in small pieces that one
can digest ...
Thanks,
Oren.
Hi Christoph,
I really wish you had raised these concerns during the
ksummit or thereafter. I'm here (LPC) until Friday, and would be
happy to discuss any aspect of linux-cr while at it (and if
needed can post a summary to the list).
On 11/02/2010 05:47 PM, Christoph Hellwig wrote:
> Thanks Tejun,
>
> your writeup brought up a lot of the same issues that I see with
> the in-kernel C/R. Various C/R implementations that are entirely
> in userspace or with limited kernel assistance have been in production
> in HPC environments for years. I think especially for these workloads
> C/R is an extremely useful feature, and a standard implementation would
> serve Linux well.
>
> But I think the "transparent" in-kernel one is the wrong approach. It
> tries to give the illusion that C/R will just work, while a lot of
> things are simply not supported.
The fact is that an in-kernel implementation can and does support
a significantly larger feature-set.
Linux-cr does not and will not support everything. Nearly all device
drivers won't be supported in the near future (but interested vendors
could build such functionality into their drivers!). Also, pseudo
file systems like sysfs, procfs, and debugfs will at most get partial
support.
But apart from that, it really covers (or soon will) nearly everything.
So I do wonder what, concretely, the "a lot of things" are?
> In this case whitelisting the allowed
> state by requiring special APIs for all I/O (or even just standard
> APIs, as long as they are supported by the C/R lib you're linked
> against) is the more pragmatic and, I think, more faithful approach.
> In addition to the amount of state not supported despite looking
> transparent, the
"Transparent" means that applications don't know that they are
being checkpointed, nor do they need to cooperate. So linux-cr
is *completely* transparent to applications that are checkpointable.
Perhaps you can elaborate on the "state not supported despite
looking transparent" - beyond what I mentioned above?
> other big problem with the patchset is that it saves kernel-internal
> state which changes all the time from one release to another.
It is our experience that the format is pretty immune to changes
to in-kernel (not user/ABI-visible) structures.
It mainly changes when we add new features - and I expect that
to happen less frequently once the patchset finds its way into
the mainline.
> The
> handwaving is that a userspace tool will solve it. I'm pretty sure
> that's not the case; it might solve a few cases, but the general
> version n to version m conversion is impossible to maintain. Just look
> at the problems qemu has migrating between just a handful of versions
> of the relatively well defined (compared to random kernel state)
> vmstate format.
The problem space is smaller, because we are aiming at a simpler
goal. We need to always know how to convert from version N to
version N+1. Conversion from N to N+k is then a series of these
conversions.
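Schematically (a toy sketch over an in-memory image; the real
converters would rewrite records in the image stream):

#include <stddef.h>

struct image { int version; /* + payload records */ };

typedef void (*step_fn)(struct image *);

/* One converter per format bump: entry N rewrites version N as N+1. */
static void step_1_to_2(struct image *img) { /* rewrite records */ img->version = 2; }
static void step_2_to_3(struct image *img) { /* rewrite records */ img->version = 3; }
static step_fn step[] = { NULL, step_1_to_2, step_2_to_3 };

/* Bring an image up to version 'want' by chaining single steps. */
void upgrade(struct image *img, int want)
{
        while (img->version < want)
                step[img->version](img);
}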
QEMU has a broader goal: IIUC, the QEMU and KVM versions may
change independently; they are not tied to each other. So the problem
is harder. In linux-cr, the format is tied to the objects known to
the kernel that writes/reads the data. That makes things much
simpler.
Oren.
Hello,
On 11/04/2010 02:47 AM, Nathan Lynch wrote:
>> In this case whitelisting the allowed
>> state by requiring special APIs for all I/O (or even just standard
>> APIs, as long as they are supported by the C/R lib you're linked
>> against) is the more pragmatic and, I think, more faithful approach.
>
> I don't think users will go for it. They'll continue to use dodgy
> out-of-tree kernel modules and/or LD_PRELOAD hacks instead of porting
> their applications to a new library. I think a C/R library is an
> "ideal" solution, but it's one that nobody would use - especially in
> HPC, unless the library somehow provides better performance.
I hear that there are plans to integrate one of the userland
snapshotting implementations with an HPC workload manager. ISTR the
combination being condor + dmtcp, but I'm not sure. I think things
like that make a lot of sense. Scientists writing programs for HPC
clusters already work within given frameworks, and what those
applications do and how to recover are pretty well confined/defined.
If you integrate snapshotting with such frameworks, it becomes pretty
easy for both the admins and the users.
I'll talk about other issues in my reply to Oren's email.
Thanks.
--
tejun
Hello,
On 11/04/2010 04:40 AM, Kapil Arya wrote:
> (Sorry for resending the message; the last message contained some html
> tags and was rejected by the server.)
And please also don't top-post. Being the antisocial egomaniacs we
are, people on lkml prefer to dissect the messages we're replying to,
insert insulting comments right where they would be most effective and
remove the passages which can't yield effective insults. :-)
> In our personal view, a key difference between in-kernel and userland
> approaches is the issue of security. The Linux C/R developers state
> the issue very well in their FAQ (question number 7):
>> https://ckpt.wiki.kernel.org/index.php/Faq :
>> 7. Can non-root users checkpoint/restart an application?
>>
>> For now, only users with CAP_SYS_ADMIN privileges can C/R an
>> application. This is to ensure that the checkpoint image has not been
>> tampered with, and it will be treated like a loadable kernel module.
That's an interesting point, but I don't think it's a dealbreaker.
Kernel CR is gonna require a userland agent anyway, and access control
can be done there. Being able to snapshot w/o root privilege is
definitely a plus, but it's not like CR is gonna be deployed on the
majority of desktops and servers (if so, let's talk about it then).
> Strategies like these are easily handled in userspace. We suspect
> that while one may begin with a pure kernel approach, eventually,
> one will still want to add a userland component to achieve this kind
> of flexibility, just as BLCR has already done.
Yeap, agreed. There gotta be user agents which can monitor and
manipulate userland states. It's a fundamentally nasty job, that of
collecting and applying application-specific workarounds. I've only
glanced at the dmtcp paper, so my understanding is pretty superficial.
With that in mind, can you please answer some of my curiosities?
* As Oren pointed out in another message, there are some things which
could seem a bit too visible to the target application, like the
manager thread (is it visible to the application or is it hidden by
the libc wrapper?) and the reserved signal. Also, while it's true
that all programs should be ready to handle -EINTR failures from
system calls, it's something which is very difficult to verify and
test, and could lead to once-in-a-blue-moon head-scratchy kinds of
failures.
I think most of those issues can be tackled with minor, narrow-scoped
changes to the kernel. Do you guys have things in mind which the
kernel could do to make these things more transparent or safer?
* The feats dmtcp achieves with its set of workarounds are impressive,
but at the same time they look quite hairy. Christoph said that
having a standard userland C/R implementation would be quite useful,
and IMHO it would help in that direction if the implementation were
modularized enough that the core functionality and the set of
workarounds could be easily separated. Is it already so?
Thanks.
--
tejun
Hello, Oren.
On 11/04/2010 05:03 AM, Oren Laadan wrote:
> (disclaimer: you may want to grab a cup of your favorite coffee)
Alright, going to get my morning cup of coffee now. :-)
> On 11/02/2010 05:35 PM, Tejun Heo wrote:
>> The patch size itself isn't too big, but I still think it's one scary
>> patch, mostly because of the breadth of code that checkpointing needs
>> to modify, and I suspect that is probably the biggest concern regarding
>> checkpoint-restart from an implementation point of view.
>
> I agree, it *looks* scary. But that's mostly because it's a dumb
> diff out of context, rather than a standard "patch" as a set of
> logical incremental changes. So posting this diff is probably the
> worst way to present the impact on existing code. It merely gives
> a ballpark sense of it.
>
> However, please keep in mind that this diff is really an aggregate
> of multiple unrelated, structured, small changes, including:
> - cleanups (e.g. x86 ptrace)
> - refactoring (e.g. ipc, eventpoll, user-ns)
> - new features/enhancements (e.g. splice, freezer, mm)
>
> I'm confident that each of these will make more sense when presented
> in the proper context.
Yeah, that could be so, but I wasn't really referring to the scariness
of the patch per se but rather to how many subsystems CR needs to
interact with.
>> FWIW, I'm not quite convinced checkpoint-restart can be generally
>
> In the ksummit presentation I gave an extensive list of real
> use-cases (existing and future). The slides are here:
> http://www.cs.columbia.edu/~orenl/talks/ksummit-2010.pdf
> For more technical details there is also the OLS-2010 paper here:
> http://www.cs.columbia.edu/~orenl/papers/ols2010-linuxcr.pdf
> the presentation slides from there are here:
> http://www.cs.columbia.edu/~orenl/talks/ols2010-linuxcr.pdf
Alright, reading...
> I'm unsure which states you have in mind that will not be well defined.
>
> It is a difficult problem, and C/R has limitations, but I think we've
> got it pretty right this time :)
>
> * we save and restore *all* *execution* state of the applications
> (except for well-defined unsupported features; hardware devices
> are one such example).
>
> * we don't save FS state (use filesystem snapshots for that); but
> we do save runtime FS state (e.g. open files).
>
> * we don't save the state of peers (applications/systems) over the
> network; but we do save network connections for proper live-migration.
If you think only about the target processes, yeah, sure, you can
cover most of the stuff, but that's not the impossible part. What's
not defined is the interaction with the rest of the system and
userland. The userland ecosystem is crazy complex. You simply cannot
stop, say, banshee or even pidgin - applications which mingle with the
rest of the system - and restore them later in any safe way.
> (Of course, there is a supporting userspace ecosystem, like utilities
> to do the checkpoint/restart, to freeze/thaw the application, to
> snapshot the filesystem, etc.)
>
> So unless an application uses an unsupported resource, it will be
> possible to checkpoint that application and restart it successfully.
I'm afraid I can't agree with that. You can store and restore the
states which the kernel is aware of, but that's a very small fraction
of the whole picture.
>> As such, I have a difficult time believing it can be something
>> generally useful. IOW, I think talking about its usage in complex
>> environments like common desktops is mostly handwaving. What about X
>> sessions, network connections, states established in other
>> applications via dbus or whatnot? Which files need to be snapshotted
>> together? What about shared mmaps? These questions are not difficult
>> to answer in a generic way; they are impossible.
>
> I have a cool demo (and I gave one today!) that shows how I run one
> desktop session and restart an older desktop session that then runs
> in parallel to my existing session, in another window -> so I have
> both the current and the older session running side by side. (It's a
> version of C/R as a kernel module for an older kernel; we're not yet
> there with linux-cr.) Hand-waving? Maybe, but a pretty convincing
> one ;)
>
> To be clear, C/R is more generic than saving/restoring a single
> process: rather, it works on process hierarchies (and complete
> containers). So a checkpoint will typically capture the state of e.g.
> a VNC server (X session) and the applications (xterm, window manager,
> etc.), and the dbus daemon, and all their open files, sockets, etc.
Sure, you can freeze a whole tree of related processes and move them
around, but if you think about it, it's an already broken scenario.
For example, dbus (or rather the agents listening to it) doesn't only
carry states specific to the set of applications being snapshotted.
It also carries a whole bunch of system-wide states, or states for
other applications. As soon as the system goes on executing after the
checkpoint, the checkpointed image of dbus and its agents becomes
inconsistent and useless. You can't restore it later. You don't know
what happened to other parts of the system in between.
And this problem doesn't stem from technical details of the
implementation. It's fundamental. CR tries to snapshot a subset of a
big state machine and then use the snapshot later or elsewhere. It
doesn't and can't have full visibility into how that subset of states
has interacted and will interact with the rest of the states. As soon
as the whole state machine makes progress, there is no guarantee of
consistency.
Without explicit provisions for specific applications, it just can't
work in a generic manner. Can I move my banshee or gwibber to my next
machine transparently with in-kernel CR, or even restore it later? In
many cases, even I (the user) can't define what the desired states
are.
> (BTW, if you were to live-migrate that X session to another host,
> we'd save the TCP state as well; otherwise, we save the sockets in
> CLOSED state - analogous to what happens when your applications run
> again after the laptop was suspended for a long time.)
>
> Likewise, in my demo, files are not snapshotted independently. Instead,
> the entire file system is snapshotted at once.
>
> Bottom line - it's simpler than it sounds. Let's compare this to
> the save/restore of an entire VM: in a VM you bundle all the state
> inside as a single big package (and this makes life much easier).
> Likewise, in C/R, we bundle all the necessary processes, e.g. an
> entire container, in a single big package - we pack all the data
> necessary to make the checkpoint self-sufficient.
So, that's why it comes down to containers and namespaces. You need
to preemptively put the target applications in separate boxes so that
they don't have much to do with the rest of the system - so that the
states aren't intermixed and can be safely snapshotted without
worrying about the rest of the system.
I'm afraid that's not general or transparent at all. It's extremely
invasive to how a system is set up and used. It basically is poor
man's virtualization, or rather partitioning without hardware support,
and at this point I find it very difficult to justify the added
complexity. Let's just make virtualization better.
>> So, although checkpoint-restart can be very useful in certain
>> circumstances, I don't believe there can be a general implementation.
>> It inevitably needs to put somewhat strict restrictions on what the
>> applications being checkpointed are allowed to do. And after my
>
> Let me try to rephrase: there are restrictions on what applications
> can do if they are to be successfully checkpointed. Examples:
> * tasks that use hardware devices (e.g. a sound card),
> * tasks that use unsupported sockets (e.g. netlink),
> * tasks that use yet-unsupported features (e.g. ptraced tasks).
>
> That said, I'm quite confident that the set of features we support
> (now or within easy reach) already covers a wide range of real
> applications and use-cases.
I think my points are clear now. I'm not really talking about the
kernel resources the hierarchy of checkpointed processes is using. I'm
talking about interaction with the rest of the system and how that
can't be solved in a general manner.
> In contrast, the kernel C/R:
>
> * is much more complete and feature-rich,
> * is entirely transparent to applications (it does not need their
> cooperation, and can even handle debugged tasks),
> * can be highly optimized and do incremental c/r,
> * can do live migration,
> * is easier to maintain in the long run (because you don't need to cheat
> applications by intercepting their kernel calls from userspace!),
> * is flexible enough to allow smart userspace to also be c/r aware, if
> it so wishes,
> * can provide a guarantee that a checkpoint is self-contained and can
> be later restarted.
>
> In fact, DMTCP would be much more useful if it built on linux-cr
> as its checkpoint-restart engine ;)
Yeah, it would definitely be interesting to think about how userland
CR can be improved with some kernel support. That said, I don't think
the differences listed above are that large given the common use
cases.
>> useful, it would need a userland framework or set of workarounds which
>> are aware of and can manipulate userland states anyway. For workloads
>
> What user space "state" needs to be worked around and manipulated?
>
> If you are referring to the file system - then a snapshot is necessary
> with either method, userspace or kernel. If something else, please
> elaborate.
I think the dmtcp paper lists some of them. The message Kapil wrote in
this thread also talks about handling vim. They're inevitable if you
want to checkpoint a subset of processes from a live system. The only
reason those haven't come up with in-kernel CR yet is that they are
hidden behind containers and namespaces.
>> for which checkpointing would be most beneficial (HPC for example), I
>> think something like the above would do just fine, and it would make
>> much more sense to add small features to make userland checkpointing
>> work better than to do the whole thing in the kernel.
>
> Actually, because of the huge optimization potential that exists only
> in kernel-based C/R, HPC applications are likely to benefit
> tremendously from it too. Think about things like incremental
> checkpoints, pre-copy to minimize downtime (as in live-migration), or
> using COW to defer disk IO until after the application has resumed
> execution, and more. None of these is possible with userspace C/R.
>
> I know of several places that do not use C/R because they can't
> stop their long-running processes for longer than a few milliseconds.
> I know how to solve their problems with linux-cr. I doubt any
> userspace mechanism can get there.
I'm sure there will be some benefits to an in-kernel implementation,
but the added complexity is crazy in comparison. I don't think it
would be wise to include this invasive amount of code for the several
places which can't use C/R because they can't afford a few millisecs.
>> I think in-kernel checkpointing is in an awkward place in terms of
>> the tradeoff between its benefits and the complexity of implementing
>> it. If you give up coverage slightly, userland checkpointing is
>> there. If you need reliable coverage, proper virtualization isn't too
>> far away. As such, FWIW, I fail to see enough justification for the
>> added complexity. I'll be happy to be proven wrong tho. :-)
>
> There is a huge gap between what you can (and want to) do with
> a userspace checkpoint-restart implementation and with a kernel one.
> Linux can profit from this feature along multiple axes: the HPC
> market, VPS solutions, desktop mobility, and much more.
>
> I think the added complexity is more than manageable. If you take
> a look at the patch-set (http://www.linux-cr.org/git) you'll see
> that most of the code is straightforward, just full of details,
> and largely tangential to the existing kernel code. The changes
> seen in this "naked" diff make more sense when they appear orderly
> in the context of that logic.
>
> We have shown that the mission is within reach and that C/R can be
> more than a toy implementation. To reduce the complexity of
> *reviewing*, it's time to post the patch-set in small pieces that one
> can digest ...
I'm sorry to be in this position, but the tradeoff just seems way off.
As I wrote earlier, the transparent part of in-kernel CR basically
boils down to implementing pseudo-virtualization without hardware
support, and given the not-too-glorious history of that and the much
higher focus on proper virtualization these days, I just don't think
it makes much sense. It's an extremely niche solution for niche use
cases. If it were a self-contained feature, sure, but it's reaching
into a lot of core subsystems. Sorry, no.
Thank you.
--
tejun
> If you think only about the target processes, yeah, sure, you can
> cover most of the stuff, but that's not the impossible part. What's
> not defined is the interaction with the rest of the system and
> userland. The userland ecosystem is crazy complex. You simply cannot
> stop, say, banshee or even pidgin - applications which mingle with the
> rest of the system - and restore them later in any safe way.
This is why I think it is important to define the limits of
which kernel state features are covered (or going to be
covered) by checkpoint/restart - and then list the applications
that are supported (Oren mentioned the mysql server in this thread).
It will always be easy for someone to point at some application
like powertop and say "we can't migrate that, so checkpoint/restart
is therefore useless" ... this just is not true. This can be
useful without having to be complete (as long as the limits are
well defined).
> I'm afraid I can't agree with that. You can store and restore the
> states which the kernel is aware of, but that's a very small fraction
> of the whole picture.
See above - it may be enough to cover a significant number of
useful cases.
> Sure, you can freeze a whole tree of related processes and move them
> around, but if you think about it, it's an already broken scenario.
> For example, dbus (or rather the agents listening to it) doesn't only
> carry states specific to the set of applications being snapshotted.
> It also carries a whole bunch of system-wide states, or states for
> other applications. As soon as the system goes on executing after the
> checkpoint, the checkpointed image of dbus and its agents becomes
> inconsistent and useless. You can't restore it later. You don't know
> what happened to other parts of the system in between.
Okay - so "dbus" is on the list of "can't do that now, and will
never be able to checkpoint/restore that class" - big deal. I'm
getting repetitive now, but one last time: just because this can't
handle every conceivable case doesn't make it useless.
> I'm afraid that's not general or transparent at all. It's extremely
> invasive to how a system is set up and used. It basically is poor
> man's virtualization, or rather partitioning without hardware support,
> and at this point I find it very difficult to justify the added
> complexity. Let's just make virtualization better.
I don't think that you'll ever make virtualization good enough
to make the HPC people happy.
>> I know of several places that do not use C/R because they can't
>> stop their long-running processes for longer than a few milliseconds.
>> I know how to solve their problems with linux-cr. I doubt any
>> userspace mechanism can get there.
>
> I'm sure there will be some benefits to an in-kernel implementation,
> but the added complexity is crazy in comparison. I don't think it
> would be wise to include this invasive amount of code for the several
> places which can't use C/R because they can't afford a few millisecs.
The CR Kool-Aid hasn't gotten far enough into my system for me to
accept this claim. If these "can't stop for more than a few
milliseconds" processes are HPC workloads, then I'm not seeing how you
can do much to help them. I think these applications are using almost
all of the RAM on the system, and most of the pages are anonymous.
Just how do you checkpoint several GB of dirty pages in a few
milliseconds (when there is almost no free memory on the system)?
If you have something else in mind, then please explain a little more.
-Tony
Hello,
On 11/04/2010 01:48 PM, Luck, Tony wrote:
>> If you think only about the target processes, yeah, sure, you can
>> cover most of the stuff, but that's not the impossible part. What's
>> not defined is the interaction with the rest of the system and
>> userland. The userland ecosystem is crazy complex. You simply cannot
>> stop, say, banshee or even pidgin - applications which mingle with the
>> rest of the system - and restore them later in any safe way.
>
> This is why I think it is important to define the limits of
> which kernel state features are covered (or going to be
> covered) by checkpoint/restart - and then list the applications
> that are supported (Oren mentioned the mysql server in this thread).
> It will always be easy for someone to point at some application
> like powertop and say "we can't migrate that, so checkpoint/restart
> is therefore useless" ... this just is not true. This can be
> useful without having to be complete (as long as the limits are
> well defined).
>
>> I'm afraid I can't agree with that. You can store and restore the
>> states which the kernel is aware of, but that's a very small fraction
>> of the whole picture.
>
> See above - it may be enough to cover a significant number of
> useful cases.
I was arguing that it is far from being _generally_ useful or
transparent. If you're saying that it is useful for certain use cases
and applications, yeah, sure. I never argued against that.
>> I'm afraid that's not general or transparent at all. It's extremely
>> invasive to how a system is set up and used. It basically is poor
>> man's virtualization, or rather partitioning without hardware support,
>> and at this point I find it very difficult to justify the added
>> complexity. Let's just make virtualization better.
>
> I don't think that you'll ever make virtualization good enough
> to make the HPC people happy.
If you think about HPC, a userland implementation is enough. In 99% of
cases, those programs just read and write data files and burn a lot of
CPU cycles. You don't need a lot of fancy stuff to do that. More
important would be integrating with job management so that snapshots
and rollbacks can be done automatically.
I agree that CR would be very useful for certain use cases and
applications. I just can't see where the giant patchset fits between a
userland implementation, which seems enough for the most common use
case of HPC, and virtualization, which is maturing fast.
Thanks.
--
tejun
On Thu, Nov 04, 2010 at 12:34:51AM -0400, Oren Laadan wrote:
> Hi Christoph,
>
> I really wish you would have raised these concerns during the
> ksummit or thereafter. I'm here (LPC) until Friday, and would be
> happy to discuss any aspect of the linux-cr while at it (and if
> needed can post a summary to the list).
Discussing technical topics with slides in a big room is utterly
pointless. Just like during all the other such boring talks during KS,
I was either asleep, working on something important, or out of the room
doing the extended hallway track.
If you want to discuss invasive kernel changes with people, do it by
email. The chance that anyone is going to listen to you is a lot
higher.
On Thu, Nov 04, 2010 at 08:36:16AM +0100, Tejun Heo wrote:
> I hear that there are plans to integrate one of the userland
> snapshotting implementations with an HPC workload manager. ISTR the
> combination being condor + dmtcp, but I'm not sure.
Yes, we are working with Condor to have them validate DMTCP. Time will tell.
- Gene
Thanks for your comments. We apologize for the top-post. It was accidental.
> > In our personal view, a key difference between in-kernel and userland
> > approaches is the issue of security.
> That's an interesting point but I don't think it's a dealbreaker.
> ... but it's not like CR is gonna be deployed on
> majority of desktops and servers (if so, let's talk about it then).
This is a good point to clarify some issues. C/R has several good
targets. For example, BLCR has targeted HPC batch facilities, and
does it well.
DMTCP started life on the desktop, and it's still a primary focus of DMTCP.
We worked to support screen on this release precisely so that advanced
desktop users have the option of putting their whole screen session
under checkpoint control. It complements the core goal of screen:
If you walk away from a terminal, you can get back the session elsewhere.
If your session crashes, you can get back the session elsewhere
(depending on where you save the checkpoint files, of course :-) ).
> * As Oren pointed out in another message, there are some things which
> could seem a bit too visible to the target application. Like the
> manager thread (is it visible to the application or is it hidden by
> the libc wrapper?) and reserved signal. Also, while it's true that
> all programs should be ready to handle -EINTR failure from system
> calls, it's something which is very difficult to verify and test and
> could lead to once-in-a-blue-moon head scratchy kind of failures.
These are also some excellent points for discussion! The manager thread
is visible. For example, if you run a gdb session under checkpoint
control (only available in our unstable branch, currently), then
the gdb session will indeed see the checkpoint manager thread.
So, yes. We are not totally transparent, and a skilled user must
account for this. There are analogies (the manager thread in the
original LinuxThreads, the rare misfortune of gdb to lose
track of the stack frames).
We try to hide the reserved signal (SIGUSR2 by default, but the user can
configure it to anything else). We put wrappers around system calls
that might see our signal handler, but I'm sure there are cases where
we might not succeed --- and so a skilled user would have to configure
to use a different signal handler. And of course, there is the rare
application that repeatedly resets _every_ signal. We encountered
this in an earlier version of Maple, and the Maple developers worked
with us to open up a hole so that we could checkpoint Maple in future versions.
> [while] all programs should be ready to handle -EINTR failure from system
> calls, it's something which is very difficult to verify and test and
> could lead to once-in-a-blue-moon head scratchy kind of failures.
Exactly right! Excellent point. Perhaps this gets down to philosophy,
and what is the nature of a bug. :-) In some cases, we have encountered
this issue. Our solution was either to refuse to checkpoint within
certain system calls, or to check the return value and if there was
an -EINTR, then we would re-execute the system call. This works again,
because we are using wrappers around many (but not all) of the system calls.
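To make that concrete, the retry logic is roughly the following sketch
(illustrative only; the real wrappers also have to distinguish our checkpoint
signal from an interruption the application itself expects):

    /* Minimal sketch of an EINTR-retrying wrapper around read(). */
    #include <errno.h>
    #include <unistd.h>

    ssize_t wrapped_read(int fd, void *buf, size_t count)
    {
        ssize_t ret;
        do {
            ret = read(fd, buf, count);
            /* Retry if our checkpoint signal interrupted the call, so
             * the application never observes the -EINTR we caused. */
        } while (ret == -1 && errno == EINTR);
        return ret;
    }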
> Do you guys have things in mind which the
> kernel can do to make these things more transparent or safer?
For the most part, we've always found a way to work within the current
design of the kernel. We consider this a tribute to the Linux kernel
design. The kernel has provided hooks for the cases that userland C/R needs,
even though the hooks were put there simply on general design principles.
But since you ask :-), there is one thing on our wish list. We
handle address space randomization, vdso, vsyscall, and so on quite well.
We do not turn off address space randomization (although on restart, we
map user segments back to their original addresses). Probably the
randomized value of brk (end-of-data or end of heap) is the thing that
gave us the most troubles and that's where the code is the most hairy.
> * The feats dmtcp achieves with its set of workarounds are impressive
> but at the same time look quite hairy. Christoph said that having a
> standard userland C-R implementation would be quite useful and IMHO
> it would be helpful in that direction if the implementation is
> modularized enough so that the core functionality and the set of
> workarounds can be easily separated. Is it already so?
The implementation is reasonably modularized. In the rush to address
bugs or feature requirements of users, we sometimes cut corners. We
intend to go back and fix those things. Roughly, the architecture of
DMTCP is to do things in two layers: MTCP handles a single
multi-threaded process. There is a separate library mtcp.so.
The higher layer (redundantly again called DMTCP) is implemented
in dmtcphijack.so. In a _very_ rough kind of way, MTCP does a lot
of what would be done within kernel C/R. But the higher DMTCP layer
takes on some of those responsibilities in places. For example,
DMTCP does part of analyzing the pseudo-ttys, since it's not always
easy to ensure that it's the controlling terminal of some process
that can checkpoint things in the MTCP layer.
Beyond that, the wrappers around system calls are essentially
perfectly modular. Some system calls go together to support a single
kernel feature, and those wrappers are kept in a common file.
There are a very few program-specific workarounds. If you look
at the main routine of dmtcp_checkpoint.cpp, you'll find most of them.
For example, if it's a setuid process, since we don't have root privilege,
we can't preload our dmtcphijack.so. So, we copy the setuid process
to our own /tmp, and execute it there without setuid. In the case
of screen, it wants to use /var/... (forgot the directory). But screen
has an option to use a different directory.
Similarly, if the distro is running an NSCD daemon, then gethostname
and similar calls go to the NSCD daemon. On restart, we have to re-initialize
communication with the NSCD daemon.
I have to run to do some other things. But I'll check back on the
remaining (and any new) posts on this list later today. Thanks very
much for the interesting discussion. We've felt too isolated for too long.
But we didn't think we had something important enough before to disturb
the kernel developers with a discussion. I hope DMTCP is starting to become
mature enough that this discussion can now benefit everybody. We certainly
hope to learn a lot from it. Thanks again.
- Gene
====
On Thu, Nov 04, 2010 at 09:05:28AM +0100, Tejun Heo wrote:
> Hello,
>
> On 11/04/2010 04:40 AM, Kapil Arya wrote:
> > (Sorry for resending the message; the last message contained some html
> > tags and was rejected by server)
>
> And please also don't top-post. Being the antisocial egomaniacs we
> are, people on lkml prefer to dissect the messages we're replying to,
> insert insulting comments right where they would be most effective and
> remove the passages which can't yield effective insults. :-)
>
> > In our personal view, a key difference between in-kernel and userland
> > approaches is the issue of security. The Linux C/R developers state
> > the issue very well in their FAQ (question number 7):
> >> https://ckpt.wiki.kernel.org/index.php/Faq :
> >> 7. Can non-root users checkpoint/restart an application ?
> >>
> >> For now, only users with CAP_SYSADMIN privileges can C/R an
> >> application. This is to ensure that the checkpoint image has not been
> >> tampered with and will be treated like a loadable kernel-module.
>
> That's an interesting point but I don't think it's a dealbreaker.
> Kernel CR is gonna require userland agent anyway and access control
> can be done there. Being able to snapshot w/o root privilege
> definitely is a plus but it's not like CR is gonna be deployed on
> majority of desktops and servers (if so, let's talk about it then).
>
> > Strategies like these are easily handled in userspace. We suspect
> > that while one may begin with a pure kernel approach, eventually,
> > one will still want to add a userland component to achieve this kind
> > of flexibility, just as BLCR has already done.
>
> Yeap, agreed. There gotta be user agents which can monitor and
> manipulate userland states. It's a fundamentally nasty job, that of
> collecting and applying application-specific workarounds. I've only
> glanced at the dmtcp paper so my understanding is pretty superficial.
> With that in mind, can you please answer some of my curiosities?
>
> * As Oren pointed out in another message, there are some things which
> could seem a bit too visible to the target application. Like the
> manager thread (is it visible to the application or is it hidden by
> the libc wrapper?) and reserved signal. Also, while it's true that
> all programs should be ready to handle -EINTR failure from system
> calls, it's something which is very difficult to verify and test and
> could lead to once-in-a-blue-moon head scratchy kind of failures.
>
> I think most of those issues can be tackled with minor narrow-scoped
> changes to the kernel. Do you guys have things in mind which the
> kernel can do to make these things more transparent or safer?
>
> * The feats dmtcp achieves with its set of workarounds are impressive
> but at the same time look quite hairy. Christoph said that having a
> standard userland C-R implementation would be quite useful and IMHO
> it would be helpful in that direction if the implementation is
> modularized enough so that the core functionality and the set of
> workarounds can be easily separated. Is it already so?
>
> Thanks.
>
> --
> tejun
On Thu, 2010-11-04 at 08:36 +0100, Tejun Heo wrote:
> Hello,
>
> On 11/04/2010 02:47 AM, Nathan Lynch wrote:
> >> In this case whitelisting the allowed
> >> state by requiring special APIs for all I/O (or even just standard
> >> APIs as long as they are supported by the C/R lib you're linked against)
> >> is the more pragmatic, and I think faithful approach.
> >
> > I don't think users will go for it. They'll continue to use dodgy
> > out-of-tree kernel modules and/or LD_PRELOAD hacks instead of porting
> > their applications to a new library. I think a C/R library is an
> > "ideal" solution, but it's one that nobody would use - especially in
> > HPC, unless the library somehow provides better performance.
>
> I hear that there are plans to integrate one of the userland
> snapshotting implementations with HPC workload manager. ISTR the
> combination to be condor + dmtcp but not sure. I think things like
> that make a lot of sense.
If you look at the C/R implementations of those two projects you'll see
that they don't implement what I take to be hch's suggestion - a library
or platform with special-purpose APIs to which applications are ported
in order to gain C/R ability. For all their good points, the projects
you mention do interposition for glibc's syscall wrappers and provide a
few optional hooks so apps can control certain aspects of C/R.
(Sorry for the length of this email, we are excited about being able
to discuss technical details.)
It is wonderful to have this exchange of techniques and visions. Oren, we
are guessing that you are at Columbia. If so, we would love to have you come up
here and give a talk in Boston. Alternatively, if you prefer, we would be happy
to go to Columbia and give a talk there.
In comparing functionality, one recent bug we had to overcome was with screen
with a hardstatus line and a scroll region for the terminal. We eventually
solved it in a subtle way by sending SIGWINCH, and then lying to screen about
changing the kernel window size, and then sending screen another SIGWINCH while
telling it the true window size. We were pleased to see that Linux C/R also
supports screen and we are curious how it handles this issue of restoring the
scroll region in the X11 terminal window. Thanks.
Oren noted that sometimes it's important to stop the process only for a few
milliseconds while one checkpoints. In DMTCP, we do that by configuring with
--enable-forked-checkpointing. This causes us to fork a child process taking
advantage of copy-on-write and then checkpoint the memory pages of the child
while the parent continues to execute.
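Schematically, it is just this (a sketch; write_checkpoint_image() is a
hypothetical stand-in for the real dump logic):

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical stand-in for the real dump logic. */
    static void write_checkpoint_image(void)
    {
        /* ... walk /proc/self/maps and write the pages out ... */
        puts("checkpoint written");
    }

    void forked_checkpoint(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: its pages are shared copy-on-write with the
             * parent, so it sees a frozen snapshot of memory; dump it
             * at leisure, then exit. */
            write_checkpoint_image();
            _exit(0);
        }
        /* Parent (pid > 0): resumes immediately; reap the child later,
         * e.g. from a SIGCHLD handler, to avoid leaving a zombie. */
    }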
> So a checkpoint will typically capture the state of e.g. a VNC server (X
> session) and the applications (xterm, win-manager etc), and the dbus daemon,
> and all their open files, and sockets etc.
This is a good example of distinct approaches when starting from Kernel C/R or
user-space C/R. We currently checkpoint VNC servers in a way similar to Linux
C/R. However, in the next few months, we want to directly checkpoint a single
X-windows application without the X11-server. The approach is easily understood
by analogy. Currently libc.so talks to the kernel. At checkpoint time, we
interrogate the kernel state and then "break" the connection to the kernel and
checkpoint. Similarly, libX11.so (or libX11-xcb.so) talks to the X11-server. At
checkpoint time, we will interrogate the state of the X11-server and then break
the connection and checkpoint.
> DMTCP is indeed a very cool project. ... It is not my intention to bash
> their great work, but it's important to understand its limitations, so just a
> few examples:
Thanks very much for bringing up these implementation questions. It's wonderful
to have someone interested in the low-level technology to talk to. We would
like to share with you our current solutions and our plans for the future. We
will also add some of our questions about Linux C/R inline. Thanks for the
answers in advance.
> required to link against their library, or modify the binary;
We currently use LD_PRELOAD to transparently preload our library. The user
doesn't see this. If the application is statically linked, then this doesn't
work. Until now, we haven't seen user requests to support statically linked
applications. If we do, there are other techniques to modify the call sites or
entry points for libc routines within the user binary.
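For readers who have not seen the technique, a minimal sketch of one such
interposed wrapper (illustrative only; the real wrappers record much more
state):

    /* Build as a shared library and run with
     *   LD_PRELOAD=./wrapper.so a.out
     * The interposed pipe() forwards to the real libc symbol via
     * dlsym(RTLD_NEXT, ...) after recording what it needs. */
    #define _GNU_SOURCE
    #include <dlfcn.h>

    int pipe(int fds[2])
    {
        static int (*real_pipe)(int[2]);
        int ret;

        if (!real_pipe)
            real_pipe = (int (*)(int[2]))dlsym(RTLD_NEXT, "pipe");
        ret = real_pipe(fds);
        /* ... record fds[0]/fds[1] in checkpoint bookkeeping ... */
        return ret;
    }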
> They overload some signals (so the application can't use them)
By default, DMTCP uses SIGUSR2. At process startup, the user can specify:
dmtcp_checkpoint --mtcp-checkpoint-signal <signum> a.out to change the DMTCP
signal. In an additional point we have found interesting, libc has a similar
policy of using several hardwired signals:
    #define SIGCANCEL       __SIGRTMIN
    #define SIGTIMER        SIGCANCEL
    #define SIGSETXID       (__SIGRTMIN + 1)
So there is a precedent for this approach.
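Concretely, the hiding is done along these lines (a simplified sketch; CKPT_SIG
and the exact policy are illustrative, not our real code):

    /* Sketch: if the application tries to replace the handler for our
     * reserved signal, report success but keep our handler installed. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <signal.h>
    #include <stddef.h>

    #define CKPT_SIG SIGUSR2   /* configurable, as described above */

    int sigaction(int sig, const struct sigaction *act,
                  struct sigaction *oldact)
    {
        static int (*real_sa)(int, const struct sigaction *,
                              struct sigaction *);
        if (!real_sa)
            real_sa = (int (*)(int, const struct sigaction *,
                               struct sigaction *))
                      dlsym(RTLD_NEXT, "sigaction");
        if (sig == CKPT_SIG && act) {
            if (oldact)
                real_sa(sig, NULL, oldact);  /* still report old state */
            return 0;                        /* pretend to succeed */
        }
        return real_sa(sig, act, oldact);
    }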
> Completeness: many real resources are not supported, e.g. eventpoll, ipc,
> pending signals, etc.
IPC and pending signals are supported. We know how to do eventpoll but haven't
encountered a use case from our userbase and so haven't added it yet.
> * Complexity: they technically implement a virtual pid-namespace in userspace
> by intercepting calls to clone(). I wonder if they consider e.g. pid's saved
> on file owners or in afunix creds ? I'll just say it's nearly impossible with
> their 20K lines of code - I know because I did it in a kernel module ...
We do wrap clone and create a table from original PID/TID to current PID/TID
just as you say. To our knowledge, we have wrappers for all system calls
involving a PID/TID except fcntl. We are guessing that either Linux C/R also
keeps a translation table or else restores the original PID/TID. Which do you
do? In the latter case what do you do if a PID/TID is already used by another
process/thread?
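To make this concrete, here is a sketch of the translation for getpid() (the
table is illustrative; the real code covers many more calls and handles TIDs
as well):

    /* Sketch of userspace pid virtualization: return the original
     * (checkpoint-time) pid instead of the current one. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <sys/types.h>
    #include <unistd.h>

    static struct { pid_t orig, cur; } pid_table[64];  /* filled at restart */
    static int pid_table_len;

    static pid_t to_original(pid_t cur)
    {
        int i;
        for (i = 0; i < pid_table_len; i++)
            if (pid_table[i].cur == cur)
                return pid_table[i].orig;
        return cur;   /* unknown pid: pass through unchanged */
    }

    pid_t getpid(void)
    {
        static pid_t (*real_getpid)(void);
        if (!real_getpid)
            real_getpid = (pid_t (*)(void))dlsym(RTLD_NEXT, "getpid");
        return to_original(real_getpid());
    }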
> * Efficiency: from userspace it can't tell which mapped pages are dirty and
> which aren't, not to mention doing incremental checkpoints.
One of the DMTCP team, Artem Polyakov, has developed incremental checkpointing
for DMTCP and for BLCR. We are still evaluating it. It's at:
http://sourceforge.net/projects/hbict
> * Usefulness: can they live-migrate mysql server between two hosts prior to a
> kernel upgrade ?
We have not experimented with live-migration. Live-migration in user space is an
interesting topic but would take us into a deep discussion outside the current
scope. Of course VMware and others already do it. We would enjoy talking further
with you offline. It's certainly a cool use case.
> can they checkpoint stopped processes which cannot cooperate ?
We haven't had a user request for checkpointing stopped processes so far.
However, one can use PTRACE (similar to doing a gdb attach on a stopped process)
to achieve this.
> can they checkpoint/restart postgresql ?
We don't know. We have succeeded on MySQL. We never tried postgresql. What are
the special issues there?
> In contrast, the kernel C/R is:
> ...
> * entirely transparent to applications (does not need their cooperation, can
> even do debugged tasks)
We are not sure what you are referring to by cooperation and debugged tasks. If
it helps, we can say that DMTCP can checkpoint an entire gdb session or just the
process being debugged by gdb, according to the requirements. Our support
for PTRACE is in the unstable branch.
> * is easier to maintain in the long run (because you don't need to cheat
> applications by intercepting their kernel calls from userspace!)
We have to agree to disagree on this one. We see almost no new bugs or issues
with kernel upgrades. The most recent case was the need to add wrappers for
pipe2 (2.6.27) and accept4 (2.6.28), and each wrapper was about 20 new lines of
code.
> * flexible to allow smart userspace to also be c/r aware, if they so wish
DMTCP also has a dmtcpaware facility by which applications can request
checkpoints for themselves or other processes. It also supports user hook
functions for checkpoint, resume, and restart.
> * can provide a guarantee that a checkpoint is self-contained and can be
> later restarted
Could you tell us more about what you mean by guarantee and self-contained?
> In fact, DMTCP will be much more useful if it builds on linux-cr as its
> checkpoint-restart engine ;)
Your suggestion is an interesting one. One of our team members, Jason Ansel, has
made the same suggestion with respect to BLCR. This would be a great experiment
to try and we would be glad to work with you to get an initial version of DMTCP
on top of Linux C/R. DMTCP has a higher layer dmtcphijack.so and a lower layer
libmtcp.so (MTCP) which can be replaced by a modified single process
checkpointer with hooks for dmtcphijack.so. Unfortunately, our group doesn't
have the resources to maintain and develop two branches: DMTCP/MTCP and
DMTCP/Linux C/R. Nevertheless, if you were interested in going forward on the
DMTCP/Linux C/R branch, we could share code and ideas.
> Actually, because of the huge optimization potential that exists only in
> kernel based C/R, the HPC applications are likely to benefit tremendously too
> from it. Think about things like incremental checkpoint, pre-copy to minimize
> downtime (like live-migration), using COW to defer disk IO until after the
> application can resume execution, and more. None of these is possible with
> userspace C/R.
BLCR is a kernel-based C/R package, and appears to be the current standard for
HPC. Are you saying that BLCR should be replaced by Linux C/R? If so, why?
Concerning user space C/R, please see our comments above.
> I know of several places that do not use C/R because they can't stop their
> long running processes for longer than a few milliseconds. I know how to
> solve their problems with linux-cr. I doubt if any userspace mechanism can
> get there.
DMTCP supports forked checkpointing as a configure option. A child is forked
using COW and it writes its memory to disk at leisure.
Thanks,
Gene Cooperman and Kapil Arya
Hello,
On 11/04/2010 05:44 PM, Gene Cooperman wrote:
>>> In our personal view, a key difference between in-kernel and userland
>>> approaches is the issue of security.
>>
>> That's an interesting point but I don't think it's a dealbreaker.
>> ... but it's not like CR is gonna be deployed on
>> majority of desktops and servers (if so, let's talk about it then).
>
> This is a good point to clarify some issues. C/R has several good
> targets. For example, BLCR has targeted HPC batch facilities, and
> does it well.
>
> DMTCP started life on the desktop, and it's still a primary focus of
> DMTCP. We worked to support screen on this release precisely so
> that advanced desktop users have the option of putting their whole
> screen session under checkpoint control. It complements the core
> goal of screen: If you walk away from a terminal, you can get back
> the session elsewhere. If your session crashes, you can get back
> the session elsewhere (depending on where you save the checkpoint
> files, of course :-) ).
Call me skeptical but I still don't see, yet, it being a mainstream
thing (for average sysadmin John and proverbial aunt Tilly). It
definitely is useful for many different use cases tho. Hey, but let's
see.
> These are also some excellent points for discussion! The manager thread
> is visible. For example, if you run a gdb session under checkpoint
> control (only available in our unstable branch, currently), then
> the gdb session will indeed see the checkpoint manager thread.
I don't think gdb seeing it is a big deal as long as it's hidden from
the application itself.
> We try to hide the reserved signal (SIGUSR2 by default, but the user
> can configure it to anything else). We put wrappers around system
> calls that might see our signal handler, but I'm sure there are
> cases where we might not succeed --- and so a skilled user would
> have to configure to use a different signal handler. And of course,
> there is the rare application that repeatedly resets _every_ signal.
> We encountered this in an earlier version of Maple, and the Maple
> developers worked with us to open up a hole so that we could
> checkpoint Maple in future versions.
>
>> [while] all programs should be ready to handle -EINTR failure from system
>> calls, it's something which is very difficult to verify and test and
>> could lead to once-in-a-blue-moon head scratchy kind of failures.
>
> Exactly right! Excellent point. Perhaps this gets down to
> philosophy, and what is the nature of a bug. :-) In some cases, we
> have encountered this issue. Our solution was either to refuse to
> checkpoint within certain system calls, or to check the return value
> and if there was an -EINTR, then we would re-execute the system
> call. This works again, because we are using wrappers around many
> (but not all) of the system calls.
I'm probably missing something but can't you stop the application
using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry
about -EINTR failures (there are some exceptions but nothing really to
worry about). Also, unless the manager thread needs to be always
online, you can inject manager thread by manipulating the target
process states while taking a snapshot.
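Something along the lines of the following sketch should do (error handling
omitted and the actual snapshotting left abstract):

    #include <stddef.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int stop_for_snapshot(pid_t pid)
    {
        if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
            return -1;
        /* The attach delivers a SIGSTOP; wait until the target has
         * actually stopped before touching its state. */
        if (waitpid(pid, NULL, 0) == -1)
            return -1;
        /* ... read /proc/<pid>/* and registers, take the snapshot ... */
        return ptrace(PTRACE_DETACH, pid, NULL, NULL);  /* resume target */
    }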
> But since you ask :-), there is one thing on our wish list. We
> handle address space randomization, vdso, vsyscall, and so on quite
> well. We do not turn off address space randomization (although on
> restart, we map user segments back to their original addresses).
> Probably the randomized value of brk (end-of-data or end of heap) is
> the thing that gave us the most troubles and that's where the code
> is the most hairy.
Can you please elaborate a bit? What do you want to see changed?
> The implementation is reasonably modularized. In the rush to
> address bugs or feature requirements of users, we sometimes cut
> corners. We intend to go back and fix those things. Roughly, the
> architecture of DMTCP is to do things in two layers: MTCP handles a
> single multi-threaded process. There is a separate library mtcp.so.
> The higher layer (redundantly again called DMTCP) is implemented in
> dmtcphijack.so. In a _very_ rough kind of way, MTCP does a lot of
> what would be done within kernel C/R. But the higher DMTCP layer
> takes on some of those responsibilities in places. For example,
> DMTCP does part of analyzing the pseudo-ttys, since it's not always
> easy to ensure that it's the controlling terminal of some process
> that can checkpoint things in the MTCP layer.
>
> Beyond that, the wrappers around system calls are essentially
> perfectly modular. Some system calls go together to support a
> single kernel feature, and those wrappers are kept in a common file.
I see. I just thought that it would be helpful to have the core part
- which does per-process checkpointing and restoring and corresponds
to the features implemented by in-kernel CR - as a separate thing. It
already sounds like that is mostly the case.
I don't have much idea about the scope of the whole thing, so please
feel free to hammer senses into me if I go off track. From what I
read, it seems like once the target process is stopped, dmtcp is able
to get most information necessary from kernel via /proc and other
methods but the paper says that it needs to intercept socket related
calls to gather enough information to recreate them later. I'm
curious what's missing from the current /proc. You can map socket to
inode from /proc/*/fd which can be matched to an entry in
/proc/*/net/PROTO to find out the addresses and most socket options
should be readable via getsockopt. Am I missing something?
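For example, something like this maps an fd to the socket inode which appears
in those tables (a sketch, error paths simplified):

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* readlink() on /proc/<pid>/fd/<fd> yields "socket:[INODE]" for
     * sockets; the inode can then be matched against the entries in
     * /proc/<pid>/net/tcp, udp, unix, ... */
    long socket_inode(pid_t pid, int fd)
    {
        char path[64], link[64];
        long ino;
        ssize_t n;

        snprintf(path, sizeof(path), "/proc/%d/fd/%d", (int)pid, fd);
        n = readlink(path, link, sizeof(link) - 1);
        if (n < 0)
            return -1;
        link[n] = '\0';
        if (sscanf(link, "socket:[%ld]", &ino) != 1)
            return -1;   /* not a socket */
        return ino;
    }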
I think this is why a userland CR implementation makes much more sense.
Most of the states visible to a userland process are rather rigidly
defined by standards and, ultimately, the ABI, and the kernel exports most
of that information to userland one way or the other. Given the
right set of needed features, most of which are probably already
implemented, a userland implementation should have access to most
of the information necessary to checkpoint without resorting to too messy
methods. Then there inevitably needs to be some workarounds to make
CR'd processes behave properly w.r.t. other states on the system, so
userland workarounds are inevitable anyway unless one resorts to
preemptive separation using namespaces and containers, which I frankly
think isn't of much value now and will be even less so going forward.
Thanks.
--
tejun
> Oren noted that sometimes it's important to stop the process only
> for a few milliseconds while one checkpoints. In DMTCP, we do that
> by configuring with --enable-forked-checkpointing. This causes us
> to fork a child process taking advantage of copy-on-write and then
> checkpoint the memory pages of the child while the parent continues
> to execute.
Interesting ... but while the process is only stopped for the duration
of the fork, it may be taking COW faults on almost every page it
touches. I think this will not work well for large HPC applications
that allocate most of physical memory as anonymous pages for the
application. It may even result in an OOM kill if you don't complete
the checkpoint of the child and have it exit in a timely manner.
-Tony
On Fri, Nov 05, 2010 at 04:57:33AM -0700, Luck, Tony wrote:
> > Oren noted that sometimes it's important to stop the process only
> > for a few milliseconds while one checkpoints. In DMTCP, we do that
> > by configuring with --enable-forked-checkpointing. This causes us
> > to fork a child process taking advantage of copy-on-write and then
> > checkpoint the memory pages of the child while the parent continues
> > to execute.
>
> Interesting ... but while the process is only stopped for the duration
> of the fork, it may be taking COW faults on almost every page it
> touches. I think this will not work well for large HPC applications
> that allocate most of physical memory as anonymous pages for the
> application. It may even result in an OOM kill if you don't complete
> the checkpoint of the child and have it exit in a timely manner.
>
> -Tony
>
I agree with you that forked checkpointing is probably not what you
want in the middle of an HPC computation. But isn't that part of
the nature of COW? Whether the COW is invoked within the kernel,
or from outside the kernel via fork --- in either case, when you have
mostly dirty pages, you will have to copy most of the pages.
Do I understand your point correctly? Thanks,
- Gene
On Thu, Nov 4, 2010 at 8:55 PM, Kapil Arya <[email protected]> wrote:
>> * Complexity: they technically implement a virtual pid-namespace in userspace
>> by intercepting calls to clone(). I wonder if they consider e.g. pid's saved
>> on file owners or in afunix creds ? I'll just say it's nearly impossible with
>> their 20K lines of code - I know because I did it in a kernel module ...
>
> We do wrap clone and create a table from original PID/TID to current PID/TID
> just as you say. To our knowledge, we have wrappers for all system calls
> involving a PID/TID except fcntl. We are guessing that either Linux C/R also
> keeps a translation table or else restores the original PID/TID. Which do you
> do? In the latter case what do you do if a PID/TID is already used by another
> process/thread?
>
Like Oren said, we run the application inside the container - which would have
its own pid namespace. When we restart, we again create a container, which
starts with a fresh pid namespace, so the pids will not be in use. IOW, a
process has a virtual pid and a global pid. The virtual pid is what the
application sees when it calls getpid(), and that pid will be correctly
restored when you create the container.
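A minimal sketch of that restart-side setup (the helper path is hypothetical,
and real code needs privilege checks and error handling):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int restart_init(void *arg)
    {
        /* Inside the new pid namespace this task is pid 1, so the
         * restored tasks can be re-created with their original
         * (virtual) pids without collisions. */
        execl("/path/to/restart-helper", "restart-helper", (char *)NULL);
        return 1;   /* only reached if exec fails */
    }

    int main(void)
    {
        char *stack = malloc(64 * 1024);
        /* clone() takes the *top* of the child's stack on x86. */
        pid_t pid = clone(restart_init, stack + 64 * 1024,
                          CLONE_NEWPID | SIGCHLD, NULL);
        waitpid(pid, NULL, 0);
        return 0;
    }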
Sukadev
On 11/04/2010 04:05 AM, Tejun Heo wrote:
> Hello,
>
> On 11/04/2010 04:40 AM, Kapil Arya wrote:
>> (Sorry for resending the message; the last message contained some html
>> tags and was rejected by server)
>
> And please also don't top-post. Being the antisocial egomaniacs we
> are, people on lkml prefer to dissect the messages we're replying to,
> insert insulting comments right where they would be most effective and
> remove the passages which can't yield effective insults. :-)
>
>> In our personal view, a key difference between in-kernel and userland
>> approaches is the issue of security. The Linux C/R developers state
>> the issue very well in their FAQ (question number 7):
>>> https://ckpt.wiki.kernel.org/index.php/Faq :
>>> 7. Can non-root users checkpoint/restart an application ?
>>>
>>> For now, only users with CAP_SYSADMIN privileges can C/R an
>>> application. This is to ensure that the checkpoint image has not been
>>> tampered with and will be treated like a loadable kernel-module.
>
> That's an interesting point but I don't think it's a dealbreaker.
> Kernel CR is gonna require userland agent anyway and access control
> can be done there.
Indeed, this is a restriction on the new eclone() syscall, and can
be addressed with proper userspace tools (including crypto-signing the
checkpoint image). The core of the c/r code allows a user to
restore anything within the user's privilege level.
> Being able to snapshot w/o root privilege
> definitely is a plus but it's not like CR is gonna be deployed on
> majority of desktops and servers (if so, let's talk about it then).
Why not? It has zero overhead when not in use, and a reasonable
code footprint (which can be reduced by modularizing some of it,
but that's beside the point).
>> Strategies like these are easily handled in userspace. We suspect
>> that while one may begin with a pure kernel approach, eventually,
>> one will still want to add a userland component to achieve this kind
>> of flexibility, just as BLCR has already done.
>
> Yeap, agreed. There gotta be user agents which can monitor and
> manipulate userland states. It's a fundamentally nasty job, that of
Are we talking about distributed checkpoint or "standalone"?
DMTCP relies on user agents to allow distributed/remote execution
in a manner mostly transparent to the application. Many distributed
systems don't require (and do not use) user agents. Consider a
multi-tier system with a web server, an sql server and some application
servers. These are not suitable to DMTCP's mode of work.
(This is not to say DMTCP isn't useful - it's a clever piece of
software with specific goals and more geared towards HPC needs).
Now regarding "standalone" c/r, if you want to save/restore a single
process or a subset of the processes of a system without the rest of it, then
you will always need user agents, regardless of the userspace/kernel
method. Likewise, their work on those tools will be just as useful
independently of which c/r 'engine' it uses.
When you include all the relevant processes (e.g. an entire VNC
session, a web server, HPC and batch jobs), you generally don't
need the user agents. The checkpoint is self-contained, and linux-cr
can provide you that guarantee at checkpoint time.
> collecting and applying application-specific workarounds. I've only
> glanced at the dmtcp paper so my understanding is pretty superficial.
> With that in mind, can you please answer some of my curiosities?
>
> * As Oren pointed out in another message, there are some things which
> could seem a bit too visible to the target application. Like the
> manager thread (is it visible to the application or is it hidden by
> the libc wrapper?) and reserved signal. Also, while it's true that
> all programs should be ready to handle -EINTR failure from system
> calls, it's something which is very difficult to verify and test and
> could lead to once-in-a-blue-moon head scratchy kind of failures.
If there is a will, there is (almost always) a way ;)
What MTCP does, IIUC, is wrap around the applications with a complete
pid-namespace (and more) in userspace. There are/were also commercial
products that do that. It's a tremendous effort and I'm impressed by
their (MTCP) work so far.
It is important to understand that it has a price tag: performance
and complexity. It's usually useful for HPC needs, but unsuitable
for the generic server/VPS space.
>
> I think most of those issues can be tackled with minor narrow-scoped
> changes to the kernel. Do you guys have things in mind which the
> kernel can do to make these things more transparent or safer?
Hmmm... the kernel already does much of it - for instance, we have
neat pid-namespace infrastructure; does it make sense to go to
the trouble of adding interfaces to provide for pid-virtualization
in userspace? We should be past that ...
Moreover, your objection was based on the apparent complexity of
a badly presented aggregate diff (and I disagree: most of it is
simple refactoring and cleanups). However, that very set of
"narrow-scoped changes" to the kernel that you suggest will take
life in the form of kernel patches that do more than these
and achieve less.
> * The feats dmtcp achieves with its set of workarounds are impressive
> but at the same time look quite hairy. Christoph said that having a
> standard userland C-R implementation would be quite useful and IMHO
> it would be helpful in that direction if the implementation is
> modularized enough so that the core functionality and the set of
> workarounds can be easily separated. Is it already so?
From what I understand, the 'wrapper' functionality to support
distributed operation is said to be well modularized from the
actual c/r engine - which will allow it to use better c/r engines;
and coincidentally, I have one in mind... ;)
Oren.
On 11/05/2010 05:28 AM, Tejun Heo wrote:
> Hello,
>
> On 11/04/2010 05:44 PM, Gene Cooperman wrote:
>>>> In our personal view, a key difference between in-kernel and userland
>>>> approaches is the issue of security.
>>>
>>> That's an interesting point but I don't think it's a dealbreaker.
>>> ... but it's not like CR is gonna be deployed on
>>> majority of desktops and servers (if so, let's talk about it then).
>>
>> This is a good point to clarify some issues. C/R has several good
>> targets. For example, BLCR has targeted HPC batch facilities, and
>> does it well.
>>
>> DMTCP started life on the desktop, and it's still a primary focus of
>> DMTCP. We worked to support screen on this release precisely so
>> that advanced desktop users have the option of putting their whole
>> screen session under checkpoint control. It complements the core
>> goal of screen: If you walk away from a terminal, you can get back
>> the session elsewhere. If your session crashes, you can get back
>> the session elsewhere (depending on where you save the checkpoint
>> files, of course :-) ).
>
> Call me skeptical but I still don't see, yet, it being a mainstream
> thing (for average sysadmin John and proverbial aunt Tilly). It
> definitely is useful for many different use cases tho. Hey, but let's
> see.
>
>> These are also some excellent points for discussion! The manager thread
>> is visible. For example, if you run a gdb session under checkpoint
>> control (only available in our unstable branch, currently), then
>> the gdb session will indeed see the checkpoint manager thread.
>
> I don't think gdb seeing it is a big deal as long as it's hidden from
> the application itself.
>
>> We try to hide the reserved signal (SIGUSR2 by default, but the user
>> can configure it to anything else). We put wrappers around system
>> calls that might see our signal handler, but I'm sure there are
>> cases where we might not succeed --- and so a skilled user would
>> have to configure to use a different signal handler. And of course,
>> there is the rare application that repeatedly resets _every_ signal.
>> We encountered this in an earlier version of Maple, and the Maple
>> developers worked with us to open up a hole so that we could
>> checkpoint Maple in future versions.
>>
>>> [while] all programs should be ready to handle -EINTR failure from system
>>> calls, it's something which is very difficult to verify and test and
>>> could lead to once-in-a-blue-moon head scratchy kind of failures.
>>
>> Exactly right! Excellent point. Perhaps this gets down to
>> philosophy, and what is the nature of a bug. :-) In some cases, we
>> have encountered this issue. Our solution was either to refuse to
>> checkpoint within certain system calls, or to check the return value
>> and if there was an -EINTR, then we would re-execute the system
>> call. This works again, because we are using wrappers around many
>> (but not all) of the system calls.
>
> I'm probably missing something but can't you stop the application
> using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry
> about -EINTR failures (there are some exceptions but nothing really to
> worry about). Also, unless the manager thread needs to be always
> online, you can inject manager thread by manipulating the target
> process states while taking a snapshot.
This is an excellent example to demonstrate several points:
* To freeze the processes, you can use the (quote) "hairy" signal
overload mechanism, or the even hairier ptrace; both, by the way,
have performance problems with many processes/threads.
Or you can use the in-kernel freezer-cgroup, and forget about
workarounds, like linux-cr does. And ~200 lines in said diff
are dedicated exactly to that.
* Then, because both the workaround and the entire philosophy
of the MTCP c/r engine is that affected processes _participate_ in
the checkpoint, their syscalls _must_ be interrupted. In contrast,
the linux-cr kernel approach allows one not only to checkpoint
processes without their collaboration, but also builds on the native
signal handling kernel code to restart the system calls (both after
unfreeze, and after restart), such that the original process
does not observe -EINTR.
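From userspace, using the freezer is a single file write (a sketch; the mount
point and cgroup name below are assumptions, and the target tasks must already
have been placed in the cgroup):

    #include <stdio.h>

    static int freezer_set(const char *state)   /* "FROZEN" or "THAWED" */
    {
        /* Assumes the freezer cgroup is mounted at /cgroup/freezer and
         * the tasks to checkpoint were added to the 'ckpt' group. */
        FILE *f = fopen("/cgroup/freezer/ckpt/freezer.state", "w");
        if (!f)
            return -1;
        fprintf(f, "%s\n", state);
        return fclose(f);
    }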
>> But since you ask :-), there is one thing on our wish list. We
>> handle address space randomization, vdso, vsyscall, and so on quite
>> well. We do not turn off address space randomization (although on
>> restart, we map user segments back to their original addresses).
>> Probably the randomized value of brk (end-of-data or end of heap) is
>> the thing that gave us the most troubles and that's where the code
>> is the most hairy.
>
> Can you please elaborate a bit? What do you want to see changed?
Aha ... another great example: yet another piece of the suspect
diff in question is dedicated to allowing a restarting process to
request a specific location for the vdso.
BTW, a real security expert (and I'm not one...) may argue that
this operation should only be allowed to privileged users. In fact,
if your code gets around the linux ASLR mechanisms, then someone
should fix the kernel ASLR code :)
>> The implementation is reasonably modularized. In the rush to
>> address bugs or feature requirements of users, we sometimes cut
>> corners. We intend to go back and fix those things. Roughly, the
>> architecture of DMTCP is to do things in two layers: MTCP handles a
>> single multi-threaded process. There is a separate library mtcp.so.
>> The higher layer (redundantly again called DMTCP) is implemented in
>> dmtcphijack.so. In a _very_ rough kind of way, MTCP does a lot of
>> what would be done within kernel C/R. But the higher DMTCP layer
>> takes on some of those responsibilities in places. For example,
>> DMTCP does part of analyzing the pseudo-ttys, since it's not always
>> easy to ensure that it's the controlling terminal of some process
>> that can checkpoint things in the MTCP layer.
>>
>> Beyond that, the wrappers around system calls are essentially
>> perfectly modular. Some system calls go together to support a
>> single kernel feature, and those wrappers are kept in a common file.
>
> I see. I just thought that it would be helpful to have the core part
> - which does per-process checkpointing and restoring and corresponds
> to the features implemented by in-kernel CR - as a separate thing. It
> already sounds like that is mostly the case.
FWIW, the restart portion of linux-cr is designed with this in
mind - it is flexible enough to accommodate smart userspace
tools and wrappers that wish to muck with the processes and
their resources post-restart (but before the processes resume
execution). For example, a distributed checkpoint tool could,
at restart time, reestablish the necessary network connections
(which is much different than live migration of connections,
and clearly not a kernel task). This way, a distributed application
can be migrated from one set of hosts to another, on different
networks, with very little effort.
>
> I don't have much idea about the scope of the whole thing, so please
> feel free to hammer senses into me if I go off track. From what I
> read, it seems like once the target process is stopped, dmtcp is able
> to get most information necessary from kernel via /proc and other
> methods but the paper says that it needs to intercept socket related
> calls to gather enough information to recreate them later. I'm
> curious what's missing from the current /proc. You can map socket to
> inode from /proc/*/fd which can be matched to an entry in
> /proc/*/net/PROTO to find out the addresses and most socket options
> should be readable via getsockopt. Am I missing something?
So you'll need mechanisms not only to read the data at checkpoint
time but also to reinstate the data at restart time. By the time
you are done, the kernel will carry all the c/r code (the suspect
diff in question _and_ the rest of the logic) in the form of new
interfaces and ABIs to userspace; the userspace code will grow some
more hair; and there will be zero maintainability gain. And at the
same time you won't be able to leverage optimizations only possible
in the kernel.
>
> I think this is why a userland CR implementation makes much more sense.
> Most of the states visible to a userland process are rather rigidly
> defined by standards and, ultimately, the ABI, and the kernel exports most
> of that information to userland one way or the other. Given the
> right set of needed features, most of which are probably already
> implemented, a userland implementation should have access to most
> of the information necessary to checkpoint without resorting to too messy
> methods. Then there inevitably needs to be some workarounds to make
> CR'd processes behave properly w.r.t. other states on the system, so
> userland workarounds are inevitable anyway unless one resorts to
To be precise, there are three types of userland workarounds:
1) userland workarounds to make a restarted application work when
peer processes aren't saved - e.g., in distributed checkpoint you
need a workaround to rebuild the socket to the peer; or the
example with the 'nscd' daemon from earlier in the thread.
These are needed regardless of the c/r engine of choice. In many
cases they can be avoided if applications are run in containers.
(which can be as simple as running a program using 'nohup')
2) userland workarounds to duplicate virtualization logic already
done by the kernel - like the userspace pid-namespace and the
complex logic and hacks needed to make it work. This is completely
unnecessary when you do kernel c/r.
3) userland workarounds to compensate for the fact that userspace
can't get or set some state during checkpoint or restart. For
example, in the kernel it's trivial to track shared files. How
would you tell, from userspace, whether fd[0] of parent A and child B is
the same file opened once and then inherited, or the same filename
opened twice individually? For files, it is possible to figure
this out in user space, e.g. by intercepting and tracking all forks
and all file operations (including passing fd's via af_unix sockets).
There are other hairy ways to do it, but not quite so for other
resources.
As another example, consider SIDs and PGIDs. With proper algorithms
you can ensure that your processes get the right SID at fork time.
But in the general case, you can't reproduce PGIDs accurately
without replaying how the processes (including those that have
already died) behaved.
And to track zombies at checkpoint, you'd need to actually collect
them, so you must do it in a hairy wrapper, and keep the secret
until the application calls wait(). But then, there may be some
side effects of collecting zombies, e.g. the pid may be reused
against the application's expectation.
Some of these have workarounds, some not. Do you really think that
re-implementing Linux namespaces in userspace is the way to go?
Then, you can add to the kernel an endless amount of interfaces to
export all of this - both the data, and the functionality to
reinstate this data at restart. But ... wait -- isn't that what
linux-cr already does?
> preemptive separation using namespaces and containers, which I frankly
> think isn't of much value now and will be even less so going forward.
That is one opinion. Then there are people using VPSs in commercial
and private environments, for example.
VMs are a wonderful (re)invention. Regardless of any one single
person's opinion about VMs vs containers, both are here to stay, and
both have their use-cases and users. IMHO, it is wrong to ignore the
need for c/r and migration capabilities for containers, whether
they run full desktop environments, multiple applications or single
processes.
Oren.
> I'm probably missing something but can't you stop the application
> using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry
> about -EINTR failures (there are some exceptions but nothing really to
> worry about). Also, unless the manager thread needs to be always
> online, you can inject manager thread by manipulating the target
> process states while taking a snapshot.
In fact CryoPid uses exactly the same approach and has been around for about 5
years. Not as much development effort has gone into CryoPid as DMTCP and so its
application coverage is not as broad. But the larger issue for using PTRACE is
that you cannot have two superiors tracing the same inferior process. So if you
want to checkpoint a gdb session or valgrind or tmux or strace, then you cannot
directly control and quiesce the inferior process being traced.
Beyond that, we also have a vision (not yet implemented) of process
virtualization by which one can change the behavior of a program. For example,
if a distributed computation runs over InfiniBand, can we migrate it to a TCP/IP
cluster? For this, one needs the flexibility of wrappers around system calls.
This vision of process virtualization also motivates why our own research
project has steered away from in-kernel C/R.
> > But since you ask :-), there is one thing on our wish list. We
> > handle address space randomization, vdso, vsyscall, and so on quite
> > well. We do not turn off address space randomization (although on
> > restart, we map user segments back to their original addresses).
> > Probably the randomized value of brk (end-of-data or end of heap) is
> > the thing that gave us the most troubles and that's where the code
> > is the most hairy.
>
> Can you please elaborate a bit? What do you want to see changed?
Yes, we would love to elaborate :-). We began DMTCP with Linux kernel 2.6.3.
When Address Space Layout Randomization was added, we were forced to add some
hacks concerning VDSO location and end-of-data. End-of-data is the uglier part.
On restart, we directly map each memory segment to the original address it had
at checkpoint time. The issue comes in mapping the heap back to its original location.
We call sbrk() to reset the end-of-data to the end of the original heap. This
fails if the randomized beginning-of-data/end-of-data given to us by the kernel
for the restarted process is too far away from where we want to remap the heap.
To get around this, we play games with legacy layout, other personality
parameters, and RLIMIT_STACK (since the kernel uses RLIMIT_STACK in choosing the
appropriate memory layout).
For our wish list, we would like a way of telling the kernel where to set
beginning-of-data/end-of-data. Curiously enough, at the time at which Linux
started randomizing address space, there was discussion of offering exactly this
facility for the sake of legacy programs, but it turned out not to be needed.
Similarly, it would be nice to tell the kernel where we want the VDSO page.
Currently, we get around this by keeping two VDSO pages, the old one which we
restore and the new one specified to us by the kernel when the restart process
is created. This works well for us, and so controlling the address of the VDSO
page is less important.
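To give a flavor of those games, a sketch (saved_brk stands for the value
recorded at checkpoint time; real code checks every return value):

    #include <sys/personality.h>
    #include <unistd.h>

    int restore_brk(void *saved_brk)
    {
        /* Ask for a non-randomized layout; this takes effect for the
         * exec()ed restart process, giving a heap placement closer to
         * what the checkpointed process had. */
        personality(personality(0xffffffff) | ADDR_NO_RANDOMIZE);
        /* Pull end-of-data back to its checkpoint-time value.  This
         * fails when the randomized start of the heap ended up too far
         * from where the old heap must be re-mapped. */
        return brk(saved_brk);
    }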
> I don't have much idea about the scope of the whole thing, so please
> feel free to hammer senses into me if I go off track. From what I
> read, it seems like once the target process is stopped, dmtcp is able
> to get most information necessary from kernel via /proc and other
> methods but the paper says that it needs to intercept socket related
> calls to gather enough information to recreate them later. I'm
> curious what's missing from the current /proc. You can map socket to
> inode from /proc/*/fd which can be matched to an entry in
> /proc/*/net/PROTO to find out the addresses and most socket options
> should be readable via getsockopt. Am I missing something?
The design of DMTCP was decided upon roughly during the period from Linux 2.6.3
through Linux 2.6.18. At that time, /proc/*/net did not exist. You are right
that this can provide a much better design for DMTCP and eliminate some of our
wrappers. Thanks very much for pointing this out. We are now eager to implement
a new design based on /proc/*/net in the near future.
Since /proc/*/net provides a simpler design for sockets, we started wondering
what other simplifications may be possible. Here is one possibility: in the case
of shared file descriptors, DMTCP goes through two barriers in order to decide
which process will be responsible for checkpointing which shared file
descriptor. It works and the overhead is reasonable, but if you have additional
suggestions for this case, we would be very interested.
> I think this is why a userland CR implementation makes much more sense.
> Most of the states visible to a userland process are rather rigidly
> defined by standards and, ultimately, the ABI, and the kernel exports most
> of that information to userland one way or the other. Given the
> right set of needed features, most of which are probably already
> implemented, a userland implementation should have access to most
> of the information necessary to checkpoint without resorting to too messy
> methods. Then there inevitably needs to be some workarounds to make
> CR'd processes behave properly w.r.t. other states on the system, so
> userland workarounds are inevitable anyway unless one resorts to
> preemptive separation using namespaces and containers, which I frankly
> think isn't of much value now and will be even less so going forward.
It's a very good point and we agree completely. Here are some examples where we
believe a userland component is inevitable even if one begins with in-kernel
C/R:
1. NSCD daemon -- in calls to libc::gethostname() etc., libc arranges for
communication by sharing a memory segment with the application process. Our
code recognizes this shared memory because it starts with /var/*/nscd.
2. syslogd -- Applications using syslog have a socket open to the syslog daemon.
DMTCP makes a system call to turn off logging at checkpoint time.
3. X-windows terminals -- xterm/gnome-terminal/konsole all emulate ANSI
terminals. They support various ANSI features such as setting up a scrolling
region above the status line. GNU screen uses the scrolling-region feature. On
restart, we have to convince GNU screen and similar programs to re-initialize
their ANSI terminal. We do this successfully by sending a SIGWINCH on
restart, since such a program has to re-initialize the ANSI terminal whenever
the window size changes. In fact we send one SIGWINCH, and when the
application calls ioctl() to get the window size, we lie and say that the
window size changed; we then send another SIGWINCH from within the wrapper to
force the application to recheck the window size and discover that the window
is back to its original size (see the sketch after this list).
4. X11 apps -- The current approach to checkpointing X-windows applications is
to checkpoint them within a VNC server. We plan to add wrappers around calls to
libX11.so so that we can discover the state of an X11 window at checkpoint
time and then restart just the single X11 application. This avoids the need
to also checkpoint the X11 server, which minimizes the size of the
checkpoint image that needs to be written to the disk.
5. GNU Screen -- DMTCP sets SCREEN_DIR to a temp directory in order to avoid
the issue that occurs when the setuid screen process tries to access
/var/run/uscreen. Otherwise we would have difficulty at restart time, when the
checkpoint image has no setuid privilege. We don't know if there are similar
issues with an in-kernel C/R.
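As promised in item 3, here is a sketch of the double-SIGWINCH trick (the
lied_once flag and the forwarding details are simplified from our real
wrapper):

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <signal.h>
    #include <stdarg.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    static int lied_once;   /* reset to 0 by the restart code */

    int ioctl(int fd, unsigned long req, ...)
    {
        static int (*real_ioctl)(int, unsigned long, void *);
        va_list ap;
        void *arg;
        int ret;

        if (!real_ioctl)
            real_ioctl = (int (*)(int, unsigned long, void *))
                         dlsym(RTLD_NEXT, "ioctl");
        va_start(ap, req);
        arg = va_arg(ap, void *);
        va_end(ap);

        ret = real_ioctl(fd, req, arg);
        if (ret == 0 && req == TIOCGWINSZ && !lied_once) {
            struct winsize *ws = arg;
            ws->ws_row--;               /* claim the size "changed"  */
            lied_once = 1;
            kill(getpid(), SIGWINCH);   /* force a second size check */
        }
        return ret;
    }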
We really enjoyed this discussion. If you are interested, we would be happy to
talk further by phone in order to take advantage of the higher bandwidth.
Best,
-Gene and Kapil
On Fri, Nov 05, 2010 at 01:17:03PM -0400, Gene Cooperman wrote:
> On Fri, Nov 05, 2010 at 04:57:33AM -0700, Luck, Tony wrote:
> > > Oren noted that sometimes it's important to stop the process only
> > > for a few milliseconds while one checkpoints. In DMTCP, we do that
> > > by configuring with --enable-forked-checkpointing. This causes us
> > > to fork a child process taking advantage of copy-on-write and then
> > > checkpoint the memory pages of the child while the parent continues
> > > to execute.
> >
> > Interesting ... but while the process is only stopped for the duration
> > of the fork, it may be taking COW faults on almost every page it
> > touches. I think this will not work well for large HPC applications
> > that allocate most of physical memory as anonymous pages for the
> > application. It may even result in an OOM kill if you don't complete
> > the checkpoint of the child and have it exit in a timely manner.
> >
> > -Tony
> >
>
> I agree with you that forked checkpointing is probably not what you
> want in the middle of an HPC computation. But isn't that part of
> the nature of COW? Whether the COW is invoked within the kernel,
> or from outside the kernel via fork --- in either case, when you have
> mostly dirty pages, you will have to copy most of the pages.
The current linux-cr approach to handling [dirty] pages doesn't use COW.
The tasks are frozen using the cgroup freezer and thus unable to modify
the pages. So we don't have to mess with page tables nor do we pay
any extra overhead for page faults.
If we ever implement thawed checkpointing -- checkpointing while
the task isn't frozen -- then we'd probably use COW and see
the same faults. The difference then would be that in-kernel we
wouldn't have one extra task per mm being checkpointed.
Cheers,
-Matt Helsley
On 11/05/2010 09:16 PM, Matt Helsley wrote:
> On Fri, Nov 05, 2010 at 01:17:03PM -0400, Gene Cooperman wrote:
>> On Fri, Nov 05, 2010 at 04:57:33AM -0700, Luck, Tony wrote:
>>>> Oren noted that sometimes it's important to stop the process only
>>>> for a few milliseconds while one checkpoints. In DMTCP, we do that
>>>> by configuring with --enable-forked-checkpointing. This causes us
>>>> to fork a child process taking advantage of copy-on-write and then
>>>> checkpoint the memory pages of the child while the parent continues
>>>> to execute.
>>>
>>> Interesting ... but while the process is only stopped for the duration
>>> of the fork, it may be taking COW faults on almost every page it
>>> touches. I think this will not work well for large HPC applications
>>> that allocate most of physical memory as anonymous pages for the
>>> application. It may even result in an OOM kill if you don't complete
>>> the checkpoint of the child and have it exit in a timely manner.
>>>
>>> -Tony
>>>
>>
>> I agree with you that forked checkpointing is probably not what you
>> want in the middle of an HPC computation. But isn't that part of
>> the nature of COW? Whether the COW is invoked within the kernel,
>> or from outside the kernel via fork --- in either case, when you have
>> mostly dirty pages, you will have to copy most of the pages.
>
> The current linux-cr approach to handling [dirty] pages doesn't use COW.
> The tasks are frozen using the cgroup freezer and thus unable to modify
> the pages. So we don't have to mess with page tables nor do we pay
> any extra overhead for page faults.
The current linux-cr patchset leaves out optimizations
for simplicity of reviewing - first get it working and reviewed.
We experimented with optimizations in previous systems.
>
> If we ever implement thawed checkpointing -- checkpointing while
> the task isn't frozen -- then we'd probably use COW and see
> the same faults. The difference then would be that in-kernel we
> wouldn't have one extra task per mm being checkpointed.
Thawed checkpointing can be done without any COW tax, by leveraging
the native hardware dirty bit in page tables. There is no need to
trigger additional checkpoints. Tracking modified pages using the
dirty bit is a feature also desired by the KVM community, and we
plan to work with them on implementing it.
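(For contrast, the usual userland approximation of write tracking --
and the source of the fault tax -- is mprotect() plus a SIGSEGV
handler, roughly as sketched below. Every first write to a page traps,
which is exactly what the hardware dirty bit avoids. Assumes 4 KiB
pages; error handling omitted.)

#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE 4096UL

static void on_write(int sig, siginfo_t *si, void *ctx)
{
    char *page = (char *)((uintptr_t)si->si_addr & ~(PAGE - 1));
    (void)sig; (void)ctx;
    /* ... record "page is dirty" in a bitmap here ... */
    mprotect(page, PAGE, PROT_READ | PROT_WRITE);  /* retry the write */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_write;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    char *buf = mmap(NULL, 16 * PAGE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mprotect(buf, 16 * PAGE, PROT_READ);  /* arm write tracking */
    buf[5 * PAGE] = 1;   /* first write: one trap, then proceeds */
    return 0;
}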
Oren.
On Sat, Nov 06, 2010 at 12:06:09AM -0400, Oren Laadan wrote:
> On 11/05/2010 09:16 PM, Matt Helsley wrote:
> > On Fri, Nov 05, 2010 at 01:17:03PM -0400, Gene Cooperman wrote:
> >> On Fri, Nov 05, 2010 at 04:57:33AM -0700, Luck, Tony wrote:
> >>>> Oren noted that sometimes it's important to stop the process only
> >>>> for a few milliseconds while one checkpoints. In DMTCP, we do that
> >>>> by configuring with --enable-forked-checkpointing. This causes us
> >>>> to fork a child process taking advantage of copy-on-write and then
> >>>> checkpoint the memory pages of the child while the parent continues
> >>>> to execute.
> >>>
> >>> Interesting ... but while the process is only stopped for the duration
> >>> of the fork, it may be taking COW faults on almost every page it
> >>> touches. I think this will not work well for large HPC applications
> >>> that allocate most of physical memory as anonymous pages for the
> >>> application. It may even result in an OOM kill if you don't complete
> >>> the checkpoint of the child and have it exit in a timely manner.
<snip>
> > The current linux-cr approach to handling [dirty] pages doesn't use COW.
> > The tasks are frozen using the cgroup freezer and thus unable to modify
> > the pages. So we don't have to mess with page tables nor do we pay
> > any extra overhead for page faults.
>
> The current linux-cr patchset leaves out optimizations
> for simplicity of reviewing - first get it working and reviewed.
> We experimented with optimizations in previous systems.
>
> > If we ever implement thawed checkpointing -- checkpointing while
> > the task isn't frozen -- then we'd probably use COW and see
> > the same faults. The difference then would be that in-kernel we
> > wouldn't have one extra task per mm being checkpointed.
>
> Thawed checkpointing can be done without any COW tax, by leveraging
> the native hardware dirty bit in page tables. There is no need to
> trigger additional checkpoints. Tracking modified pages using the
s/checkpoints/faults/
Cheers,
-Matt Helsley
On Fri, Nov 05, 2010 at 10:28:09AM +0100, Tejun Heo wrote:
> Hello,
>
> On 11/04/2010 05:44 PM, Gene Cooperman wrote:
> >>> In our personal view, a key difference between in-kernel and userland
> >>> approaches is the issue of security.
> >>
> >> That's an interesting point but I don't think it's a dealbreaker.
> >> ... but it's not like CR is gonna be deployed on
> >> majority of desktops and servers (if so, let's talk about it then).
> >
> > This is a good point to clarify some issues. C/R has several good
> > targets. For example, BLCR has targeted HPC batch facilities, and
> > does it well.
> >
> > DMTCP started life on the desktop, and it's still a primary focus of
> > DMTCP. We worked to support screen on this release precisely so
> > that advanced desktop users have the option of putting their whole
> > screen session under checkpoint control. It complements the core
> > goal of screen: If you walk away from a terminal, you can get back
> > the session elsewhere. If your session crashes, you can get back
> > the session elsewhere (depending on where you save the checkpoint
> > files, of course :-) ).
>
> Call me skeptical but I still don't see, yet, it being a mainstream
> thing (for average sysadmin John and proverbial aunt Tilly). It
> definitely is useful for many different use cases tho. Hey, but let's
> see.
Rightly so. It hasn't been widely proven as something that distros
would be willing to integrate into a normal desktop session. We've got
some demos of it working with VNC, twm, and vim. Oren has his own VNC,
twm, etc demos too. We haven't looked very closely at more advanced
desktop sessions like (in no particular order) KDE or Gnome. Nor have
we yet looked at working with any portions of X that were meant to provide
this but were never popular enough to do so (XSMP iirc).
Does DMTCP handle KDE/Gnome sessions? X too?
On the kernel side of things for the desktop, right now we think our
biggest obstacle is inotify. I've been working on kernel patches for
kernel-cr to do that and it seems fairly do-able. Does DMTCP handle
restarting inotify watches without dropping events that were present
during checkpoint?
The other problem for kernel c/r of X is likely to be DRM. Since the
different graphics chipsets vary so widely there's nothing we can do
to migrate DRM state of an NVIDIA chipset to DRM state of an ATI chipset
as far as I know. Perhaps if that would help hybrid graphics systems
then it's something that could be common between DRM and
checkpoint/restart but it's very much pie-in-the-sky at the moment.
kernel c/r of input devices might be a lot easier. We just simulate
hot [un]plug of the devices and rely on X responding. We can even
checkpoint the events X would have missed and deliver them prior to hot
unplug.
Also, how does DMTCP handle unlinked files? They are important because
lots of processes open a file in /tmp and then unlink it. And that's not
even the most difficult case to deal with. How does DMTCP handle:
link a to b
open a (stays open)
rm a
<checkpoint and restart>
open b
write to b
read from a (the write must appear)
?
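In code -- taking "link a to b" as a hard link -- that sequence is
roughly the following (paths illustrative, error checks omitted):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[6] = "";
    int fda, fdb;

    close(creat("a", 0600));
    link("a", "b");               /* a and b: one inode, two names */
    fda = open("a", O_RDONLY);    /* stays open */
    unlink("a");                  /* rm a: name gone, inode lives */

    /* <checkpoint and restart would happen here> */

    fdb = open("b", O_WRONLY);
    write(fdb, "hello", 5);       /* write to b */
    read(fda, buf, 5);            /* read from a: must see "hello" */
    printf("%s\n", buf);
    return 0;
}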
>
> > These are also some excellent points for discussion! The manager thread
> > is visible. For example, if you run a gdb session under checkpoint
> > control (only available in our unstable branch, currently), then
> > the gdb session will indeed see the checkpoint manager thread.
>
> I don't think gdb seeing it is a big deal as long as it's hidden from
> the application itself.
Is the checkpoint control process hidden from the application? What
happens if it gets killed or dies in the middle of checkpoint? Can
a malicious task being checkpointed (perhaps for later analysis)
kill it? Or perhaps it runs as root or a user with special capabilities?
>
> > We try to hide the reserved signal (SIGUSR2 by default, but the user
Mess.
> > can configure it to anything else). We put wrappers around system
> > calls that might see our signal handler, but I'm sure there are
> > cases where we might not succeed --- and so a skilled user would
> > have to configure to use a different signal handler. And of course,
> > there is the rare application that repeatedly resets _every_ signal.
> > We encountered this in an earlier version of Maple, and the Maple
> > developers worked with us to open up a hole so that we could
> > checkpoint Maple in future versions.
> >
> >> [while] all programs should be ready to handle -EINTR failure from system
> >> calls, it's something which is very difficult to verify and test and
> >> could lead to once-in-a-blue-moon head scratchy kind of failures.
> >
> > Exactly right! Excellent point. Perhaps this gets down to
> > philosophy, and what is the nature of a bug. :-) In some cases, we
> > have encountered this issue. Our solution was either to refuse to
> > checkpoint within certain system calls, or to check the return value
> > and if there was an -EINTR, then we would re-execute the system
> > call. This works again, because we are using wrappers around many
> > (but not all) of the system calls.
>
> I'm probably missing something but can't you stop the application
> using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry
Wouldn't checkpoint and gdb interfere then since the kernel only allows
one task to attach? So if DMTCP is checkpointing something and uses this
solution then you can't debug it. If a user is debugging their process then
DMTCP can't checkpoint it.
> about -EINTR failures (there are some exceptions but nothing really to
> worry about). Also, unless the manager thread needs to be always
> online, you can inject manager thread by manipulating the target
> process states while taking a snapshot.
Ugh. Frankly it sounds like we're being asked to pin our hopes on
a house of cards -- weird userspace hacks involving extra
processes, hodge-podge combinations of ptrace, LD_PRELOAD, signal
hijacking, brk hacks, scanning passes in /proc (possibly at numerous
times which begs for races), etc.
When all is said and done, my suspicion is all of it will be a mess
that shows races which none of the [added] kernel interfaces can fix.
In contrast, kernel-based cr is rather straight forward when you bother
to read the patches. It doesn't require using combinations of obscure
userspace interfaces to intercept and emulate those very same interfaces.
It doesn't add a scattered set of new ABIs. And any races would be in
a syscall where they could likely be fixed without adding yet-more ABIs
all over the place.
> > But since you ask :-), there is one thing on our wish list. We
> > handle address space randomization, vdso, vsyscall, and so on quite
> > well. We do not turn off address space randomization (although on
> > restart, we map user segments back to their original addresses).
> > Probably the randomized value of brk (end-of-data or end of heap) is
> > the thing that gave us the most troubles and that's where the code
> > is the most hairy.
>
> Can you please elaborate a bit? What do you want to see changed?
>
> > The implementation is reasonably modularized. In the rush to
> > address bugs or feature requirements of users, we sometimes cut
> > corners. We intend to go back and fix those things. Roughly, the
> > architecture of DMTCP is to do things in two layers: MTCP handles a
> > single multi-threaded process. There is a separate library mtcp.so.
> > The higher layer (redundantly again called DMTCP) is implemented in
> > dmtcphijack.so. In a _very_ rough kind of way, MTCP does a lot of
> > what would be done within kernel C/R. But the higher DMTCP layer
> > takes on some of those responsibilities in places. For example,
> > DMTCP does part of analyzing the pseudo-ttys, since it's not always
> > easy to ensure that it's the controlling terminal of some process
> > that can checkpoint things in the MTCP layer.
> >
> > Beyond that, the wrappers around system calls are essentially
> > perfectly modular. Some system calls go together to support a
> > single kernel feature, and those wrappers are kept in a common file.
>
> I see. I just thought that it would be helpful to have the core part
> - which does per-process checkpointing and restoring and corresponds
> to the features implemented by in-kernel CR - as a separate thing. It
> already sounds like that is mostly the case.
>
> I don't have much idea about the scope of the whole thing, so please
> feel free to hammer senses into me if I go off track. From what I
> read, it seems like once the target process is stopped, dmtcp is able
> to get most information necessary from kernel via /proc and other
> methods but the paper says that it needs to intercept socket related
> calls to gather enough information to recreate them later. I'm
> curious what's missing from the current /proc. You can map socket to
> inode from /proc/*/fd which can be matched to an entry in
> /proc/*/net/PROTO to find out the addresses and most socket options
> should be readable via getsockopt. Am I missing something?
>
> I think this is why userland CR implementation makes much more sense.
One foreseeable future is nested containers. How will this house of cards
work if we wish to checkpoint a container that is itself performing a
checkpoint? We've thought about the nested container case and designed
our interfaces so that they won't change for that case.
What happens if any of these new interfaces get used for non-checkpoint
purposes and then we wish to checkpoint those tasks? Will we need any
more interfaces for that? We definitely don't want to wind up with an
ABI that looks like a Russian Doll.
> Most of states visible to a userland process are rather rigidly
> defined by standards and, ultimately, ABI and the kernel exports most
> of those information to userland one way or the other. Given the
> right set of needed features, most of which are probably already
> implemented, a userland implementation should have access to most
> information necessary to checkpoint without resorting to too messy
So you agree it will be a mess (Just not "too messy"). I have no
idea what you think "too messy" is, but given all the stuff proposed
so far I'd say you've reached that point already.
> methods and then there inevitably needs to be some workarounds to make
> CR'd processes behave properly w.r.t. other states on the system, so
> userland workarounds are inevitable anyway unless it resorts to
> preemptive separation using namespaces and containers, which I frankly
Huh? I am not sure what you mean by "preemptive separation using
namespaces and containers".
Cheers,
-Matt Helsley
On Thu, Nov 04, 2010 at 03:45:37PM -0500, Nathan Lynch wrote:
> On Thu, 2010-11-04 at 08:36 +0100, Tejun Heo wrote:
> > Hello,
> >
> > On 11/04/2010 02:47 AM, Nathan Lynch wrote:
> > >> In this case whitelisting the allowed
> > >> state by requiring special APIs for all I/O (or even just standard
> > >> APIs as long as they are supported by the C/R lib you're linked against)
> > >> is the more pragmatic, and I think faithful aproach.
> > >
> > > I don't think users will go for it. They'll continue to use dodgy
> > > out-of-tree kernel modules and/or LD_PRELOAD hacks instead of porting
> > > their applications to a new library. I think a C/R library is an
> > > "ideal" solution, but it's one that nobody would use - especially in
> > > HPC, unless the library somehow provides better performance.
> >
> > I hear that there are plans to integrate one of the userland
> > snapshotting implementations with HPC workload manager. ISTR the
> > combination to be condor + dmtcp but not sure. I think things like
> > that make a lot of sense.
>
> If you look at the C/R implementations of those two projects you'll see
> that they don't implement what I take to be hch's suggestion - a library
> or platform with special-purpose APIs to which applications are ported
> in order to gain C/R ability. For all their good points, the projects
And even if they did, I don't think asking application developers to use
such a broad API -- one that requires special APIs for all I/O -- is
practical for many of the purposes outlined at kernel summit.
I think DMTCP is better off for not attempting to mandate such APIs.
How rare is it for an application or library to change the underlying
APIs it uses? How many applications have been ported say from Gnome to
KDE (or vice-versa) over the lifetime of the project? Relative to all
the other applications? I would hazard a guess that most were rewritten
rather than ported and that those that were ported are an utterly
insignificant fraction of what's out there.
It's much better to offer tools that, as much as possible, don't care
which APIs the applications use.
Cheers,
-Matt Helsley
On Thu, Nov 04, 2010 at 10:43:15AM +0100, Tejun Heo wrote:
<snip>
>
> I'm afraid that's not general or transparent at all. It's extremely
> invasive to how a system is setup and used. It basically is poor
> man's virtualization or rather partitioning without hardware support
> and at this point I find it very difficult to justify the added
> complexity. Let's just make virtualization better.
<snip>
> I'm sorry to be in this position but the trade off just seems way off.
> As I wrote earlier, the transparent part of in-kernel CR basically
> boils down to implementing pseudo virtualization without hardware
> support and given the not-too-glorious history of that and the much
> higher focus on proper virtualization these days, I just don't think
> it makes much sense. It's an extremely niche solution for niche use
If you think specialized hardware acceleration is necessary for
containers then perhaps you have a poor understanding of what a container
is. Chances are if you're running a container with namespaces configured
then you're already paying the performance costs of running in a
container. If you've compared the performance of that kernel to your
virtualization hardware then you already know how they compare.
For containers everything is native. You're not emulating instructions.
You're not running most instructions and trapping some. You're not
running whole other kernels, coordinating sharing of pages and cpu
with those kernels, etc. You're not emulating devices, busses,
interrupts, etc. And you're also not then circumventing every
virtualization mechanism you just added in order to provide decent
performance.
I rather doubt you'll see a difference between "native" hardware and...
native hardware. And I expect you'll see much better performance in one of
your containers than you'll ever see in some hand-waved
hypothetically-improved virtualization that your response implored us to
work on instead.
Our checkpoint/restart patches do *NOT* implement containers. They
sometimes work with containers to make use of checkpoint/restart simple.
In fact they are the strategy we use to enable "generic"
checkpoint/restart that you seem to think we lack. Everything else is
an optimization choice that we give userspace which virtualization
notably lacks.
Like above, I expect that your virtualization hardware will compare
unfavorably to kernel-based checkpoint/restart of containers. Imagine
checkpointing "ls" or "sleep 10" in a VM. Then imagine doing so for a
container. It takes way less time and way less disk for the container.
(It's also going to be easier to manage since you won't have to do
lots of special steps to get at the information in a container which is
shutdown or even one that's running. If "mycontainer" is running then
simply do:
lxc-attach -n mycontainer /bin/bash
Alternately, you can go through all the effort you normally do for
a VM -- set up a serial console, setup getty, setup sshd, etc. I don't
care -- it's more complicated than the above commandline.)
So please stop asserting that a purported lack of hardware support
is significant. Also please remember that we're not implementing containers
in this patch set -- they're already in.
Yes, our patches touch a wide variety of kernel code. You have just failed
to appreciate how "wide" the kernel ABI truly is. You can't really count
it by number of syscalls, number of pseudo-filesystems, etc. There's
also the intended behavior of those interfaces to consider. Each piece
of checkpoint/restart code is relatively self-contained. This can be
confirmed merely by looking at many of the patches we've already posted
enabling checkpoint/restart of that feature. Until you've tried to
implement checkpoint/restart for an interface or until you've bothered
to review a patch for one of them (my favorite one is eventfd:
http://www.mail-archive.com/[email protected]/msg21565.html ) please
don't tell us it's too complex. Then compare that with your proposed
ghastly stack of userspace cards -- ptrace (really more like strace) +
LD_PRELOAD + a daemon...
Incidentally, 20k lines of code is less than many pieces of the kernel.
It's less than many:
Filesystems (I've selected ones designed for rotating media or networks usually..)
ext4, nfs, ocfs2, xfs, reiserfs, ntfs, gfs2, jfs, cifs, ubifs, nilfs2, btrfs
Non-filesystem file-system support code:
nfsd, nls
It's less than one of the simpler DRM graphics drivers -- i915:
$ cd drivers/gpu/drm/i915
$ wc -l *.[ch]
...
41481 total
It's less than any one of the lpfc, bfa, aic7xxx, qla2xxx, and mpt2sas
drivers I see under scsi. Perhaps a fairer comparison would be a single
driver against a single checkpointable kernel interface, but that
comparison skews even more in our favor.
Yes, when you *add it all up* it's more than half the size of the kernel/
directory. Bear in mind that the portions we add to kernel/checkpoint though
are only 4603 lines long -- about the same size as many kernel/*.c files.
The rest is for each kernel interface that adds/manipulates state we need to
be able to checkpoint. Or arch code.. etc.
So please don't base your assessment of our code on your apparently
flawed notion of containers nor on the summary line of a diffstat you saw.
Cheers,
-Matt Helsley
Hello,
On 11/06/2010 12:18 AM, Oren Laadan wrote:
>> I'm probably missing something but can't you stop the application
>> using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry
>> about -EINTR failures (there are some exceptions but nothing really to
>> worry about). Also, unless the manager thread needs to be always
>> online, you can inject manager thread by manipulating the target
>> process states while taking a snapshot.
>
> This is an excellent example to demonstrate several points:
>
> * To freeze the processes, you can use (quote) "hairy" signal
> overload mechanism, or even more hairy ptrace; both, by the way, have
> performance problems with many processes/threads. Or you can
> use the in-kernel freezer-cgroup, and forget about workarounds, like
> linux-cr does. And ~200 lines in said diff are dedicated exactly to
> that.
>
> * Then, because both the workaround and the entire philosophy
> of MTCP c/r engine is that affected processes _participate_ in
> the checkpoint, their syscalls _must_ be interrupted. By contrast,
> linux-cr kernel approach allows not only to checkpoint processes
> without collaboration, but also builds on the native signal
> handling kernel code to restart the system calls (both after
> unfreeze, and after restart), such that the original process
> does not observe -EINTR.
The above problems can be solved for userland C/R with small
self-contained modifications to a small part of the kernel. You're
insisting that because currently some obscure corner cases aren't
handled, the whole thing should be shoved in the kernel and the kernel
should be serializing and deserializing its internal data structures
for everything visible in the userland. That's silly at best. Note
the "visible in the userland" part. Most of those parts are already
discoverable without further modifications to kernel. The only sane
approach would be add missing pieces which would not only benefit CR
but other applications too.
Also, you said the patches didn't have to change much because the data
structures facing userland didn't change much over different kernel
versions, which of course is true as it's so close to the userland
visible ABI. That is _NOT_ a selling point for kernel CR. That's a
BIG GLOWING SIGN telling you that you're on the frigging wrong side of
the wall.
> BTW, a real security expert (and I'm not one...) may argue that
> this operation should only be allowed to privileged users. In fact,
> if your code gets around the linux ASLR mechanisms, then someone
> should fix the kernel ASLR code :)
ASLR is to protect a program from itself not from outside. If you can
ptrace a process, ASLR doesn't mean a thing.
>> I see. I just thought that it would be helpful to have the core part
>> - which does per-process checkpointing and restoring and corresponds
>> to the features implemented by in-kernel CR - as a separate thing. It
>> already sounds like that is mostly the case.
>
> FWIW, the restart portion of linux-cr is designed with this in
> mind - it is flexible enough to accommodate smart userspace
> tools and wrappers that wish to muck with the processes and
> their resources post-restart (but before the processes resume
> execution). For example, a distributed checkpoint tool could,
> at restart time, reestablish the necessary network connections
> (which is much different than live migration of connections,
> and clearly not a kernel task). This way, it is trivial to migrate
> a distributed application from one set of hosts to another, on
> different networks, with very little effort.
Yeap, that was the reason why I asked how modularized that part of
dmtcp was as it would directly compare with the in-kernel
implementation. If they can be well separated, I think it would even
be possible to switch between the two while keeping the upper set of
workarounds the same.
>> I don't have much idea about the scope of the whole thing, so please
>> feel free to hammer senses into me if I go off track. From what I
>> read, it seems like once the target process is stopped, dmtcp is able
>> to get most information necessary from kernel via /proc and other
>> methods but the paper says that it needs to intercept socket related
>> calls to gather enough information to recreate them later. I'm
>> curious what's missing from the current /proc. You can map socket to
>> inode from /proc/*/fd which can be matched to an entry in
>> /proc/*/net/PROTO to find out the addresses and most socket options
>> should be readable via getsockopt. Am I missing something?
>
> So you'll need mechanisms not only to read the data at checkpoint
> time but also to reinstate the data at restart time. By the time
> you are done, the kernel will have all the c/r code (the suspect
> diff in question _and_ the rest of the logic) in the form of new
> interfaces and ABIs to userspace...; the userspace code will grow
> some more hair; and there will be zero maintainability gain. And at
> the same time you won't be able to leverage optimizations only
> possible in the kernel.
Unfortunately, for most things which matter, everything is already in
place and if you just concentrate on the core part the hackiness seems
quite manageable and I think it wouldn't be too difficult to reduce it
further. I don't see why userland implementation wouldn't be able to
snapshot any random process without LD_PRELOADs or whatever
cooperation from it. And, if the COW thing is so important, we can
collect the information and export it to userland via proc or
ringbuffer. That's what qemu-kvm would need anyway, right? I don't
think kvm guys would be so crazy as putting the whole snapshotter into
the kernel.
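(For example, the socket-to-inode mapping mentioned above is all of a
readlink away; the inode it prints is the key into the inode column of
/proc/<pid>/net/tcp and friends:)

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sk = socket(AF_INET, SOCK_STREAM, 0);
    char path[64], target[128];

    snprintf(path, sizeof(path), "/proc/self/fd/%d", sk);
    ssize_t n = readlink(path, target, sizeof(target) - 1);
    if (n > 0) {
        target[n] = '\0';
        printf("%s -> %s\n", path, target);  /* e.g. socket:[48163] */
    }
    return 0;
}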
> To be precise, there are three types of userland workarounds:
>
> 1) userland workarounds to make a restarted application work when
> peer processes aren't saved - e.g., in distributed checkpoint you
> need a workaround to rebuild the socket to the peer; or in his
> example with the 'nscd' daemon from earlier in the thread.
>
> These are needed regardless of the c/r engine of choice. In many
> cases they can be avoided if applications are run in containers.
> (which can be as simple as running a program using 'nohup')
>
> 2) userland workarounds to duplicate virtualization logic already
> done by the kernel - like the userspace pid-namespace and the
> complex logic and hacks needed to make it work. This is completely
> unnecessary when you do kernel c/r.
No, that's primarily not a feature of kernel CR. It's of namespaces
and containers.
> 3) userland workarounds to compensate for the fact that userspace
> can't get or set some state during checkpoint or restart. For
> example, in the kernel it's trivial to track shared files. How
> would you tell, from userspace, if fd[0] of parent A and child B is
> the same file opened and then inherited, or the same filename
> opened twice individually? For files, it is possible to figure
> this out in user space, e.g. by intercepting and tracking all forks
> and all file operations (including passing fd's via AF_UNIX sockets).
Or, if it's a regular file, lseek() and see whether the offsets change
together, or, even better, just toggle O_NDELAY with fcntl.
> There are other hairy ways to do it, but not quite so for other
> resources.
If you think toggling O_NDELAY is hairy, let's add a noop flag bit or
export whatever via /proc/*/fdinfo. We already have all that stuff
for a reason.
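(The toggle probe amounts to something like the sketch below.
shares_description() is a made-up name, and O_NDELAY is just
O_NONBLOCK on Linux; status flags live in the shared struct file, so a
flip on one fd is visible through the other only if both refer to the
same open file description.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int shares_description(int fd1, int fd2)
{
    int f1 = fcntl(fd1, F_GETFL);
    int before = fcntl(fd2, F_GETFL);

    fcntl(fd1, F_SETFL, f1 ^ O_NONBLOCK);          /* flip on fd1 */
    int shared = (fcntl(fd2, F_GETFL) != before);  /* seen on fd2? */
    fcntl(fd1, F_SETFL, f1);                       /* restore */
    return shared;
}

int main(void)
{
    int d = dup(0);                            /* shares stdin's description */
    printf("%d\n", shares_description(0, d));  /* prints 1 */
    return 0;
}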
> As another example, consider SIDs and PGIDs. With proper algorithms
> you can ensure that your processes get the right SID at fork time.
> But in the general case, you can't reproduce PGIDs accurately
> without replaying how the processes (including those that had died
> already) behaved.
>
> And to track zombies at checkpoint, you'd need to actually collect
> them, so you must do it in a hairy wrapper, and keep the secret
> until the application calls wait(). But then, there may be some
> side effects due to collecting zombies, e.g. the pid may be reused
> against the application's expectation.
>
> Some of these have workarounds, some not. Do you really think that
> re-implementing linux and namespaces in userspace is the way to go ?
No, I think you're blowing corner cases, which are in Siberia-cold
paths, way out of proportion. None of the above justifies putting the
whole thing in the kernel. Solve each problem with local solutions.
You're basically doing the same thing with in-kernel implementation,
the only difference being you side stepping ABI issues by saying that
kernel CR format would stay _mostly_ stable and what changes would be
dealt with from userland tools. Everything visible from usual
userland applications should be (and is for the most part) defined by
ABI. And if every state worthy of saving is well defined and visible
from userland, there's no reason to do it from kernel.
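(To be concrete about the quoted PGID example: yes, the constraint is
real -- a restarter cannot simply setpgid() into a process group whose
members are all gone, as the little demo below shows -- but that is
exactly the kind of cold-path corner case I mean.)

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t leader = fork();
    if (leader == 0) {
        setpgid(0, 0);   /* become leader of a new process group */
        _exit(0);        /* ... then die, leaving the group empty */
    }
    waitpid(leader, NULL, 0);

    if (setpgid(0, leader) < 0)            /* try to join the dead group */
        perror("setpgid into dead group"); /* fails with EPERM */
    return 0;
}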
> Then, you can add to the kernel endless amount of interfaces to
> export all of this - both data, and the functionality to re-instate
> this data at checkpoint. But ... wait -- isn't that what linux-cr
> already does ?
I wish that were what linux-cr did. It unfortunately serializes and
de-serializes in-kernel data structures which are already mostly
visible from userland instead of hunting down and improving missing
pieces.
>> preemptive separation using namespaces and containers, which I frankly
>> think isn't much of value already and more so going forward.
>
> That is one opinion. Then there are people using VPSs in commercial
> and private environments, for example.
>
> VMs are a wonderful (re)invention. Regardless of any one single
> person's opinion about VMs vs containers, both are here to stay, and both
> have their use-cases and users. IMHO, it is wrong to ignore the
> need for c/r and migration capabilities for containers, whether
> they run full desktop environments, multiple applications or single
> processes.
Sure, I'm not ignoring them. I'm just saying in-kernel CR doesn't
make a good trade off with its limited benefits and extensive
complexity all across the kernel, and the reason why its benefits are
limited is because it's sandwiched pretty tightly between userland CR
and proper virtualization. Moreover, the space in-kernel CR tries
occupy is getting smaller day by day. It just can't justify its
complexity.
Thanks.
--
tejun
Hello,
On 11/06/2010 11:12 AM, Matt Helsley wrote:
> If you think specialized hardware acceleration is necessary for
> containers then perhaps you have a poor understanding of what a container
> is. Chances are if you're running a container with namespaces configured
> then you're already paying the performance costs of running in a
> container. If you've compared the performance of that kernel to your
> virtualization hardware then you already know how they compare.
I was talking about virtualization when referring to hardware support.
> So please stop asserting that a purported lack of hardware support
> is significant. Also please remember that we're not implementing containers
> in this patch set -- they're already in.
Sure, that was my point. So, let's drop the handwaving about being
transparent.
> Incidentally, 20k lines of code is less than many pieces of the kernel.
> It's less than many:
>
> Filesystems (I've selected ones designed for rotating media or networks usually..)
> ext4, nfs, ocfs2, xfs, reiserfs, ntfs, gfs2, jfs, cifs, ubifs, nilfs2, btrfs
>
> Non-filesystem file-system support code:
> nfsd, nls
>
> It's less than one of the simpler DRM graphics drivers -- i915:
> $ cd drivers/gpu/drm/i915
> $ wc -l *.[ch]
> ...
> 41481 total
>
> It's less than any one of the lpfc, bfa, aic7xxx, qla2xxx, and mpt2sas
> drivers I see under scsi. Perhaps a fairer comparison would be a single
> driver against a single checkpointable kernel interface, but that
> comparison skews even more in our favor.
Yeah, and imagine what people would say if ext4, or heaven forbid,
aic7xxx code was scattered all over the kernel.
> Yes, when you *add it all up* it's more than half the size of the kernel/
> directory. Bear in mind that the portions we add to kernel/checkpoint though
> are only 4603 lines long -- about the same size as many kernel/*.c files.
> The rest is for each kernel interface that adds/manipulates state we need to
> be able to checkpoint. Or arch code.. etc.
>
> So please don't base your assessment of our code on your apparently
> flawed notion of containers nor on the summary line of a diffstat
> you saw.
I don't believe my notion of containers was or is flawed and already
said that the diffstat per se didn't look too bad. With enough
benefits, I wouldn't be opposed against the rather invasive changes.
It's just that the whole thing is conceived backwards and there are
already working alternatives which may be somewhat messy now but
nevertheless achieve about the same effect without the craziness of
serializing in-kernel data structures which are already mostly visible
to userland to begin with.
Thanks.
--
tejun
On 11/06/2010 01:32 AM, Matt Helsley wrote:
> On Fri, Nov 05, 2010 at 10:28:09AM +0100, Tejun Heo wrote:
>> Hello,
>>
>> On 11/04/2010 05:44 PM, Gene Cooperman wrote:
>>>>> In our personal view, a key difference between in-kernel and userland
>>>>> approaches is the issue of security.
>>>>
>>>> That's an interesting point but I don't think it's a dealbreaker.
>>>> ... but it's not like CR is gonna be deployed on
>>>> majority of desktops and servers (if so, let's talk about it then).
>>>
>>> This is a good point to clarify some issues. C/R has several good
>>> targets. For example, BLCR has targeted HPC batch facilities, and
>>> does it well.
>>>
>>> DMTCP started life on the desktop, and it's still a primary focus of
>>> DMTCP. We worked to support screen on this release precisely so
>>> that advanced desktop users have the option of putting their whole
>>> screen session under checkpoint control. It complements the core
>>> goal of screen: If you walk away from a terminal, you can get back
>>> the session elsewhere. If your session crashes, you can get back
>>> the session elsewhere (depending on where you save the checkpoint
>>> files, of course :-) ).
>>
>> Call me skeptical but I still don't see, yet, it being a mainstream
>> thing (for average sysadmin John and proverbial aunt Tilly). It
>> definitely is useful for many different use cases tho. Hey, but let's
>> see.
>
> Rightly so. It hasn't been widely proven as something that distros
> would be willing to integrate into a normal desktop session. We've got
> some demos of it working with VNC, twm, and vim. Oren has his own VNC,
> twm, etc demos too. We haven't looked very closely at more advanced
> desktop sessions like (in no particular order) KDE or Gnome. Nor have
> we yet looked at working with any portions of X that were meant to provide
> this but were never popular enough to do so (XSMP iirc).
Actually, I do have a demo of Zap (linux-cr predecessor) with a _full_
gnome desktop running under VNC with:
* a movie player,
* firefox,
* thunderbird,
* openoffice,
* kernel make,
* gdb debugging something,
* WINE with microsoft office (oops)
all of these checkpointed with < 25ms of downtime and resumed an
arbitrary time later, successfully.
I even have witnesses that saw it ;)
>
> Does DMTCP handle KDE/Gnome sessions? X too?
>
> On the kernel side of things for the desktop, right now we think our
> biggest obstacle is inotify. I've been working on kernel patches for
> kernel-cr to do that and it seems fairly do-able. Does DMTCP handle
> restarting inotify watches without dropping events that were present
> during checkpoint?
>
At the very least userspace would need to interpose on all
inotify-related syscalls to track (log) what the user did, to
be able to redo it at restart. (And I'm sure there will be
crazy-to-impossible races and corner cases there.)
Does it make sense to replicate in userspace everything already done
in the kernel ?
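(Concretely, the interposition would be an LD_PRELOAD wrapper along
these lines -- a rough sketch that logs each watch so a restarter
could re-issue it. The events that arrive between checkpoint and
restart are exactly what such a log cannot recover.)

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/inotify.h>

int inotify_add_watch(int fd, const char *path, uint32_t mask)
{
    static int (*real)(int, const char *, uint32_t);
    if (!real)
        real = dlsym(RTLD_NEXT, "inotify_add_watch");

    int wd = real(fd, path, mask);
    /* Log enough to replay the watch at restart time. */
    fprintf(stderr, "ckpt-log: fd=%d wd=%d mask=%#x path=%s\n",
            fd, wd, mask, path);
    return wd;
}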
> The other problem for kernel c/r of X is likely to be DRM. Since the
> different graphics chipsets vary so widely there's nothing we can do
> to migrate DRM state of an NVIDIA chipset to DRM state of an ATI chipset
> as far as I know. Perhaps if that would help hybrid graphics systems
> then it's something that could be common between DRM and
> checkpoint/restart but it's very much pie-in-the-sky at the moment.
DRM is hardware, and is complex for both userspace and kernel. Let's
assume it isn't supported until it's properly virtualized.
(In the long-long run, I'd envision hardware manufacturers providing
c/r support within their drivers - e.g. checkpoint() and restart()
kernel methods. But that's only if they care about it, and in any
event, pretty far down the road...)
> kernel c/r of input devices might be a lot easier. We just simulate
> hot [un]plug of the devices and rely on X responding. We can even
> checkpoint the events X would have missed and deliver them prior to hot
> unplug.
>
[snip]
Oren.
By the way, Oren, Kapil and I are hoping to find time in the next few
days to talk offline. Apparently the Linux C/R and DMTCP projects had
continued for some years unaware of each other. We appreciate that a
huge amount of work has gone into both approaches, and so we'd like to
reap the benefit of the experience behind each. We're still learning
I can the questions that Matt brings up. Since Matt brings up _lots_
of questions, and I add my own topics, I thought it best to add a table
of contents to this e-mail. For each topic, you'll see a discussion
inline below.
1. Distros, checkpointing a desktop, KDE/Gnome, X
[ Trying to answer Matt's question ]
2. Directly checkpointing a single X11 app
[ Our own preferred approach, as opposed to checkpointing an entire desktop;
This is easy, but we just haven't had the time lately. I estimate
the time to do it is about one person working straight out for two weeks
or so. But who has that much spare time? :-) ]
3. OpenGL
[ Checkpointing OpenGL would be a really big win. We don't know the
right way, but we're looking. Do you have some thoughts on that? Thanks.]
4. inotify and NSCD
[ We try to virtualize a single app, instead of also checkpointing
inotify and NSCD themselves. It would have been interesting to consider
checkpointing them in userland, but that would require root privilege,
and one core design principle we have, is that all of our C/R is
completely unprivileged. So, we would see distributing DMTCP as
a package in a distro, and letting individual users decide for
what computation they might want to use it. ]
5. Checkpointing DRM state and other graphics chip state
[ It comes down to virtualization around a single app versus checkpointing
_all_ of X. --- Two different approaches. ]
6. kernel c/r of input devices might be a lot easier
[ We agree with you. By virtualizing around a single app, we hope
to avoid this issue. ]
7. C/R for link/open/rm/open/write/read puzzle
8. What happens if the DMTCP coordinator (checkpoint control process) dies?
[ The same thing that happens if a user process dies. We kill the whole
computation, and restart. At restart, we use a new coordinator.
Coordinators are stateless. ]
9. We try to hide the reserved signal (SIGUSR2 by default) ...
[ Matt says this is a mess, but we note that glibc does this too. ]
10. checkpoint, gdb and PTRACE_ATTACH
[ DMTCP does not use PTRACE_ATTACH in its implementation. So, we can
and do fully support user processes that use PTRACE_ATTACH. ]
11. DMTCP, ABIs, can there be a race condition between the ckpt thread and
user threads of an app?
[ DMTCP doesn't introduce any new ABIs. There may be a misconception here.
If we can talk at length off-line, I could explain more about
the DMTCP design. Inline, I explain why race conditions should
not be an issue. ]
12. nested containers, ABIs, etc.
[ see inline comment ]
13. a userland implementation should have access to most
information necessary to checkpoint without resorting to too messy
[ In fact, the primary ABIs that we use outside of system calls
are /proc/*/maps and /proc/*/fd. Even here, we would have workarounds
if someone took those ABIs away. ]
The full range of comments is inline below. Sorry that this e-mail
is getting so long. There are many things to talk about. I hope to
later take advantage of the higher bandwidth with Oren (by phone)
to thrash out some of these things together.
Thanks,
- Gene
On Fri, Nov 05, 2010 at 10:32:04PM -0700, Matt Helsley wrote:
> On Fri, Nov 05, 2010 at 10:28:09AM +0100, Tejun Heo wrote:
> > Hello,
> >
> > On 11/04/2010 05:44 PM, Gene Cooperman wrote:
> > >>> In our personal view, a key difference between in-kernel and userland
> > >>> approaches is the issue of security.
> > >>
> > >> That's an interesting point but I don't think it's a dealbreaker.
> > >> ... but it's not like CR is gonna be deployed on
> > >> majority of desktops and servers (if so, let's talk about it then).
> > >
> > > This is a good point to clarify some issues. C/R has several good
> > > targets. For example, BLCR has targeted HPC batch facilities, and
> > > does it well.
> > >
> > > DMTCP started life on the desktop, and it's still a primary focus of
> > > DMTCP. We worked to support screen on this release precisely so
> > > that advanced desktop users have the option of putting their whole
> > > screen session under checkpoint control. It complements the core
> > > goal of screen: If you walk away from a terminal, you can get back
> > > the session elsewhere. If your session crashes, you can get back
> > > the session elsewhere (depending on where you save the checkpoint
> > > files, of course :-) ).
> >
> > Call me skeptical but I still don't see, yet, it being a mainstream
> > thing (for average sysadmin John and proverbial aunt Tilly). It
> > definitely is useful for many different use cases tho. Hey, but let's
> > see.
>
> Rightly so. It hasn't been widely proven as something that distros
> would be willing to integrate into a normal desktop session. We've got
> some demos of it working with VNC, twm, and vim. Oren has his own VNC,
> twm, etc demos too. We haven't looked very closely at more advanced
> desktop sessions like (in no particular order) KDE or Gnome. Nor have
> we yet looked at working with any portions of X that were meant to provide
> this but were never popular enough to do so (XSMP iirc).
>
> Does DMTCP handle KDE/Gnome sessions? X too?
1. Distros, checkpointing a desktop, KDE/Gnome, X
DMTCP does checkpoint VNC sessions with a desktop, KDE/Gnome, and X.
We were doing that in some joint work with SCIRun:
http://www.sci.utah.edu/cibc/software/106-scirun.html
SCIRun only works under X, and so it was an absolute prerequisite.
SCIRun optionally also likes to use OpenGL (3-D graphics). We had hacked
up something for OpenGL 1.5, and I write more on that below.
However, we agree with you that a distro would probably not want to run
C/R under their regular X session. If anything minor fails, it hurts their
reputation, which is everything for them. So, we think that's a non-starter.
The other possibility is to use C/R on a VNC session for an X desktop.
We also think that most users would not care for the extra complication
of having two desktops (one under checkpoint control, and the main one).
One can run an individual X11 application under VNC and checkpoint
the VNC. We can and _do_ do that. But it's still unsatisfying for us.
The heaviness and added complexity of checkpointing a VNC server makes
us nervous.
2. Directly checkpointing a single X11 app
So, as I said in a different post, we're planning to virtualize directly
around libX11.so and libxcb.so. Then we'll checkpoint the X11 graphic
application and _only_ the X11 graphic application.
We think that a really cool advantage of this approach is that
if you checkpoint the X11 app under Gnome, then you can bring it back
to life under KDE, and it will now have the look-and-feel of KDE.
Another advantage of this approach is that there's a single desktop
shared by all applications. If the X11 application wishes to use
dbus, a window manager, or whatever, to communicate with other X11 apps,
it can continue to do so. Our virtualization approach should work
well when interaction goes through a small enough library around
which we can place wrappers. The library can be libc.so, libX11.so,
or any of many other libraries.
This also seems more modular to us. A VNC server has to worry about
_lots_ of things, and we only need the connect/disconnect portion of
the VNC server. It's not hard to implement that directly in a small
library. Also, if we checkpoint fewer processes, the time to write to
disk is smaller.
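A first cut at such a wrapper might look like the sketch below. It
only records where the connection went, so that a restarter knows
which display to reconnect to; replaying the window and GC state --
the hard part -- is not shown.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <X11/Xlib.h>

Display *XOpenDisplay(const char *name)
{
    static Display *(*real)(const char *);
    if (!real)
        real = dlsym(RTLD_NEXT, "XOpenDisplay");

    Display *dpy = real(name);
    /* Record enough to re-open an equivalent connection at restart. */
    fprintf(stderr, "ckpt-log: display %s -> %p\n",
            name ? name : "(default)", (void *)dpy);
    return dpy;
}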
3. OpenGL
We had hacked up something for OpenGL 1.5 with the intention of supporting
SCIRun. It was based on the work of:
http://andres.lagarcavilla.com/publications/LagarCavillaVEE07.pdf
http://andres.lagarcavilla.com/vmgl/index.html
The problem was that OpenGL is growing and adding API calls
faster than one can virtualize them. :-) We didn't want to always
be chasing around to support the newest addition to OpenGL.
Have you also looked at checkpointing OpenGL? It's an interesting
question. Unfortunately, I doubt that the vendors will support C/R
in their video drivers, and so we're forced to look for a different
solution (or give up, and we don't like giving up :-) ).
> On the kernel side of things for the desktop, right now we think our
> biggest obstacle is inotify. I've been working on kernel patches for
> kernel-cr to do that and it seems fairly do-able. Does DMTCP handle
> restarting inotify watches without dropping events that were present
> during checkpoint?
4. inotify and NSCD
We have run into inotify. We don't try to checkpoint inotify itself.
Instead, as with X11 apps, our larger interest is in checkpointing
a single computation that might have been interacting with inotify,
and then being able to restart the single app and resume talking
with inotify. The situation is similar to that with NSCD (the Name
Service Cache Daemon).
If you wish to checkpoint a single application, and if it was talking
to NSCD, how do you handle that?
Is it that you always checkpoint both the app and the NSCD at the
same time?
If so, perhaps this is a key difference in the two approaches:
virtualize around a single app; or checkpoint _every_ process that
is interacting with the process of interest. But I'm just speculating,
and I need to talk more with you all to understand better.
> The other problem for kernel c/r of X is likely to be DRM. Since the
> different graphics chipsets vary so widely there's nothing we can do
> to migrate DRM state of an NVIDIA chipset to DRM state of an ATI chipset
> as far as I know. Perhaps if that would help hybrid graphics systems
> then it's something that could be common between DRM and
> checkpoint/restart but it's very much pie-in-the-sky at the moment.
5. Checkpointing DRM state and other graphics chip state
Again, this may come down to virtualization around a single
application versus checkpointing everything. We would try to avoid
the necessity of checkpointing graphics drivers, DRM issues, etc.,
through virtualization. As I wrote above, though, we don't yet have
a good virtualization solution when it comes to OpenGL. So, we're very
interested in any thoughts you have about handling OpenGL.
> kernel c/r of input devices might be a lot easier. We just simulate
> hot [un]plug of the devices and rely on X responding. We can even
> checkpoint the events X would have missed and deliver them prior to hot
> unplug.
6. kernel c/r of input devices might be a lot easier
I think I would agree. As indicated above, our philosophy is to virtualize
the single app, instead of "checkpointing the world", as one of
our team, Jason Ansel, used to like to say. :-) But this is not
to say that checkpointing the entire X with input devices isn't also
interesting. The two works are complementary.
> Also, how does DMTCP handle unlinked files? They are important because
> lots of processes open a file in /tmp and then unlink it. And that's not
> even the most difficult case to deal with. How does DMTCP handle:
>
> link a to b
> open a (stays open)
> rm a
> <checkpoint and restart>
> open b
> write to b
> read from a (the write must appear)
>
> ?
7. C/R for link/open/rm/open/write/read puzzle
We did have some similar issues like this in some of the apps we
looked at. For example, if my memory is right, in an app that works with
the NSCD daemon, it mmaps a shared file, and then unlinks the file so that
the file will be deleted when the app exits. Just to make sure that
everything is precise, would you mind writing a short app like that
and sending it to us? For example, I'm guessing the link is a symbolic
link, but the actual code will make it all precise. We'll directly perform
the experiment you propose and tell you the result.
I think the short story will be that we have a command-line option
by which the user specifies if they would like to checkpoint open
files. We also have heuristics to try to do the right thing when the
user didn't give us specific instructions on the command line.
The short answer is that we're driven by the use cases we encounter,
and we think of application coverage. You may be right that we don't
currently cover this, but I would like to try it first, and verify.
If you have an important use case for this scenario, we will definitely
add coverage for it.
Maybe this is another difference in philosophy. Oren talked about
full transparency --- meaning that the kernel will always present the
illusion of continuity to an app. Because we know the design of DMTCP,
we know of ways that a userland app could create weird cases where
the wrong things happen. When we discover an app that needs the weird
case, we expand our coverage through additional virtualization.
> > > These are also some excellent points for discussion! The manager thread
> > > is visible. For example, if you run a gdb session under checkpoint
> > > control (only available in our unstable branch, currently), then
> > > the gdb session will indeed see the checkpoint manager thread.
> >
> > I don't think gdb seeing it is a big deal as long as it's hidden from
> > the application itself.
>
> Is the checkpoint control process hidden from the application? What
> happens if it gets killed or dies in the middle of checkpoint? Can
> a malicious task being checkpointed (perhaps for later analysis)
> kill it? Or perhaps it runs as root or a user with special capabilities?
8. What happens if the DMTCP coordinator (checkpoint control process) dies
If the checkpoint control process dies, then the checkpoint manager thread
in the user app never hears from the coordinator again. The application
continues anyway without failing. But, it's no longer possible to
checkpoint that application. Again, I think it's a difference in
philosophy. We want to checkpoint a single app or computation.
If that computation loses _any_ of its processes (whether it's the
DMTCP coordinator process or one of the application processes itself),
then it's best to kill the computation and restart from the last
checkpoint image. Our DMTCP coordinator is stateless, and so it's
no problem to create a new DMTCP coordinator at the time of restart.
> > > We try to hide the reserved signal (SIGUSR2 by default, but the user
> Mess.
9. We try to hide the reserved signal (SIGUSR2 by default)
Beauty is in the eye of the beholder. :-) I remind you that libc
reserves SIGRTMIN and SIGRTMIN + 1 for thread cancellation and
for setxid, respectively. If reserving a signal is bad, then
libc.so is also a "Mess". In the glibc source, look at:
./nptl/pthreadP.h: #define SIGCANCEL __SIGRTMIN
./nptl/pthreadP.h: #define SIGSETXID (__SIGRTMIN + 1)
Probably glibc is even worse than us. They use the signal, and they
_don't_ hide it from the user. Userland is a messy place. :-)
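For concreteness, hiding the reserved signal amounts to a wrapper
roughly like the sketch below (not our actual code): keep the
checkpoint handler installed, and show the application SIG_DFL for
that one signal.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <signal.h>
#include <string.h>

#define CKPT_SIG SIGUSR2   /* DMTCP's default reserved signal */

int sigaction(int sig, const struct sigaction *act,
              struct sigaction *old)
{
    static int (*real)(int, const struct sigaction *,
                       struct sigaction *);
    if (!real)
        real = dlsym(RTLD_NEXT, "sigaction");

    if (sig == CKPT_SIG) {
        if (old) {               /* pretend the signal is unused */
            memset(old, 0, sizeof(*old));
            old->sa_handler = SIG_DFL;
        }
        (void)act;               /* silently drop the new handler */
        return 0;
    }
    return real(sig, act, old);
}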
> > > can configure it to anything else). We put wrappers around system
> > > calls that might see our signal handler, but I'm sure there are
> > > cases where we might not succeed --- and so a skilled user would
> > > have to configure to use a different signal handler. And of course,
> > > there is the rare application that repeatedly resets _every_ signal.
> > > We encountered this in an earlier version of Maple, and the Maple
> > > developers worked with us to open up a hole so that we could
> > > checkpoint Maple in future versions.
> > >
> > >> [while] all programs should be ready to handle -EINTR failure from system
> > >> calls, it's something which is very difficult to verify and test and
> > >> could lead to once-in-a-blue-moon head scratchy kind of failures.
> > >
> > > Exactly right! Excellent point. Perhaps this gets down to
> > > philosophy, and what is the nature of a bug. :-) In some cases, we
> > > have encountered this issue. Our solution was either to refuse to
> > > checkpoint within certain system calls, or to check the return value
> > > and if there was an -EINTR, then we would re-execute the system
> > > call. This works again, because we are using wrappers around many
> > > (but not all) of the system calls.
> >
> > I'm probably missing something but can't you stop the application
> > using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry
>
> Wouldn't checkpoint and gdb interfere then since the kernel only allows
> one task to attach? So if DMTCP is checkpointing something and uses this
> solution then you can't debug it. If a user is debugging their process then
> DMTCP can't checkpoint it.
10. checkpoint, gdb and PTRACE_ATTACH
As a design decision, DMTCP never traces a process. We did this so we
could easily checkpoint a gdb session without worrying about gdb and
DMTCP both trying to trace the gdb target process.
> > about -EINTR failures (there are some exceptions but nothing really to
> > worry about). Also, unless the manager thread needs to be always
> > online, you can inject manager thread by manipulating the target
> > process states while taking a snapshot.
>
> Ugh. Frankly it sounds like we're being asked to pin our hopes on
> a house of cards -- weird userspace hacks involving extra
> processes, hodge-podge combinations of ptrace, LD_PRELOAD, signal
> hijacking, brk hacks, scanning passes in /proc (possibly at numerous
> times which begs for races), etc.
>
> When all is said and done, my suspicion is all of it will be a mess
> that shows races which none of the [added] kernel interfaces can fix.
>
> In contrast, kernel-based cr is rather straight forward when you bother
> to read the patches. It doesn't require using combinations of obscure
> userspace interfaces to intercept and emulate those very same interfaces.
> It doesn't add a scattered set of new ABIs. And any races would be in
> a syscall where they could likely be fixed without adding yet-more ABIs
> all over the place.
11. DMTCP, ABIs, can there be a race condition between the ckpt thread and
user threads of an app?
DMTCP does not add any new ABIs. But maybe I misunderstood your point.
The only potential races I can see are between the checkpoint thread
and the user threads. But the checkpoint thread does nothing except
listen for a command from the coordinator. When the command comes,
it first quiesces the user threads, before doing anything.
All of those wrappers for virtualization that we refer to are executed
by the ordinary _user_ threads. The checkpoint thread is in a select
system call during that entire time.
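For concreteness, here is a minimal sketch of that pattern. All the
helper functions are hypothetical placeholders, not DMTCP's actual API:

    /* The checkpoint thread sleeps in select() on the coordinator
     * socket; user threads run completely untouched until a
     * checkpoint command actually arrives. */
    #include <sys/select.h>
    #include <unistd.h>

    extern void quiesce_user_threads(void);    /* hypothetical */
    extern void write_checkpoint_image(void);  /* hypothetical */
    extern void resume_user_threads(void);     /* hypothetical */

    static void checkpoint_thread_loop(int coord_fd)
    {
        for (;;) {
            fd_set rfds;
            char cmd;

            FD_ZERO(&rfds);
            FD_SET(coord_fd, &rfds);
            if (select(coord_fd + 1, &rfds, NULL, NULL, NULL) <= 0)
                continue;
            if (read(coord_fd, &cmd, 1) != 1)
                break;                     /* coordinator went away */
            if (cmd == 'c') {              /* checkpoint request */
                quiesce_user_threads();    /* stop everyone first */
                write_checkpoint_image();
                resume_user_threads();
            }
        }
    }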
> > > But since you ask :-), there is one thing on our wish list. We
> > > handle address space randomization, vdso, vsyscall, and so on quite
> > > well. We do not turn off address space randomization (although on
> > > restart, we map user segments back to their original addresses).
> > > Probably the randomized value of brk (end-of-data or end of heap) is
> > > the thing that gave us the most troubles and that's where the code
> > > is the most hairy.
> >
> > Can you please elaborate a bit? What do you want to see changed?
> >
> > > The implementation is reasonably modularized. In the rush to
> > > address bugs or feature requirements of users, we sometimes cut
> > > corners. We intend to go back and fix those things. Roughly, the
> > > architecture of DMTCP is to do things in two layers: MTCP handles a
> > > single multi-threaded process. There is a separate library mtcp.so.
> > > The higher layer (redundantly again called DMTCP) is implemented in
> > > dmtcphijack.so. In a _very_ rough kind of way, MTCP does a lot of
> > > what would be done within kernel C/R. But the higher DMTCP layer
> > > takes on some of those responsibilities in places. For example,
> > > DMTCP does part of analyzing the pseudo-ttys, since it's not always
> > > easy to ensure that it's the controlling terminal of some process
> > > that can checkpoint things in the MTCP layer.
> > >
> > > Beyond that, the wrappers around system calls are essentially
> > > perfectly modular. Some system calls go together to support a
> > > single kernel feature, and those wrappers are kept in a common file.
> >
> > I see. I just thought that it would be helpful to have the core part
> > - which does per-process checkpointing and restoring and corresponds
> > to the features implemented by in-kernel CR - as a separate thing. It
> > already sounds like that is mostly the case.
> >
> > I don't have much idea about the scope of the whole thing, so please
> > feel free to hammer senses into me if I go off track. From what I
> > read, it seems like once the target process is stopped, dmtcp is able
> > to get most information necessary from kernel via /proc and other
> > methods but the paper says that it needs to intercept socket related
> > calls to gather enough information to recreate them later. I'm
> > curious what's missing from the current /proc. You can map socket to
> > inode from /proc/*/fd which can be matched to an entry in
> > /proc/*/net/PROTO to find out the addresses and most socket options
> > should be readable via getsockopt. Am I missing something?
> >
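For reference, the mapping Tejun describes takes only a few lines;
socket_inode() below is an illustrative helper, not DMTCP code:

    /* readlink("/proc/PID/fd/N") yields "socket:[INO]"; the inode
     * INO can then be matched against the "inode" column of
     * /proc/PID/net/tcp (or udp, unix, ...) to recover addresses. */
    #include <stdio.h>
    #include <unistd.h>

    static long socket_inode(int pid, int fd)
    {
        char path[64], link[64];
        ssize_t n;
        long ino;

        snprintf(path, sizeof(path), "/proc/%d/fd/%d", pid, fd);
        n = readlink(path, link, sizeof(link) - 1);
        if (n < 0)
            return -1;
        link[n] = '\0';
        if (sscanf(link, "socket:[%ld]", &ino) != 1)
            return -1;                     /* not a socket fd */
        return ino;
    }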
> > I think this is why userland CR implementation makes much more sense.
>
> One foreseeable future is nested containers. How will this house of cards
> work if we wish to checkpoint a container that is itself performing a
> checkpoint? We've thought about the nested container case and designed
> our interfaces so that they won't change for that case.
>
> What happens if any of these new interfaces get used for non-checkpoint
> purposes and then we wish to checkpoint those tasks? Will we need any
> more interfaces for that? We definitely don't want to wind up with an
> ABI that looks like a Russian Doll.
12. nested containers, ABIs, etc.
I think we would need to elaborate with individual cases. But as I wrote
above, DMTCP and Linux C/R started with two different philosophies.
I'm not sure if you fully understood the DMTCP goals and philosophy yet,
but I hope my comments above help clarify it.
> > Most of states visible to a userland process are rather rigidly
> > defined by standards and, ultimately, ABI and the kernel exports most
> > of those information to userland one way or the other. Given the
> > right set of needed features, most of which are probably already
> > implemented, a userland implementation should have access to most
> > information necessary to checkpoint without resorting to too messy
>
> So you agree it will be a mess (Just not "too messy"). I have no
> idea what you think "too messy" is, but given all the stuff proposed
> so far I'd say you've reached that point already.
13. a userland implementation should have access to most
information necessary to checkpoint without resorting to too messy
If it helps, DMTCP began with Linux 2.6.3, and we continue to support
Linux 2.6.9. In fact, DMTCP seems to have uncovered a bug in Linux 2.6.9
and maybe in Linux 2.6.18, or perhaps in the NFS implementation on top
of it. We've experienced some reproducible O/S instability when doing C/R
in certain of those environments. :-) But we mostly use newer kernels
now, where the reliability is truly excellent.
Anyway, I suspect most of these ABIs and kernel exports that you mention
did not exist in Linux 2.6.9. We don't depend on them. The ABIs
that we use outside of system calls are /proc/*/maps and /proc/*/fd.
If those ABIs were taken away, we have other ways to virtualize
and get the information that we need.
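For example, enumerating a process's memory regions from /proc/*/maps
takes only a few lines (an illustrative sketch, not MTCP's actual code):

    #include <stdio.h>

    static void list_maps(int pid)
    {
        char path[64], line[256];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/maps", pid);
        f = fopen(path, "r");
        if (!f)
            return;
        /* Each line: "start-end perms offset dev inode path" */
        while (fgets(line, sizeof(line), f)) {
            unsigned long start, end;
            char perms[5];

            if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) == 3)
                printf("%lx-%lx %s\n", start, end, perms);
        }
        fclose(f);
    }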
> > methods and then there inevitably needs to be some workarounds to make
> > CR'd processes behave properly w.r.t. other states on the system, so
> > userland workarounds are inevitable anyway unless it resorts to
> > preemptive separation using namespaces and containers, which I frankly
>
> Huh? I am not sure what you mean by "preemptive separation using
> namespaces and containers".
>
> Cheers,
> -Matt Helsley
On 11/05/2010 01:17 PM, Gene Cooperman wrote:
> On Fri, Nov 05, 2010 at 04:57:33AM -0700, Luck, Tony wrote:
>>> Oren noted that sometimes it's important to stop the process only
>>> for a few milliseconds while one checkpoints. In DMTCP, we do that
>>> by configuring with --enable-forked-checkpointing. This causes us
>>> to fork a child process taking advantage of copy-on-write and then
>>> checkpoint the memory pages of the child while the parent continues
>>> to execute.
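(The forked-checkpointing pattern described above reduces to the
following sketch; dump_memory_to() is a hypothetical placeholder, not
actual DMTCP code:)

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern void dump_memory_to(const char *file);  /* hypothetical */

    static void forked_checkpoint(const char *file)
    {
        pid_t pid = fork();

        if (pid == 0) {            /* child: frozen COW snapshot */
            dump_memory_to(file);
            _exit(0);
        }
        /* Parent continues computing right away; the child should
         * be reaped promptly (here synchronously, for brevity). */
        if (pid > 0)
            waitpid(pid, NULL, 0);
    }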
>>
>> Interesting ... but while the process is only stopped for the duration
>> of the fork, it may be taking COW faults on almost every page it
>> touches. I think this will not work well for large HPC applications
>> that allocate most of physical memory as anonymous pages for the
>> application. It may even result in an OOM kill if you don't complete
>> the checkpoint of the child and have it exit in a timely manner.
>>
>> -Tony
>>
>
> I agree with you that forked checkpointing is probably not what you
> want in the middle of an HPC computation. But isn't that part of
> the nature of COW? Whether the COW is invoked within the kernel,
> or from outside the kernel via fork --- in either case, when you have
> mostly dirty pages, you will have to copy most of the pages.
> Do I understand your point correctly? Thanks,
> - Gene
COW is one way of reducing downtime (whether through fork or
in-kernel checkpoint). However, it is possible to avoid using
it (and thus avoid extra page faults and memory overload) by
using the page-table "dirty" bit to track dirty pages. This way
one can "pre-copy" the checkpoint image while the application is
running, without additional overhead (the idea is similar to how
live-migration is done).
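A high-level sketch of that pre-copy loop; every helper below is a
hypothetical placeholder, not a real kernel interface:

    extern void clear_dirty_bits(void);          /* hypothetical */
    extern unsigned long copy_dirty_pages(void); /* hypothetical */
    extern void freeze_tasks(void);              /* hypothetical */
    extern void thaw_tasks(void);                /* hypothetical */

    static void precopy_checkpoint(void)
    {
        unsigned long copied;

        clear_dirty_bits();
        do {
            /* Application keeps running during these passes. */
            copied = copy_dirty_pages();
        } while (copied > 1024);  /* until the dirty set converges */

        freeze_tasks();           /* short final stop-the-world pass */
        copy_dirty_pages();
        thaw_tasks();
    }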
Oren.
On 11/04/2010 11:55 PM, Kapil Arya wrote:
> (Sorry for the length of this email, we are excited about being able
> to discuss technical details.)
>
> It is wonderful to have this exchange of techniques and visions. Oren, we
> are guessing that you are at Columbia. If so, we would love to have you come up
> here and give a talk in Boston. Alternatively, if you prefer, we would be happy
> to go to Columbia and give a talk there.
With pleasure.
(LPC would have been a good opportunity - I was in Boston).
Oren.
On 11/06/2010 04:40 PM, Gene Cooperman wrote:
> By the way, Oren, Kapil and I are hoping to find time in the next few
> days to talk offline. Apparently the Linux C/R and DMTCP had continued
That was my understanding too. However, I also felt that I'd better
clarify a key point first.
> for some years unaware of each other. We appreciate that a huge amount
> of work has gone into both of the approaches, and so we'd like to reap
> the benefit of the experiences of the two approaches. We're still learning
> more about each others' approaches. Below, I'll try to answer as best
> I can the questions that Matt brings up. Since Matt brings up _lots_
> of questions, and I add my own topics, I thought it best to add a table
> of contents to this e-mail. For each topic, you'll see a discussion
> inline below.
[snip]
> 2. Directly checkpointing a single X11 app
> [ Our own preferred approach, as opposed to checkpointing an entire desktop;
> This is easy, but we just haven't had the time lately. I estimate
> the time to do it is about one person working straight out for two weeks
> or so. But who has that much spare time? :-) ]
Hmmm... that sounds pretty fast... given that you will need to
save and reconstruct an arbitrary state kept by the X server...
More importantly, this line of thought was brought up in this
thread multiple times, yet in a very misleading way.
The question is _not_ whether one can do c/r of single apps
without their surrounding environment. The answer for that is
simple: it _is_ possible either using proper (and more likely
per-app) wrappers, or by adapting the apps to tolerate that.
The above is entirely orthogonal to whether the c/r is in kernel
or in userspace.
So for terminal based apps, one can use 'screen'. For individual X
apps, one can use a light VNC server with proper embedding in the
desktop (e.g. metavnc). Or you could use screen-for-X like 'xpra'.
Or you can write wrappers (messy or hairy or not) that will try to
do that, or you could modify the apps. IIUC, dmtcp chose the way
of the wrappers.
But that is independent of where you do c/r ! The issue on the
table is whether the _core_ c/r should go in kernel or userspace.
Those wrappers of dmtcp are great and will be useful with either
approach.
So let us please _not_ argue that only one approach can c/r apps
or processes out of their context. That is inaccurate and misleading.
And while one may argue that one use-case is more important than
another, let us also _not_ dismiss such use cases (as was argued
by others in this thread). For example, c/r of a full desktop
session in VNC, or a VPS, is a perfectly valid and useful case.
[snip]
> 4. inotify and NSCD
> [ We try to virtualize a single app, instead of also checkpointing
> inotify and NSCD themselves. It would have been interesting to consider
> checkpointing them in userland, but that would require root privilege,
> and one core design principle we have, is that all of our C/R is
> completely unprivileged. So, we would see distributing DMTCP as
> a package in a distro, and letting individual users decide for
> what computation they might want to use it. ]
FYI, inotify() is a syscall and does not require root privileges. It's
a kernel API used to get notifications of changes to file system inodes.
For instance, it's commonly used by file managers (e.g. nautilus).
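A minimal example of the (real) inotify API, runnable by any
unprivileged user:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/inotify.h>

    int main(void)
    {
        char buf[4096];
        int fd = inotify_init();
        int wd = inotify_add_watch(fd, "/tmp", IN_CREATE | IN_DELETE);
        ssize_t n = read(fd, buf, sizeof(buf)); /* blocks for events */

        if (n > 0) {
            struct inotify_event *ev = (struct inotify_event *)buf;

            printf("wd=%d mask=%x name=%s\n", ev->wd,
                   (unsigned)ev->mask, ev->len ? ev->name : "");
        }
        inotify_rm_watch(fd, wd);
        close(fd);
        return 0;
    }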
>
> 5. Checkpointing DRM state and other graphics chip state
> [ It comes down to virtualization around a single app versus checkpointing
> _all_ of X. --- Two different approaches. ]
>
> 6. kernel c/r of input devices might be alot easier
> [ We agree with you. By virtualizing around a single app, we hope
> to avoid this issue. ]
Back to the point argued above, "virtualization around a single app"
are the wrappers that allow one to take an app out of context and sort of
implant it in another context. It's a very desirable feature, but
orthogonal to the c/r technique.
>
> 7. C/R for link/open/rm/open/write/read puzzle
>
> 8. What happens if the DMTCP coordinator ( checkpoint control process) dies?
> [ The same thing that happens if a user process dies. We kill the whole
> computation, and restart. At restart, we use a new coordinator.
> Coordinators are stateless. ]
>
> 9. We try to hide the reserved signal (SIGUSR2 by default) ...
> [ Matt says this is a mess, but we note that glibc does this too. ]
>
> 10. checkpoint, gdb and PTRACE_ATTACH
> [ DMTCP does not use PTRACE_ATTACH in its implementation. So, we can
> and do fully support user processes that use PTRACE_ATTACH. ]
Hmm... can you really c/r from userspace a process that was, at
checkpoint time, in a ptrace-stopped state at an arbitrary kernel
ptrace-hook ? I strongly suspect the answer is "no", definitely
not unless you also virtualize and replicate the entire in-kernel
ptrace functionality in userspace,
>
> 11. DMTCP, ABIs, can there be a race condition between the ckpt thread and
> user threads of an app?
> [ DMTCP doesn't introduce any new ABIs. There may be a misconception here.
> If we can talk at length off-line, I could explain more about
> the DMTCP design. Inline, I explain why race conditions should
> not be an issue. ]
I beg to differ. Virtualization that relies on a "black box" (in
the sense that it works around an API but is not integrated into the
API, like dmtcp does) has been shown time and again to be racy. The
common term is TOCTTOU races. See "Traps and Pitfalls: Practical
Problems in System Call Interposition Based Security Tools" for
example (http://www.stanford.edu/~talg/papers/traps/abstract.html),
and many others that cite (or not) this work.
I believe the way dmtcp virtualizes the pid-namespace makes no
exception to this rule.
[snip]
>
> I think we would need to elaborate with individual cases. But as I wrote
> above, DMTCP and Linux C/R started with two different philosophies.
> I'm not sure if you fully understood the DMTCP goals and philosophy yet,
> but I hope my comments above help clarify it.
Yes, let's look into the goals:
dmtcp aims to provide c/r for a certain class of applications and
environments. For this dmtcp offers:
(1) userspace c/r engine and c/r-oriented virtualization, and
(2) userspace (often per-application or per-environment) wrappers.
linux-cr provides (3) generic, transparent kernel-based c/r engine
(yes, transparent! without userspace virtualization, LD_PRELOAD
tricks, or collaboration of the developer/application/user).
So let's compare apples to apples - let's compare (3) to (1).
All of the work related to item (2) applies to and benefits
from either.
(Now looking forward to discussing more details with the dmtcp team on
Tuesday and on :)
Thanks,
Oren.
On 11/05/2010 08:36 PM, Kapil Arya wrote:
>> I'm probably missing something but can't you stop the application
>> using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry
>> about -EINTR failures (there are some exceptions but nothing really to
>> worry about). Also, unless the manager thread needs to be always
>> online, you can inject manager thread by manipulating the target
>> process states while taking a snapshot.
>
> In fact CryoPid uses exactly the same approach and has been around for about 5
> years. Not as much development effort has gone into CryoPid as DMTCP and so its
> application coverage is not as broad. But the larger issue for using PTRACE is
> that you cannot have two superiors tracing the same inferior process. So if you
> want to checkpoint a gdb session or valgrind or tmux or strace, then you cannot
> directly control and quiesce the inferior process being traced.
>
> Beyond that, we also have a vision (not yet implemented) of process
> virtualization by which one can change the behavior of a program. For example,
> if a distributed computation runs over infiniband, can we migrate to a TCP/IP
> cluster. For this, one needs the flexibility of wrappers around system calls.
> This vision of process virtualization also motivates why our own research
> project has steered away from in-kernel C/R.
This is a very useful vision. However, it is not about how you
do c/r, but rather about what you do after you restart and before you
let the application resume execution.
For example, in your example, you'd need to wrap the library calls
(e.g. of the MPI implementation) and replace them to use TCP/IP or
infiniband. Wrapping on system calls won't help you.
Or you could just replace the resource - e.g., make the restarted
application use a socket for stdout instead of the tty, so you can
redirect the output to where-ever.
Both methods are orthogonal to the c/r itself: linux-cr will allow
you to replace/modify resources if you so wish, and I suspect that
MTCP also can/will.
Interposing on library calls is possible with MTCP methods, or
using binary instrumentation, or PIN, or DynInst, or LD_PRELOAD.
The only two reasons to interpose on systems calls, as I noted
in earlier message (http://lkml.org/lkml/2010/11/5/262 - see
points "2)" and "3)" about userland-workarounds):
One - to virtualize in userspace resources (e.g. pids) that the
kernel already knows how to virtualize.
Two - to track state of resources during execution and lie about
their state when needed, because userspace can't cleanly save
and restore their state.
Virtualization through interposition is extremely tricky in and
out of the kernel. The examples given throughout this thread (by
either side) expose the tip of the iceberg. Interposition as a
technique is full of security and other pitfalls, as discussed
by extensive literature in the area. (I cited in another email).
So I'll repeat the question I asked there: is re-implementing
chunks of kernel functionality and all namespaces in userspace
the way to go ?
>
>>> But since you ask :-), there is one thing on our wish list. We
>>> handle address space randomization, vdso, vsyscall, and so on quite
>>> well. We do not turn off address space randomization (although on
>>> restart, we map user segments back to their original addresses).
>>> Probably the randomized value of brk (end-of-data or end of heap) is
>>> the thing that gave us the most troubles and that's where the code
>>> is the most hairy.
>>
[snip]
> The design of DMTCP was decided upon roughly during the period from Linux 2.6.3
> through Linux 2.6.18. At that time, /proc/*/net did not exist. You are right
> that this can provide a much better design for DMTCP and eliminate some of our
> wrappers. Thanks very much for pointing this out. We are now eager to implement a
> new design based on /proc/*/net in the near future.
>
> Since /proc/*/net provides a simpler design for sockets, we started wondering
> what other simplifications may be possible. Here is one possibility: in the case
> of shared file descriptors, DMTCP goes through two barriers in order to decide
> which process will be responsible for checkpointing which shared-file
> descriptor. It works and the overhead is reasonable, but if you have additional
> suggestion for this case, we would be very interested.
What is "reasonable" overhead ?
For which applications ?
What about a 'kernel make' ?
What about servers (db, web, etc) ?
What about VPSs/VDIs ?
Can we do better, including for HPC ?
...
>
>> I think this is why userland CR implementation makes much more sense.
>> Most of states visible to a userland process are rather rigidly
>> defined by standards and, ultimately, ABI and the kernel exports most
>> of those information to userland one way or the other. Given the
>> right set of needed features, most of which are probably already
>> implemented, a userland implementation should have access to most
>> information necessary to checkpoint without resorting to too messy
>> methods and then there inevitably needs to be some workarounds to make
>> CR'd processes behave properly w.r.t. other states on the system, so
>> userland workarounds are inevitable anyway unless it resorts to
>> preemptive separation using namespaces and containers, which I frankly
>> think isn't much of value already and more so going forward.
>
> It's a very good point, and we agree completely. Here are some examples where we
> believe, a userland component is inevitable even if one begins with in-kernel
> C/R:
Exactly ! Wrapping around apps to isolate them from the environment
is desirable, regardless of how you technically c/r the apps, when
you want to be able to c/r apps outside their native environment.
Generally, you can either include the environment in the checkpoint,
or provide wrappers to virtualize it after restart, or modify the app
so that it knows how to adapt to new environments after restart.
Either way, you need to technically c/r the app, no matter how much
userspace trickery you may choose to apply afterwards if needed. And
doing so in-kernel is more transparent (yes, transparent means that
it does not require LD_PRELOAD or collaboration of the application!
nor does it require userspace virtualizations of so many things
already provided by the kernel today), more generic, more flexible,
provides more guarantees, covers more types or states of resources,
and can perform significantly better.
And then, if you want to work with dmtcp's type of scenarios, you
could use the generic c/r and apply their wrappers on top of it !
[snip]
Thanks,
Oren.
> > 2. Directly checkpointing a single X11 app
> > [ Our own preferred approach, as opposed to checkpointing an entire desktop;
> > This is easy, but we just haven't had the time lately. I estimate
> > the time to do it is about one person working straight out for two weeks
> > or so. But who has that much spare time? :-) ]
>
> Hmmm... that sounds pretty fast... given that you will need to
> save and reconstruct an arbitrary state kept by the X server...
>
> More importantly, this line of thought was brought up in this
> thread multiple times, yet in a very misleading way.
>
> The question is _not_ whether one can do c/r of single apps
> without their surrounding environment. The answer for that is
> simple: it _is_ possible either using proper (and more likely
> per-app) wrappers, or by adapting the apps to tolerate that.
>
> The above is entirely orthogonal to whether the c/r is in kernel
> or in userspace.
These are all good points by Oren. It's not about in-kernel _or_ userland.
There are opportunities to use both -- each where it is strongest,
and I'm looking forward to that discussion with Oren. I do think
that reconstructing the state of the X server is not as hard as Oren
paints it, but let's talk about that in the discussion.
> But that is independent of where you do c/r ! The issue on the
> table is whether the _core_ c/r should go in kernel or userspace.
> Those wrappers of dmtcp are great and will be useful with either
> approach.
>
> So let us please _not_ argue that only one approach can c/r apps
> or processes out of their context. That is inaccurate and misleading.
>
> And while one may argue that one use-case is more important than
> another, let us also _not_ dismiss such use cases (as was argued
> by others in this thread). For example, c/r of a full desktop
> session in VNC, or a VPS, is a perfectly valid and useful case.
I agree. I apologize if I was too argumentative in the previous post.
> FYI, inotify() is a syscall and does not require root privileges. It's
> a kernel API used to get notifications of changes to file system inodes.
> For instance, it's commonly used by file managers (e.g. nautilus).
Yes, I know. I was writing too fast in trying to respond to all the points.
Matt had asked how we would handle inotify(), but I was getting swamped
by all the questions. There is a virtualization approach to inotify in which
one puts wrappers around inotify_add_watch(), inotify_rm_watch() and
friends in the same way as we wrap open() and could wrap close().
One would then need to wrap read() (which we don't like to do, just
in case it could add significant overhead). But if we consider kernel
and userland virtualization together, then something similar to TIOCSTI
for ioctl would allow us to avoid wrapping read().
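A sketch of such a wrapper (illustrative, not DMTCP's code;
record_watch() is a hypothetical helper that saves the watch so it
can be re-created at restart):

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdint.h>
    #include <sys/inotify.h>

    extern void record_watch(int fd, const char *path,
                             uint32_t mask);       /* hypothetical */

    int inotify_add_watch(int fd, const char *path, uint32_t mask)
    {
        static int (*real_add_watch)(int, const char *, uint32_t);

        if (!real_add_watch)
            real_add_watch = (int (*)(int, const char *, uint32_t))
                             dlsym(RTLD_NEXT, "inotify_add_watch");
        record_watch(fd, path, mask);  /* "spy" for later restart */
        return real_add_watch(fd, path, mask);
    }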
> Back to the point argued above, "virtualization around a single app"
> are the wrappers that allow one to take an app out of context and sort of
> implant it in another context. It's a very desirable feature, but
> orthogonal to the c/r technique.
I agree. I look forward to the discussion where we can put all this
into a single perspective.
> Hmm... can you really c/r from userspace a process that was, at
> checkpoint time, in a ptrace-stopped state at an arbitrary kernel
> ptrace-hook ? I strongly suspect the answer is "no", definitely
> not unless you also virtualize and replicate the entire in-kernel
> ptrace functionality in userspace,
Let's try it and see. If you write a program, we'll try it out in
DMTCP (unstable branch) and see. So far, checkpointing gdb sessions
has worked well for us. If there is something we don't cover, it will
be helpful to both of us to find it, and analyze that case.
> I beg to differ. Virtualization that relies on a "black box" (in
> the sense that it works around an API but is not integrated into the
> API, like dmtcp does) has been shown time and again to be racy. The
> common term is TOCTTOU races. See "Traps and Pitfalls: Practical
> Problems in System Call Interposition Based Security Tools" for
> example (http://www.stanford.edu/~talg/papers/traps/abstract.html),
> and many others that cite (or not) this work.
>
> I believe the way dmtcp virtualizes the pid-namespace makes no
> exception to this rule.
Another excellent topic for discussion. I look forward to the discussion.
Thanks for the advance pointer for us to take a look at.
> Yes, let's look into the goals:
>
> dmtcp aims to provide c/r for a certain class of applications and
> environments. For this dmtcp offers:
> (1) userspace c/r engine and c/r-oriented virtualization, and
> (2) userspace (often per-application or per-environment) wrappers.
>
> linux-cr provides (3) generic, transparent kernel-based c/r engine
> (yes, transparent! without userspace virtualization, LD_PRELOAD
> tricks, or collaboration of the developer/application/user).
>
> So let's compare apples to apples - let's compare (3) to (1).
> All of the work related to item (2) applies to and benefits
> from either.
>
> (Now looking forward to discussing more details with the dmtcp team on
> Tuesday and on :)
Also a very good point above, and I agree. The offline discussion should
be a better forum for putting this all into perspective.
Thanks again for your thoughtful response,
- Gene
I'd like to add a few clarifications, below, about DMTCP concerning
Oren's comments. I'd also like to point out that we've had about 100
downloads per month from sourceforge (and some interesting use cases
from end users) over the last year (although the sourceforge numbers
do go up and down :-) ). In general, I think we'll all understand the
situation better after having had the opportunity to talk offline.
Below are some clarifications about DMTCP.
===
> For example, in your example, you'd need to wrap the library calls
> (e.g. of the MPI implementation) and replace them to use TCP/IP or
> infiniband. Wrapping on system calls won't help you.
We do not put any wrappers around MPI library calls. MPI calls things
like open, close, connect, listen, execve({"ssh", ...}, ...), etc.
At this time, DMTCP adds wrappers _only_ around calls to libc.so
and libpthread.so . This is sufficient to checkpoint a distributed
computation like MPI.
> The only two reasons to interpose on systems calls, ...
>
> One - to virtualize in userspace resources (e.g. pids) that the
> kernel already knows how to virtualize.
>
> Two - to track state of resources during execution and lie about
> their state when needed, because userspace can't cleanly save
> and restore their state.
Just a small correction about interposition. The primary "Reason Two"
for interposing on system calls should be to _spy_ on what the user process
is doing and save that information. For the most part, we do not
_lie about their state when needed_. I agree that virtualization of pids
is an exception where we have to lie, but that was already stated as
"Reason One" above. At restart time, we may also recreate resources that are
no longer in the kernel. But this is not an example of interposition.
I suppose that it is an example of lying, but every C/R technique will
need to do this.
Later, perhaps Oren, Kapil and I can browse the DMTCP code together,
and we can look exactly at what each wrapper is doing. The system call
wrappers are, in fact, the smaller part of the DMTCP code. It's about
3000 lines of code. For anybody who is curious about what our wrappers do,
please download the DMTCP source code, and look at
.../dmtcp/src/*wrapper*.cpp .
> So I'll repeat the question I asked there: is re-implementing
> chunks of kernel functionality and all namespaces in userspace
> the way to go ?
If you're referring to interposition here, that takes place essentially
in the wrappers, and the wrappers are only 3000 lines of code in DMTCP.
Also, I don't believe that we're "re-implementing chunks of kernel
functionality", but let's continue that discussion offline.
> What is "reasonable" overhead ?
> For which applications ?
> What about a 'kernel make' ?
> What about servers (db, web, etc) ?
> What about VPSs/VDIs ?
> Can we do better, including for HPC ?
Again, all good questions that will be answered more easily offline.
> ... (yes, transparent means that
> it does not require LD_PRELOAD or collaboration of the application!
> nor does it require userspace virtualizations of so many things
> already provided by the kernel today), more generic, more flexible,
> provides more guarantees, covers more types or states of resources,
> and can perform significantly better.
I still haven't understood why you object to the DMTCP use of LD_PRELOAD.
How will the user app ever know that we used LD_PRELOAD, since we remove
LD_PRELOAD from the environment before the user app libraries and main
can begin? And, if you really object to LD_PRELOAD, then there are
other ways to capture control. Similarly, I'll have to understand better
what you mean by the _collaboration of the application_. DMTCP operates
on unmodified application binaries. Basically, if _transparent_ means
that one is not allowed to use anything at all from userland, then I
agree with you that no userland checkpointing can ever be transparent.
But, I think that's a biased definition of _transparent_. :-)
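For reference, the capture-and-hide trick reduces to a constructor in
the preloaded library (an illustrative sketch, not DMTCP's actual
startup code):

    #include <stdlib.h>

    __attribute__((constructor))
    static void capture_control(void)
    {
        /* Runs before the application's main().  After setting up
         * (checkpoint thread, wrappers, ...), remove ourselves from
         * the environment so the app never sees LD_PRELOAD set. */
        unsetenv("LD_PRELOAD");
    }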
> And then, if you want to work with dmtcp's type of scenarios, you
> could use the generic c/r and apply their wrappers on top of it !
Agreed. As before, I'm looking forward to us analyzing all the
use cases offline. I think that we're all (myself included) in the
situation of the three blind men and the elephant. I think part of the
misunderstanding is that we're each thinking about a different use case,
and so we (myself included) end up comparing apples and oranges.
Thanks,
- Gene
On 11/07/2010 02:42 PM, Gene Cooperman wrote:
> I'd like to add a few clarifications, below, about DMTCP concerning
> Oren's comments. I'd also like to point out that we've had about 100
> downloads per month from sourceforge (and some interesting use cases
> from end users) over the last year (although the sourceforge numbers
> do go up and down :-) ). In general, I think we'll all understand the
> situation better after having had the opportunity to talk offline.
> Below are some clarifications about DMTCP.
> ===
>
>> For example, in your example, you'd need to wrap the library calls
>> (e.g. of the MPI implementation) and replace them to use TCP/IP or
>> infiniband. Wrapping on system calls won't help you.
>
> We do not put any wrappers around MPI library calls. MPI calls things
> like open, close, connect, listen, execve({"ssh", ...}, ...), etc.
> At this time, DMTCP adds wrappers _only_ around calls to libc.so
> and libpthread.so . This is sufficient to checkpoint a distributed
> computation like MPI.
Of course. And you don't need syscall virtualization for this.
Zap did it already many years ago :) Only problem with the above
is that, conveniently enough, you _left out_ the context:
>> For example,
>> if a distributed computation runs over infiniband, can we migrate to
>> a TCP/IP cluster. For this, one needs the flexibility of wrappers
>> around system calls.
Do you also support checkpointing a distributed app that uses an
infiniband MPI stack and restart it with a TCP based MPI stack ?
Can you do it with only syscall wrapping and without knowledge
of the MPI implementation and some MPI-specific logic in the
wrappers ? I'm curious how you do that without wrapping around
MPI calls, or without a c/r-aware implementation of MPI.
Again, this is unrelated to how you do the core c/r work. I think
we both agree that _this_ kind of app-wrappers/app-awareness is
useful for certain uses of c/r.
[snip]
>> So I'll repeat the question I asked there: is re-implementing
>> chunks of kernel functionality and all namespaces in userspace
>> the way to go ?
>
> If you're referring to interposition here, that takes place essentially
> in the wrappers, and the wrappers are only 3000 lines of code in DMTCP.
> Also, I don't believe that we're "re-implementing chunks of kernel
> functionality", but let's continue that discussion offline.
The interposition itself is relatively simple (though not atomic).
The problem is the logic to "spy" on and "lie" to the applications.
Examples: saving ptrace state, saving FD_CLOEXEC flag, correctly
maintaining a userspace pid-ns, etc.
[...]
>
>> ... (yes, transparent means that
>> it does not require LD_PRELOAD or collaboration of the application!
>> nor does it require userspace virtualizations of so many things
>> already provided by the kernel today), more generic, more flexible,
>> provides more guarantees, covers more types or states of resources,
>> and can perform significantly better.
>
> I still haven't understood why you object to the DMTCP use of LD_PRELOAD.
> How will the user app ever know that we used LD_PRELOAD, since we remove
> LD_PRELOAD from the environment before the user app libraries and main
> can begin? And, if you really object to LD_PRELOAD, then there are
> other ways to capture control. Similarly, I'll have to understand better
I don't object to it per se - it's actually pretty useful oftentimes.
But in our context, it has limitations. For example, it does not
cover static applications, nor apps that call syscalls directly
using int 0x80. Also, it conflicts with LD_PRELOAD possibly needed
for other software (like valgrind) - for which again you would need
yet another per-app wrapper, at the very least.
> what you mean by the _collaboration of the application_. DMTCP operates
> on unmodified application binaries.
I mean that the application needs to be scheduled and to run to
participate in its own checkpoint. You use syscall interposition
and signal games to do exactly that - gain control over the app
and run your library's code. This has at least three negatives:
first, some apps don't want to or can't run - e.g. ptraced, or
swapped (think incremental checkpoint: why swap everything in ?!);
Second, the coordination can take significant time, especially if
many tasks/threads and resources are involved; Third, it modifies
the state of the app - if something goes wrong while you use c/r
to migrate an app, you impact the app.
('ptrace' relieves you from the need for "collaboration"
of processes, but doesn't address the other problems and adds
its own issues).
> Basically, if _transparent_ means
> that one is not allowed to use anything at all from userland, then I
> agree with you that no userland checkpointing can ever be transparent.
> But, I think that's a biased definition of _transparent_. :-)
"Transparent" c/r means "invisible" to the user/apps, i.e. that
you don't restrict the user or the app in what they do and how
they do it.
Did you ever try to 'ltrace skype' ? There exists useful and
popular software that doesn't like being spied on...
Oren.
[cc'ing linux containers mailing list]
On 11/06/2010 04:40 PM, Gene Cooperman wrote:
> 8. What happens if the DMTCP coordinator ( checkpoint control process) dies?
> [ The same thing that happens if a user process dies. We kill the whole
> computation, and restart. At restart, we use a new coordinator.
> Coordinators are stateless. ]
My experience is different:
I downloaded dmtcp and followed the quick-start guide:
(1) "dmtcp_coordinator" on one terminal
(2) "dmtcp_checkpoint bash" on another terminal
Then I:
(3) pkill -9 dmtcp_coordinator
... oops - 'bash' died.
I didn't even try to take a checkpoint :(
Oren.
[cc'ing linux containers mailing list]
On 11/07/2010 01:49 PM, Gene Cooperman wrote:
[snip]
> Matt had asked how we would handle inotify(), but I was getting swamped
> by all the questions. There is a virtualization approach to inotify in which
> one puts wrappers around inotify_add_watch(), inotify_rm_watch() and
> friends in the same way as we wrap open() and could wrap close().
> One would then need to wrap read() (which we don't like to do, just
This sounds like reimplementing in userspace the very same logic
done by the kernel :)
> in case it could add significant overhead). But if we consider kernel
> and userland virtualization together, then something similar to TIOCSTI
> for ioctl would allow us to avoid wrapping read().
We could work to add ABIs and APIs for each and every possible piece
of state that affects userspace. And for each we'll argue forever
about the design and some time later regret that it wasn't designed
correctly :p
Even if that happens (which is very unlikely and unnecessary),
it will generate all the very same code in the kernel that Tejun
has been complaining about, and _more_. And we will still suffer
from issues such as lack of atomicity and being unable to do many
simple and advanced optimizations.
Or we could use linux-cr for that: do the c/r in the kernel,
keep the know-how in the kernel, expose (and commit to) a
per-kernel-version ABI (not vow to keep countless new individual
ABIs forever after getting them wrong...), be able to do all
sorts of useful optimization and provide atomicity and guarantees
(see under "leak detection" in the OLS linux-cr paper). Also,
once the c/r infrastructure is in the kernel, it will be easy
(and encouraged) to support newly introduced features.
Finally, then we would use dmtcp as well as other tools on top
of the kernel-cr - and I'm looking forward to do that !
[snip]
>> Hmm... can you really c/r from userspace a process that was, at
>> checkpoint time, in a ptrace-stopped state at an arbitrary kernel
>> ptrace-hook ? I strongly suspect the answer is "no", definitely
>> not unless you also virtualize and replicate the entire in-kernel
>> ptrace functionality in userspace,
>
> Let's try it and see. If you write a program, we'll try it out in
> DMTCP (unstable branch) and see. So far, checkpointing gdb sessions
> has worked well for us. If there is something we don't cover, it will
> be helpful to both of us to find it, and analyze that case.
Try "strace bash" :)
I suspect it won't work - and for the reasons I described.
[snip]
>> (Now looking forward to discuss more details with dmtcp team on
>> Tuesday and on :)
>
> Also a very good point above, and I agree. The offline discussion should
> be a better forum for putting this all into perspective.
>
> Thanks again for your thoughtful response,
Same here. Talk to you soon...
Oren.
On Sat, 6 Nov 2010, Matt Helsley wrote:
> Yes, our patches touch a wide variety of kernel code. You have just failed
> to appreciate how "wide" the kernel ABI truly is. You can't really count
> it by number of syscalls, number of pseudo-filesystems, etc. There's
> also the intended behavior of those interfaces to consider. Each piece
> of checkpoint/restart code is relatively self-contained. This can be
> confirmed merely by looking at many of the patches we've already posted
> enabling checkpoint/restart of that feature. Until you've tried to
> implement checkpoint/restart for an interface or until you've bothered
> to review a patch for one of them (my favorite one is eventfd:
> http://www.mail-archive.com/[email protected]/msg21565.html ) please
> don't tell us it's too complex. Then compare that with your proposed
> ghastly stack of userspace cards -- ptrace (really more like strace) +
> LD_PRELOAD + a daemon...
>
> Incidentally, 20k lines of code is less than many pieces of the kernel.
> It's less than many:
>
> Filesystems (I've selected ones designed for rotating media or networks usually..)
> ext4, nfs, ocfs2, xfs, reiserfs, ntfs, gfs2, jfs, cifs, ubifs, nilfs2, btrfs
>
> Non-filesystem file-system support code:
> nfsd, nls
>
> It's less than one of the simpler DRM graphics drivers -- i915:
> $ cd drivers/gpu/drm/i915
> $ wc -l *.[ch]
> ...
> 41481 total
>
> It's less than any one of the lpfc, bfa, aic7xxx, qla2xxx, and mpt2sas
> drivers I see under scsi. Perhaps a more fair comparison might be to compare
> a single driver to a single checkpointable kernel interface but it's
> a more-fair comparison that skews even more in our favor.
Please, do not compare things like single file systems, drivers, or
otherwise fairly isolated components, with this "thing".
This thing touches a freaky-large number of subsystems, effectively
adding a glueage between them, which might end up causing problems
(and/or restrict design choices) in the future.
The naked patch looks like just a sugar coating to me, which left out 300+
lines of extra logic in epoll alone.
This is one of the widest, deepest, most intrusive patches I have seen in a
while, whose inclusion would require a little bit more than handwaving and
continuous re-posting IMO.
- Davide
On Sun, Nov 07, 2010 at 04:30:19PM -0500, Oren Laadan wrote:
>
>
> On 11/07/2010 02:42 PM, Gene Cooperman wrote:
> >I'd like to add a few clarifications, below, about DMTCP concerning
> >Oren's comments. I'd also like to point out that we've had about 100
> >downloads per month from sourceforge (and some interesting use cases
> >from end users) over the last year (although the sourceforge numbers
> >do go up and down :-) ). In general, I think we'll all understand the
> >situation better after having had the opportunity to talk offline.
> >Below are some clarifications about DMTCP.
> >===
> >
> >>For example, in your example, you'd need to wrap the library calls
> >>(e.g. of the MPI implementation) and replace them to use TCP/IP or
> >>infiniband. Wrapping on system calls won't help you.
> >
> >We do not put any wrappers around MPI library calls. MPI calls things
> >like open, close, connect, listen, execve({"ssh", ...}, ...), etc.
> >At this time, DMTCP adds wrappers _only_ around calls to libc.so
> >and libpthread.so . This is sufficient to checkpoint a distributed
> >computation like MPI.
>
> Of course. And you don't need syscall virtualization for this.
> Zap did it already many years ago :) Only problem with the above
> is that, conveniently enough, you _left out_ the context:
>
> >> For example,
> >> if a distributed computation runs over infiniband, can we migrate
> >> to a TCP/IP cluster. For this, one needs the flexibility of
> >> wrappers around system calls.
>
> Do you also support checkpointing a distributed app that uses an
> infiniband MPI stack and restart it with a TCP based MPI stack ?
> Can you do it with only syscall wrapping and without knowledge
> of the MPI implementation and some MPI-specific logic in the
> wrappers ? I'm curious how you do that without wrapping around
> MPI calls, or without a c/r-aware implementation of MPI.
> ...
Yes, that's exactly what we plan to do. And we have begun some of the
initial work. And yes, we plan to do it without any MPI-specific logic.
When we talk to each other offline, I'd be happy to give you more
details of how we do it now for TCP "without wrapping around MPI calls,
or without an c/r-aware implementation of MPI", and how we are working
on extending that to Infiniband.
> [snip]
>
> >>So I'll repeat the question I asked there: is re-implementing
> >>chunks of kernel functionality and all namespaces in userspace
> >>the way to go ?
> >
> >If you're referring to interposition here, that takes place essentially
> >in the wrappers, and the wrappers are only 3000 lines of code in DMTCP.
> >Also, I don't believe that we're "re-implementing chunks of kernel
> >functionality", but let's continue that discussion offline.
>
> The interposition itself is relatively simple (though not atomic).
> The problem is the logic to "spy" on and "lie" to the applications.
> Examples: saving ptrace state, saving FD_CLOEXEC flag, correctly
> maintaining a userspace pid-ns, etc.
And let's wait for the offline discussion for that --- and we'll describe
in detail at that time how we do each one of the things that you mention.
It will be easier to discuss each of the things that you mention by
looking at the DMTCP code "side-by-side" over the phone. We hope to
show you that the logic is really not so complex.
> >
> >>... (yes, transparent means that
> >>it does not require LD_PRELOAD or collaboration of the application!
> >>nor does it require userspace virtualizations of so many things
> >>already provided by the kernel today), more generic, more flexible,
> >>provides more guarantees, covers more types or states of resources,
> >>and can perform significantly better.
> >
> >I still haven't understood why you object to the DMTCP use of LD_PRELOAD.
> >How will the user app ever know that we used LD_PRELOAD, since we remove
> >LD_PRELOAD from the environment before the user app libraries and main
> >can begin? And, if you really object to LD_PRELOAD, then there are
> >other ways to capture control. Similarly, I'll have to understand better
>
> I don't object to it per se - it's actually pretty useful oftentimes.
> But in our context, it has limitations. For example, it does not
> cover static applications, nor apps that call syscalls directly
> using int 0x80.
For static apps, we would use other interposition techniques. And yes,
we haven't implemented support of static apps so far, because our
user base hasn't asked for it. We do handle apps that use the
syscall system call to make system calls. We don't handle apps
that directly use "int 0x80". Again, there are ways to do this, but
our user base hasn't asked for it.
In general, please keep in mind the principles that you rightly had
to remind me of in a previous post. :-) Our two pieces of work are coming
from two different directions with two different visions. Linux C/R wants
to be so transparent that no user app can ever detect it. DMTCP wants to be
transparent enough that any reasonable use case is covered.
In particular, DMTCP considers distributed computations to be equally
valid use cases for the core DMTCP C/R. I also agree that Linux C/R can be
extended to cover distributed apps -- either through userland extensions,
or maybe with techniques like in your excellent CLUSTER-2005 paper.
Hence, DMTCP has grown its coverage of apps over the years. When we
talk offline, let's talk about future use cases, and whether there are
or are not showstoppers for a userland approach.
> Also, it conflicts with LD_PRELOAD possibly needed
> for other software (like valgrind) - for which again you would need
> yet another per-app wrapper, at the very least.
DMTCP does not conflict with the fact that valgrind uses LD_PRELOAD.
We add dmtcphijack.so to the beginning of LD_PRELOAD before the user app
starts. We then remove it before the app really starts. The LD_PRELOAD
requests of valgrind continue to be honored. It all works.
> >what you mean by the _collaboration of the application_. DMTCP operates
> >on unmodified application binaries.
>
> I mean that the application needs to be scheduled and to run to
> participate in its own checkpoint. You use syscall interposition
> and signal games to do exactly that - gain control over the app
> and run your library's code. This has at least three negatives:
> first, some apps don't want to or can't run - e.g. ptraced, or
> swapped (think incremental checkpoint: why swap everything in ?!);
> Second, the coordination can take significant time, especially if
> many tasks/threads and resources are involved; Third, it modifies
> the state of the app - if something goes wrong while you use c/r
> to migrate an app, you impact the app.
>
> ('ptrace' relieves you from the need for "collaboration"
> of processes, but doesn't address the other problems and adds
> its own issues).
Again, I'll add some clarification, although this will best be done
offline. DMTCP does indeed do interposition of the 'syscall' system
call in glibc. As for signals, we don't really play
any signal games. The sole use of signals in DMTCP is for the
checkpoint thread of a process to quiesce the user threads of that
same process. We use one reserved signal, and we use it solely
internally within a single process. If the user app will allow
us to use a single signal (e.g. SIGRTMIN+2), then we don't need
any games or interposition at all. We were worried about apps
that wish to set _every_ signal to SIG_IGN, etc.
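A minimal sketch of that quiesce mechanism (illustrative only, not
DMTCP's code; the semaphores would be initialized with sem_init() at
startup, and the handler installed once, process-wide):

    #include <pthread.h>
    #include <semaphore.h>
    #include <signal.h>

    static sem_t parked, resume_sem;

    static void ckpt_handler(int sig)
    {
        (void)sig;
        sem_post(&parked);      /* report "I am stopped" */
        sem_wait(&resume_sem);  /* park until the checkpoint is done */
    }

    static void install_handler(void)
    {
        struct sigaction sa = { .sa_handler = ckpt_handler };

        sigaction(SIGRTMIN + 2, &sa, NULL);  /* the reserved signal */
    }

    static void quiesce_thread(pthread_t tid)
    {
        pthread_kill(tid, SIGRTMIN + 2);
        sem_wait(&parked);      /* wait until that thread parks */
    }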
Next, let's consider what you say about wrappers around wrappers,
and your valgrind example. Also, I'd like to make clear that we've
tested primarily on gdb. If it's important, we could do a quick test on
valgrind and report back. Our user base hasn't requested support for
valgrind so far. Assuming that valgrind does use wrappers, we have a
valgrind wrapper around a DMTCP wrapper around a glibc call, which itself
is really a wrapper around a kernel API call.
If it helps, then think of a wrapper as just another function,
that calls an inner function. Object-oriented programming uses this
principle all the time. Similarly, the glibc wrapper around a kernel
API is just one more of these functions. Another way to view this is
through the idea of layers. Each layer of the software receives a call
from the layer above and may call to the next layer below. As you're
already aware, this is a basic principle of O/S design, and so
the O/S is full of wrappers. We're just inserting one more layer ---
this time between the user app and the glibc layer.
I still don't fully understand what you mean by "collaboration", but
it sounds like your definition reduces to the use of system call
wrappers. In that case, I agree that if DMTCP were not allowed to use
system call wrappers, then DMTCP would fall apart. Aside from that
almost tautology, I don't understand why system call wrappers are inherently
bad. Glibc puts system call wrappers around almost every kernel system call.
Glibc even reserves two signals solely for its own use.
By the way, for those who wish to inspect the DMTCP wrappers, I'd like
to add to my earlier pointers to DMTCP wrappers. The relevant DMTCP code is in:
dmtcp/src/execwrappers.cpp
dmtcp/src/miscwrappers.cpp
dmtcp/src/pidwrappers.cpp
dmtcp/src/signalwrappers.cpp
dmtcp/src/socketwrappers.cpp
dmtcp/src/syscallsreal.c
dmtcp/src/syscallwrappers.h
dmtcp/src/uniquepid.cpp
dmtcp/src/virtualpidtable.cpp
The total line count is probably 4,500 lines of code, which includes
about 500 lines of copyright statement (LGPL), #includes and other boring
boilerplate. I apologize for the shorter listing in my earlier post.
I didn't intend to mislead. There's lots of other DMTCP code concerned with
what to do at the time of checkpoint and restart, but that would be
a different story.
> >Basically, if _transparent_ means
> >that one is not allowed to use anything at all from userland, then I
> >agree with you that no userland checkpointing can ever be transparent.
> >But, I think that's a biased definition of _transparent_. :-)
>
> "Transparent" c/r means "invisible" to the user/apps, i.e. that
> you don't restrict the user or the app in what they do and how
> they do it.
>
> Did you ever try to 'ltrace skype' ? There exists useful and
> popular software that doesn't like being spied on...
We have not tried to 'ltrace skype'. But ltrace is using PTRACE.
Note that DMTCP does not use PTRACE. I imagine the more interesting question
is if we ever tried 'dmtcp_checkpoint skype'. No, we have not, but
it sounds like an interesting experiment. We'd love to do it, and
discuss with you whatever we learn. In the offline discussion, perhaps
we can take a shortcut and have you describe the skype tricks to us,
so that we can give you a quick first guess.
Anyway, there's one other obvious issue with skype for both Linux C/R
and DMTCP. Skype is talking to a remote app that is probably not under
checkpoint control. And even if both ends are under checkpoint control,
Skype is probably not a good use case for C/R, but if it were, it might
indeed be a difficult problem. (I'd have to think about it.)
As before, remember that we are talking about two different approaches:
- in-kernel C/R and capturing every possible application;
- userland C/R and covering the actual use cases that one finds in practice
You seem to be arguing that there is an important use case that a DMTCP
userland approach can never cover. You may be right about such a use
case, but that detailed back-and-forth will be easier to do offline;
and then we can summarize for the list.
We'll even _help you_ look for those difficult use cases. If they're there,
we want to know about them, too. :-)
Thanks and best wishes,
- Gene
On Sun, Nov 07, 2010 at 04:44:20PM -0500, Oren Laadan wrote:
> [cc'ing linux containers mailing list]
>
> On 11/06/2010 04:40 PM, Gene Cooperman wrote:
>
> >8. What happens if the DMTCP coordinator ( checkpoint control process) dies?
> > [ The same thing that happens if a user process dies. We kill the whole
> > computation, and restart. At restart, we use a new coordinator.
> > Coordinators are stateless. ]
>
> My experience is different:
>
> I downloaded dmtcp and followed the quick-start guide:
> (1) "dmtcp_coordinator" on one terminal
> (2) "dmtcp_checkpoint bash" on another terminal
>
> Then I:
> (3) pkill -9 dmtcp_coordinator
> ... oops - 'bash' died.
>
> I didn't even try to take a checkpoint :(
You're right. I just reproduced your example. But please remember that
we're working in a design space where if any process of a computation
dies, then we kill the computation and restart. It doesn't matter to us
if it's a user process or the DMTCP coordinator that died. I do think
this is getting too detailed for the LKML list, but since you bring it
up, here is the analysis. The user bash process exits with:
[31331] ERROR at dmtcpmessagetypes.cpp:62 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
_magicBits =
Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator die uncleanly?
This means that when the DMTCP coordinator died, it sent a message to the
checkpoint thread within the user process. The message was ill-formed.
The current DMTCP code says that if a checkpoint thread receives an
ill-formed message from the coordinator, then it should die. It's not
hard to change the protocol between DMTCP coordinator and checkpoint
thread of the user process into a more robust protocol with RETRY, further
ACK, etc. We haven't done this. Right now, the user simply restarts from
the last checkpoint. If one process of a computation has been compromised
(either DMTCP coordinator or user process), then the whole computation
has been compromised. I think in a previous version of DMTCP, the policy
was to allow the computation to continue when the coordinator dies.
Policies change.
But I think you're missing the larger point. We've developed DMTCP
over six years, largely with programmers who are much less experienced
than the kernel developers. Yet DMTCP works reliably for many users.
I consider this a credit to the DMTCP design. The Linux C/R design
is also excellent.
Can we get back to questions of design, using the implementations as
reference implementations? If you don't object, I'll also skip replying
to the other post, since I think we're getting too detailed. I'm having
trouble keeping up with the posts. :-) An offline discussion will
give us time to look more carefully at these issues, and draw more
careful conclusions.
Thanks,
- Gene
On Sun, 7 Nov 2010, Davide Libenzi wrote:
> Please, do not compare things like single file systems, drivers, or
> otherwise fairly isolated components, with this "thing".
> This thing touches a freaky-large number of subsystems, effectively
> adding a glueage between them, which might end up causing problems
> (and/or restrict design choices) in the future.
I've got a question about the ABI that would be created.
I see two possible areas that could be considered an ABI:
1. control of the C/R process
This is very clearly a userspace ABI, to be figured out and locked down
like any other ABI
2. the details of how things are stored and added back into a system
This is not as clear. At one extreme, this could be like the module
interface (the checkpointed image is only guaranteed to work on a new
system with a kernel compiled with the same config options as the system
it was checkpointed from). At the other extreme, this could be something
that allows you to checkpoint an image on 2.6.40 and restore it on 2.6.80.
Or it could be something in between.
I don't see any way that it is sane to make the C/R image definition and
interface (#2) be an ABI that is guaranteed to never change without
hurting future kernel development (exactly the type of things that Davide
is worried about above), but what sort of guarantee are people interested
in?
is it enough to say that it must be the same kernel version compiled with
the same options? (or at least the same options for some list of things
that matter, most device drivers probably would not matter for example)
or would you need compatibility across all compile options for a kernel
release?
would you require compatibility between 2.6.x.y and 2.6.x.z?
would you require compatibility between 2.6.x and 2.6.x+n (for some value
of n)?
is this something that could go in with the weakest guarantee initially,
and then as everyone is more comfortable with it, start extending the
guarantee (and as-needed adding code to the kernel to maintain
compatibility with old images)?
would you require compatibility between 2.6.x and 2.6.x-n?
David Lang
On 11/07/2010 06:05 PM, Gene Cooperman wrote:
[snip]
>>>> ... (yes, transparent means that
>>>> it does not require LD_PRELOAD or collaboration of the application!
>>>> nor does it require userspace virtualizations of so many things
>>>> already provided by the kernel today), more generic, more flexible,
>>>> provides more guarantees, cover more types or states of resources,
>>>> and can perform significantly better.
>>>
>>> I still haven't understood why you object to the DMTCP use of LD_PRELOAD.
>>> How will the user app ever know that we used LD_PRELOAD, since we remove
>>> LD_PRELOAD from the environment before the user app libraries and main
>>> can begin? And, if you really object to LD_PRELOAD, then there are
>>> other ways to capture control. Similarly, I'll have to understand better
>>
>> I don't object to it per se - it's actually pretty useful oftentimes.
>> But in our context, it has limitations. For example, it does not
>> cover static applications, nor apps that call syscalls directly
>> using int 0x80.
>
> For static apps, we would use other interposition techniques. And yes,
> we haven't implemented support of static apps so far, because our
> user base hasn't asked for it. We do handle apps that use the
> syscall system call to make system calls. We don't handle apps
> that directly use "int 0x80". Again, there are ways to do this, but
> our user base hasn't asked for it.
> In general, please keep in mind the principles that you rightly had
> to remind me of in a previous post. :-) Our two pieces of work are coming
> from two different directions with two different visions. Linux C/R wants
> to be so transparent that no user app can ever detect it. DMTCP wants to be
> transparent enough that any reasonable use case is covered.
Agreed - as long as we are considering the c/r-engine functionality
(and not the "glue" logic to keep apps outside their context after
the restart).
That said, I'm afraid we'll need more definitions of what is "reasonable"
than of what is "transparent"...
> In particular, DMTCP considers distributed computations to be equally
> valid use cases for the core DMTCP C/R. I also agree that Linux C/R can be
> extended to cover distributed apps -- either through userland extensions,
> or maybe with techniques like in your excellent CLUSTER-2005 paper.
Distributed c/r is one of the proposed use-cases for linux-cr.
The technique in that paper, BTW, was a userspace glue: during
restart, that glue re-establishes connectivity by using new TCP
connections, and c/r uses those new sockets in lieu of restoring
the old ones.
For that and other use-cases we designed linux-cr to be flexible
so that it is possible and easy to integrate any userspace glue.
>> Also, it conflicts with LD_PRELOAD possibly needed
>> for other software (like valgrind) - for which again you would need
>> yet another per-app wrapper, at the very least.
>
> DMTCP does not conflict with the fact that valgrind uses LD_PRELOAD.
> We add dmtcphijack.so to the beginning of LD_PRELOAD before the user app
> starts. We then remove it before the app really starts. The LD_PRELOAD
> requests of valgrind continue to be honored. It all works.
I stand corrected.
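For reference, the self-removal trick Gene describes can be sketched
like this (illustrative only, not the actual DMTCP code; the library
name hijack.so is invented):

    /* hijack.c - build with: gcc -shared -fPIC -o hijack.so hijack.c
     * A preloaded library can strip its own entry from LD_PRELOAD in a
     * constructor, which runs before the application's main(); entries
     * belonging to other tools (e.g. valgrind) are left untouched. */
    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <string.h>

    __attribute__((constructor))
    static void strip_self_from_preload(void)
    {
        char *preload = getenv("LD_PRELOAD");
        if (!preload)
            return;
        char *copy = strdup(preload);
        char *out = calloc(strlen(preload) + 1, 1);
        if (!copy || !out)
            return;
        for (char *tok = strtok(copy, ": "); tok; tok = strtok(NULL, ": ")) {
            if (strstr(tok, "hijack.so"))
                continue;                  /* drop our own entry */
            if (*out)
                strcat(out, ":");
            strcat(out, tok);
        }
        if (*out)
            setenv("LD_PRELOAD", out, 1);
        else
            unsetenv("LD_PRELOAD");
        free(copy);
        free(out);
    }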
>>> what you mean by the _collaboration of the application_. DMTCP operates
>>> on unmodified application binaries.
>>
>> I mean that the applications needs to be scheduled and to run to
>> participate in its own checkpoint. You use syscall interposition
>> and signals games to do exactly that - gain control over the app
>> and run your library's code. This has at least three negatives:
>> first, some apps don't want to or can't run - e.g. ptraced, or
>> swapped (think incremental checkpoint: why swap everything in ?!);
>> Second, the coordination can take significant time, especially if
>> many tasks/threads and resources are involved; Third, it modifies
>> the state of the app - if something goes wrong while you use c/r
>> to migrate an app, you impact the app.
[snip]
> If it helps, then think of a wrapper as just another function,
> that calls an inner function. Object-oriented programming uses this
> principle all the time. Similarly, the glibc wrapper around a kernel
> API is just one more of these functions. Another way to view this is
> through the idea of layers. Each layer of the software receives a call
> from the layer above and may call to the next layer below. As you're
> already aware, this is a basic principle of O/S design, and so
> the O/S is full of wrappers. We're just inserting one more layer ---
> this time between the user app and the glibc layer.
Wrappers are great (I did TA the w4118 class here...). They are
a powerful tool; however in _our_ context they have downsides:
(a) wrappers add visible overhead (less so for cpu-bound apps,
more so with server apps)
(b) wrappers that do virtualization to a "black-box" API (as
opposed to integrate with the API) are prone to races (see the
paper that I cited before)
(c) wrappers duplicate kernel logic, IMHO unnecessarily (and I
don't refer to the userspace "glue" from above)
(d) wrappers are hard to make hermetic (no escapes) to apps.
IMO, the one excellent reason to use wrappers is to support
the userspace glue that allows restarted apps to run out of
their original context.
>
> I still don't fully understand what you mean by "collaboration", but
> it sounds like your definition reduces to the use of system call
> wrappers. In that case, I agree that if DMTCP were not allowed to use
I clearly failed to explain well. Lemme try again:
If you use PTRACE to checkpoint, then you ptrace the target tasks,
peek at and save their state, and then let them resume execution.
The target apps need not collaborate - they are forced by the kernel
to the ptraced state regardless of what they were doing, and resume
execution without knowing what happened.
In linux-cr it works similarly: checkpoint does not require that
the processes be scheduled to run - they don't participate; rather,
external process(es) do the work.
In contrast, IIUC, dmtcp uses syscall wrappers and overloading of
signal(s) in order to make every checkpointed process/thread actively
execute the checkpoint logic. I refer to this as "collaborating"
with the checkpoint operation. (I mentioned the downside of this
requirement above).
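A minimal sketch of that external, non-collaborative model using
ptrace (x86 register set assumed; error handling and the actual memory
dump are omitted):

    #include <stddef.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/user.h>
    #include <sys/wait.h>

    /* Stop the target from the outside, copy its registers, resume it.
     * The target never runs any checkpoint code of its own. */
    int checkpoint_regs(pid_t pid, struct user_regs_struct *regs)
    {
        if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) < 0)
            return -1;
        waitpid(pid, NULL, 0);             /* target is now stopped */
        ptrace(PTRACE_GETREGS, pid, NULL, regs);
        /* ... read memory via PTRACE_PEEKDATA or /proc/<pid>/mem ... */
        ptrace(PTRACE_DETACH, pid, NULL, NULL);
        return 0;                          /* target resumes, none the wiser */
    }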
> system call wrappers, then DMTCP would fall apart. Aside from that
> almost tautology, I don't understand why system call wrappers are inherently
> bad. Glibc puts system call wrappers around almost every kernel system call.
> Glibc even reserves two signals solely for its own use.
Again, I failed to deliver the message: syscall wrappers are not bad.
They have limitations as noted above. Some users won't care, others
may and do.
As for glibc - those wrappers have a set of well defined tasks,
e.g. set errno, hide underlying syscall, caching, threads etc. But
glibc does not try to virtualize pids, for example, nor "spy" after
the processes, so to speak.
>>> Basically, if _transparent_ means
>>> that one is not allowed to use anything at all from userland, then I
>>> agree with you that no userland checkpointing can ever be transparent.
>>> But, I think that's a biased definition of _transparent_. :-)
>>
>> "Transparent" c/r means "invisible" to the user/apps, i.e. that
>> you don't restrict the user or the app in what they do and how
>> they do it.
>>
>> Did you ever try to 'ltrace skype' ? there exists useful and
>> popular software that doesn't like being spied after...
>
> We have not tried to 'ltrace skype'. But ltrace is using PTRACE.
> Note that DMTCP does not use PTRACE. I imagine the more interesting question
Oh... that's not what I meant: 'ltrace skype' fails because skype
tries to protect itself from being reverse-engineered. It doesn't
like ltrace's interposition on some library calls (don't know the
details). (Note that PTRACE doesn't upset skype: 'strace skype'
does work). The point being - userspace wrapping is "escapable".
> is if we ever tried 'dmtcp_checkpoint skype'. No, we have not, but
> it sounds like an interesting experiment. We'd love to do it, and
> discuss with you whatever we learn. In the offline discussion, perhaps
> we can take a shortcut and have you describe the skype tricks to us,
> so that we can give you a quick first guess.
No tricks - I once tried after a colleague mentioned that skype is
hard to reverse engineer (I thought I could prove him wrong...).
> Anyway, there's one other obvious issue with skype for both Linux C/R
> and DMTCP. Skype is talking to a remote app that is probably not under
> checkpoint control.
Linux-cr can do live migration - e.g. VDI, move the desktop - in
which case skype's sockets' network stacks are reconstructed,
transparently to both skype (local apps) and the peer (remote apps).
Then, at the destination host, skype continues to work.
> And even if both ends are under checkpoint control,
> Skype is probably not a good use case for C/R, but if it were, it might
> indeed be a difficult problem. (I'd have to think about it.)
> As before, remember that we are talking about two different approaches:
> - in-kernel C/R and capturing every possible application;
> - userland C/R and covering the actual use cases that one finds in practice
I'd assume that if the c/r engine can do the former, then it
will also do the latter. Maybe even it would be useful for dmtcp
to be able to use a couple of syscalls (checkpoint,restart) to
do the base c/r work :p
Oren.
As before, Oren, let's have that phone discussion so that we can preprocess
a lot of this, instead of acting like the three blind men and the
elephant. I will _tell you_ the strengths and weaknesses of DMTCP
on the phone, instead of you having to guess at them here on LKML. And
of course, I hope you will be similarly frank about Linux C/R on the phone.
Thank you for lowering the heat on this last post. I'll reply only to
some relevant issues in this post, rather than trying to respond to all
of your posts. I remind you that I still have my own questions about
Linux C/R, but I'm saving them for the phone discussion, since that will
be more efficient, and result in less heat.
> > If it helps, then think of a wrapper as just another function,
> >that calls an inner function. Object-oriented programming uses this
> >principle all the time. Similarly, the glibc wrapper around a kernel
> >API is just one more of these functions. Another way to view this is
> >through the idea of layers. Each layer of the software receives a call
> >from the layer above and may call to the next layer below. As you're
> >already aware, this is a basic principle of O/S design, and so
> >the O/S is full of wrappers. We're just inserting one more layer ---
> >this time between the user app and the glibc layer.
>
> Wrappers are great (I did TA the w4118 class here...). They are
> a powerful tool; however in _our_ context they have downsides:
> (a) wrappers add visible overhead (less so for cpu-bound apps,
> more so with server apps)
In our experience, the primary overhead of C/R is to save the
data to disk. This far outweighs the question of how many ms
one technique or another may require in a system call or in the kernel.
> (b) wrappers that do virtualization to a "black-box" API (as
> opposed to integrate with the API) are prone to races (see the
> paper that I cited before)
The paper you cited was:
http://www.stanford.edu/~talg/papers/traps/abstract.html
Traps and Pitfalls: Practical Problems in System Call
Interposition Based Security Tools
That paper is about Sandboxing. DMTCP is about C/R. If DMTCP was trying
to do a sandbox, it might have some of the same traps and pitfalls.
Luckily, userland C/R is a _lot_ easier than userland sandboxing.
By the way, although of less importance, I'll point out that the paper
was written in 2003, before DMTCP even started.
Next, you talk about races. The authors of that paper have races
because they are trying to do sandboxing. I already answered Matt's
post earlier about why we don't see races in DMTCP.
I'll answer it again, but in more detail.
At ordinary run-time, the DMTCP checkpoint thread is just waiting
on a select -- waiting for instructions from the DMTCP coordinator.
Our system call wrappers around user threads do not change the issue
of races. If two user threads used to have a race, they will continue
to do so in DMTCP. If two user threads did not have a race, then
DMTCP will not introduce any new races. How could DMTCP introduce
a new race when DMTCP wrappers _never_ communicate with any other thread?
At checkpoint or restart time, the DMTCP checkpoint thread also runs.
However, at checkpoint time, the first thing it does is to quiesce
all the user threads by sending a signal and forcing them into a DMTCP
signal handler. (And before we go down that other road again, I remind
you that glibc also reserves two signals solely for the use of glibc.
A user app can break glibc by using the glibc reserved signals.)
During checkpoint-restart, the DMTCP checkpoint thread is then the
_only_ thread that is executing. So, again, I don't see how a race
could be introduced. Finally, the last thing the DMTCP checkpoint
thread does is resume the user threads. The DMTCP checkpoint thread
then goes back to waiting on select for a message from the DMTCP coordinator.
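A conceptual sketch of that quiesce step (the signal number and the
semaphore-based synchronization are illustrative simplifications, not
the actual DMTCP code):

    #include <pthread.h>
    #include <semaphore.h>
    #include <signal.h>

    #define CKPT_SIGNAL SIGUSR2     /* assumed unused by the app */

    static sem_t parked, resume;    /* sem_init(..., 0, 0) at startup */

    /* Installed for CKPT_SIGNAL via sigaction() at startup. */
    static void ckpt_handler(int sig)
    {
        (void)sig;
        sem_post(&parked);          /* tell checkpoint thread we stopped */
        sem_wait(&resume);          /* park here until checkpoint is done */
    }

    /* Called by the checkpoint thread for each user thread. */
    void quiesce_thread(pthread_t t)
    {
        pthread_kill(t, CKPT_SIGNAL);
        sem_wait(&parked);          /* handler entered; thread is parked */
    }

    void release_threads(int nthreads)
    {
        while (nthreads--)
            sem_post(&resume);      /* user threads resume */
    }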
> (c) wrappers duplicate kernel logic, IMHO unnecessarily (and I
> don't refer to the userspace "glue" from above)
DMTCP wrappers do not duplicate kernel logic. In our phone conversation,
I will show you each and every one of the DMTCP wrappers. I've already
posted for the entire list where they can find the DMTCP wrappers.
I honestly don't see any duplication of kernel logic. If you do see this,
please tell us which DMTCP wrapper is duplicating the kernel logic, so
that we can talk about specifics. But please, can we review the DMTCP
code offline? A code review within LKML seems _awfully_ tedious. :-)
> (d) wrappers are hard to make hermetic (no escapes) to apps.
In general, we don't try to make all the DMTCP wrappers hermetic.
Your mindset may be influenced by the sandboxing paper above.
But again, we're not doing sandboxing. We're doing C/R.
If you're using "hermetic" as a placeholder for what we call
"pid virtualization" (a translation table between original and current
pid), then yes: for every system call that takes a pid as an argument
or returns a pid, we must add a wrapper. That is not a difficult task.
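The shape of such a wrapper is simple; a hypothetical sketch
(virt_to_real is an invented name for the translation-table lookup,
and the real DMTCP wrappers differ):

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <signal.h>
    #include <sys/types.h>

    /* Translation-table lookup: virtual pid recorded at checkpoint time
     * to the pid the process really has now (assumed provided elsewhere). */
    extern pid_t virt_to_real(pid_t vpid);

    /* Interpose on kill(): translate the pid, then call the real one. */
    int kill(pid_t vpid, int sig)
    {
        static int (*real_kill)(pid_t, int);
        if (!real_kill)
            real_kill = (int (*)(pid_t, int))dlsym(RTLD_NEXT, "kill");
        return real_kill(virt_to_real(vpid), sig);
    }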
Let's do a code review of DMTCP together (on the phone) to look for a "leak"
in the DMTCP pid's. I do think this is a lot easier and less
complex to do than to guard against all resource leaks in a container. :-)
(Sorry, I know that's a cheap shot on my part. I'm getting tired
of overly broad statements, without the opportunity for us to do a code
review or preprocess the issues back and forth on the phone.)
> IMO, the one excellent reason to use wrappers is to support
> the userspace glue that allows restarted apps to run out of
> their original context.
>
> >
> >I still don't fully understand what you mean by "collaboration", but
> >it sounds like your definition reduces to the use of system call
> >wrappers. In that case, I agree that if DMTCP were not allowed to use
>
> I clearly failed to explain well. Lemme try again:
>
> If you use PTRACE to checkpoint, then you ptrace the target tasks,
> peek at and save their state, and then let them resume execution.
> The target apps need not collaborate - they are forced by the kernel
> to the ptraced state regardless of what they were doing, and resume
> execution without knowing what happened.
>
> In linux-cr it works similarly: checkpoint does not require that
> the processes be scheduled to run - they don't participate; rather,
> external process(es) do the work.
>
> In contrast, IIUC, dmtcp uses syscall wrappers and overloading of
> signal(s) in order to make every checkpointed process/thread actively
> execute the checkpoint logic. I refer to this as "collaborating"
> with the checkpoint operation. (I mentioned the downside of this
> requirement above).
Again, a correction. DMTCP does _not_ overload signals. It uses
a signal not already used by the app. If the app tries to "zero out"
all signals, then DMTCP protects itself through wrappers (or what
you would call "lying", although I dislike these emotionally
loaded phrases). Glibc also uses dedicated signals.
Concerning "collaboration", when gdb inserts a breakpoint, it modifies
the user code. So, even though gdb uses PTRACE, by your definition,
the gdb use of breakpoints relies on "collaboration".
> >system call wrappers, then DMTCP would fall apart. Aside from that
> >almost tautology, I don't understand why system call wrappers are inherently
> >bad. Glibc puts system call wrappers around almost every kernel system call.
> >Glibc even reserves two signals solely for its own use.
>
> Again, I failed to deliver the message: syscall wrappers are not bad.
> They have limitations as noted above. Some users won't care, others
> may and do.
>
> As for glibc - those wrappers have a set of well defined tasks,
> e.g. set errno, hide underlying syscall, caching, threads etc. But
> glibc does not try to virtualize pids, for example, nor "spy" after
> the processes, so to speak.
I'm sorry to be blunt, but I simply have to say that you are wrong here.
We've spent six years developing DMTCP. We've spent a lot of time getting
to know the design principles of glibc. (And by the way, it's not just glibc
that does these dirty tricks with system calls --- bash, dash, Matlab,
and a host of other applications also do it.)
Anyway, glibc definitely does have its own "dirty tricks", including
"spy"-ing. Caching a pid and refusing to make a later system call
is definitely a form of spying.
It gets worse with glibc session ids and group ids. When a session id
or group id changes, glibc must inform all of the user threads that their
cached id has changed. To do this it uses the SETXID concept and a
dedicated signal, as I mentioned earlier. At the time when the clone call
was created, there was a discussion of whether to implement threads directly
in the Linux kernel. It was decided to go with the clone call, instead.
If I understand your general philosophy, that was a bad decision,
because NPTL threads are no longer transparent, and they now require
collaboration through wrappers in glibc.
(Sorry, another cheap shot. Can we please shift the discussion to
a phone conversation? If you're going to make me spend hours replying
on LKML, when I could explain it all to you in one hour on the phone,
then I will get cranky.)
There are also other "dirty tricks" from glibc that I could bring out
for you -- where one might argue that glibc breaks your definition
of transparency. (However, the literature has lots of papers on "transparent
checkpointing", and I think they use a different definition
of transparency from yours.)
With DMTCP and glibc both, the philosophy is that as long as
the application coverage is broad enough, and as long as the tricks
of DMTCP and glibc do not affect any programmer's natural programming
methodology, then it's okay. This is not about sandboxing, or hermeticity.
I understand that Linux C/R may have those higher goals, and that's laudable,
but please don't tell us that DMTCP is bad because it doesn't do
exactly what Linux C/R does. (Sorry, getting cranky, again.)
> >>>Basically, if _transparent_ means
> >>>that one is not allowed to use anything at all from userland, then I
> >>>agree with you that no userland checkpointing can ever be transparent.
> >>>But, I think that's a biased definition of _transparent_. :-)
> >>
> >>"Transparent" c/r means "invisible" to the user/apps, i.e. that
> >>you don't restrict the user or the app in what they do and how
> >>they do it.
> >>
> >>Did you ever try to 'ltrace skype' ? there exists useful and
> >>popular software that doesn't like being spied after...
> >
> >We have not tried to 'ltrace skype'. But ltrace is using PTRACE.
> >Note that DMTCP does not use PTRACE. I imagine the more interesting question
>
> Oh... that's not what I meant: 'ltrace skype' fails because skype
> tries to protect itself from being reverse-engineered. It doesn't
> like ltrace's interposition on some library calls (don't know the
> details). (Note that PTRACE doesn't upset skype: 'strace skype'
> does work). The point being - userspace wrapping is "escapable".
>
> >is if we ever tried 'dmtcp_checkpoint skype'. No, we have not, but
> >it sounds like an interesting experiment. We'd love to do it, and
> >discuss with you whatever we learn. In the offline discussion, perhaps
> >we can take a shortcut and have you describe the skype tricks to us,
> >so that we can give you a quick first guess.
>
> No tricks - I once tried after a colleague mentioned that skype is
> hard to reverse engineer (I thought I could prove him wrong...).
>
> > Anyway, there's one other obvious issue with skype for both Linux C/R
> >and DMTCP. Skype is talking to a remote app that is probably not under
> >checkpoint control.
>
> Linux-cr can do live migration - e.g. VDI, move the desktop - in
> which case skype's sockets' network stacks are reconstructed,
> transparently to both skype (local apps) and the peer (remote apps).
> Then, at the destination host, skype continues to work.
That's a really cool thing to do, and it's definitely not part of what
DMTCP does. It might be possible to do userland live migration,
but it's definitely not part of our current scope. But if we're talking
about live migration, have you also looked at the work of
Andres Lagar Cavilla on SnowFlock?
http://andres.lagarcavilla.com/publications/LagarCavillaEurosys09.pdf
He does live migration of entire virtual machines, again with very
small delay. Of course, the issue for any type of live migration is that
if the rate of dirtying pages is very high (e.g. HPC), then there is
still a delay or slow response, due to page faults to a remote host.
> >And even if both ends are under checkpoint control,
> >Skype is probably not a good use case for C/R, but if it were, it might
> >indeed be a difficult problem. (I'd have to think about it.)
> > As before, remember that we are talking about two different approaches:
> >- in-kernel C/R and capturing every possible application;
> >- userland C/R and covering the actual use cases that one finds in practice
>
> I'd assume that if the c/r engine can do the former, then it
> will also do the latter. Maybe even it would be useful for dmtcp
> to be able to use a couple of syscalls (checkpoint,restart) to
> do the base c/r work :p
Yes, we have no objection to combining ideas from DMTCP and Linux C/R.
This is not a case of either-or. Let's study the use cases together.
I won't say more, because I'm clearly getting cranky right now. :-)
> Oren.
Hi,
Ok, I'll bite the bullet for now - to be continued...
Just one important clarification:
>> Linux-cr can do live migration - e.g. VDI, move the desktop - in
>> which case skype's sockets' network stacks are reconstructed,
>> transparently to both skype (local apps) and the peer (remote apps).
>> Then, at the destination host, skype continues to work.
>
> That's a really cool thing to do, and it's definitely not part of what
> DMTCP does. It might be possible to do userland live migration,
> but it's definitely not part of our current scope. But if we're talking
> about live migration, have you also looked at the work of
> Andres Lagar Cavilla on SnowFlock?
> http://andres.lagarcavilla.com/publications/LagarCavillaEurosys09.pdf
> He does live migration of entire virtual machines, again with very
> small delay. Of course, the issue for any type of live migration is that
> if the rate of dirtying pages is very high (e.g. HPC), then there is
> still a delay or slow response, due to page faults to a remote host.
VMware, Xen and KVM already do live migration. However, VMs
are a separate beast.
We are concerned about _application_ level c/r and migration
(complete containers or individual applications). Many proven
techniques from the VM world apply to our context too (in your
example, post-copy migration).
Oren.
Thanks for the careful response, Oren. For others who read this,
one could interpret Oren's rapid post as criticizing the work of
Andres Lagar Cavilla. I'm sure that this was not Oren's intention.
Please read below for a brief clarification of the novelty of SnowFlock.
Anyway, I really look forward to the phone discussion. I've also
enjoyed our interchange, for giving me an opportunity to explain more about
the DMTCP design. Thank you.
Best wishes,
- Gene
On Mon, Nov 08, 2010 at 01:14:12PM -0500, Oren Laadan wrote:
> Hi,
>
> Ok, I'll bite the bullet for now - to be continued...
>
> Just one important clarification:
>
> >>Linux-cr can do live migration - e.g. VDI, move the desktop - in
> >>which case skype's sockets' network stacks are reconstructed,
> >>transparently to both skype (local apps) and the peer (remote apps).
> >>Then, at the destination host, skype continues to work.
> >
> >That's a really cool thing to do, and it's definitely not part of what
> >DMTCP does. It might be possible to do userland live migration,
> >but it's definitely not part of our current scope. But if we're talking
> >about live migration, have you also looked at the work of
> >Andres Lagar Cavilla on SnowFlock?
> > http://andres.lagarcavilla.com/publications/LagarCavillaEurosys09.pdf
> >He does live migration of entire virtual machines, again with very
> >small delay. Of course, the issue for any type of live migration is that
> >if the rate of dirtying pages is very high (e.g. HPC), then there is
> >still a delay or slow response, due to page faults to a remote host.
>
> VMware, Xen and KVM already do live migration. However, VMs
> are a separate beast.
I absolutely agree with your point that live migration of
applications is a different beast, and technically very novel.
Since I know Andres Lagar Cavilla personally, I also feel obligated
to comment on why SnowFlock truly is novel in the VM space. First, as Andres
writes:
"SnowFlock is an open-source project [SnowFlock] built on the Xen 3.0.3
VMM [Barham 2003]."
In the abstract, Andres points out one of the major points of novelty:
"To evaluate SnowFlock, we focus on the demanding
scenario of services requiring on-the-fly creation of hundreds
of parallel workers in order to solve computationally intensive
queries in seconds."
We must be careful that we don't destroy someone's reputation without
a careful study of their work.
> We are concerned about _application_ level c/r and migration
> (complete containers or individual applications). Many proven
> techniques from the VM world apply to our context too (in your
> example, post-copy migration).
>
> Oren.
On 11/08/2010 01:37 PM, Gene Cooperman wrote:
> Thanks for the careful response, Oren. For others who read this,
> one could interpret Oren's rapid post as criticizing the work of
> Andres Lagar Cavilla. I'm sure that this was not Oren's intention.
> Please read below for a brief clarification of the novelty of SnowFlock.
Err... yes, that was careless of me. I was too focused on
getting the thread back on track. Thanks for pointing it out.
>>> about live migration, have you also looked at the work of
>>> Andres Lagar Cavilla on SnowFlock?
>>> http://andres.lagarcavilla.com/publications/LagarCavillaEurosys09.pdf
>>> He does live migration of entire virtual machines, again with very
>>> small delay. Of course, the issue for any type of live migration is that
>>> if the rate of dirtying pages is very high (e.g. HPC), then there is
>>> still a delay or slow response, due to page faults to a remote host.
>>
>> VMware, Xen and KVM already do live migration. However, VMs
>> are a separate beast.
>
> I absolutely agree with your point that live migration of
> applications is a different beast, and technically very novel.
> Since I know Andres Lagar Cavilla personally, I also feel obligated
> to comment on why SnowFlock truly is novel in the VM space. First, as Andres
> writes:
> "SnowFlock is an open-source project [SnowFlock] built on the Xen 3.0.3
> VMM [Barham 2003]."
> In the abstract, Andres points out one of the major points of novelty:
> "To evaluate SnowFlock, we focus on the demanding
> scenario of services requiring on-the-fly creation of hundreds
> of parallel workers in order to solve computationally intensive
> queries in seconds."
> We must be careful that we don't destroy someone's reputation without
> a careful study of their work.
Yes, it's really nice work - I saw it when I visited there.
(Coincidentally, the post-copy idea with Xen also appeared in
VEE 09 shortly before).
Oren.
GC> As before, Oren, let's have that phone discussion so that we can
GC> preprocess a lot of this, instead of acting like the three
GC> blind men and the elephant. I will _tell you_ the strengths and
GC> weaknesses of DMTCP on the phone, instead of you having to guess
GC> at them here on LKML. And of course, I hope you will be similarly
GC> frank about Linux C/R on the phone.
I want to be in on that discussion too, as do a lot of other people
here. However, I doubt we'll all be able to find a common spot on our
collective schedules, nor would that conversation be archived for
posterity. I think sticking to LKML is the right (and time-tested)
approach.
OL> Linux-cr can do live migration - e.g. VDI, move the desktop - in
OL> which case skype's sockets' network stacks are reconstructed,
OL> transparently to both skype (local apps) and the peer (remote
OL> apps). Then, at the destination host, skype continues to work.
GC> That's a really cool thing to do, and it's definitely not part of
GC> what DMTCP does. It might be possible to do userland live
GC> migration, but it's definitely not part of our current scope.
How would you go about doing that in userland? With the current
linux-cr implementation, I can move something like sshd or sendmail
from one machine to another without a remote (connected) client
noticing anything more than a bit of delay during the move.
I think that saving and restoring the state of a TCP connection from
userland is probably a good example of a case where it makes sense to
have it as part of a C/R function, but not necessarily exposed in /sys
or /proc somewhere. Unless it can be argued that doing so is not
useful, I think that's a good talking point for discussing the kernel
vs. user approach, no?
--
Dan Smith
IBM Linux Technology Center
Hello, sorry about the long delay. Was lost in something else.
On 11/06/2010 01:36 AM, Kapil Arya wrote:
>> I'm probably missing something but can't you stop the application
>> using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry
>> about -EINTR failures (there are some exceptions but nothing really to
>> worry about). Also, unless the manager thread needs to be always
>> online, you can inject manager thread by manipulating the target
>> process states while taking a snapshot.
>
> In fact CryoPid uses exactly the same approach and has been around
> for about 5 years. Not as much development effort has gone into
> CryoPid as DMTCP and so its application coverage is not as
> broad. But the larger issue for using PTRACE is that you can not
> have two superiors tracing the same inferior process. So if you want
> to checkpoint a gdb session or valgrind or tmux or strace, then you
> can not directly control and quiesce the inferior process being
> traced.
I've been thinking about this. We can easily introduce a new ptrace
call which allows neseting. AFAICS, ptrace already exports most of
information necessary to restart the task - where it's stopped and
why. The only missing thing seems to be the wait state (including for
group stop) which can be added without too much difficulty. I'll try
to write up an RFC patch. Things like that would be useful for other
things too - say, you would be able to attach gdb to a strace'd
process, which would come in handy in some cases.
> Beyond that, we also have a vision (not yet implemented) of process
> virtualization by which one can change the behavior of a
> program. For example, if a distributed computation runs over
> infiniband, can we migrate to a TCP/IP cluster. For this, one needs
> the flexibility of wrappers around system calls. This vision of
> process virtualization also motivates why our own research project
> has steered away from in-kernel C/R.
Yeah, definitely, for the higher level workarounds, there's no way
around it but I think it would still be worthwhile to be able to
provide a baseline implementation which can checkpoint and restart a
single process in a reliable and well-defined way.
>>> But since you ask :-), there is one thing on our wish list. We
>>> handle address space randomization, vdso, vsyscall, and so on quite
>>> well. We do not turn off address space randomization (although on
>>> restart, we map user segments back to their original addresses).
>>> Probably the randomized value of brk (end-of-data or end of heap) is
>>> the thing that gave us the most troubles and that's where the code
>>> is the most hairy.
>>
>> Can you please elaborate a bit? What do you want to see changed?
>
> Yes, we would love to elaborate :-). We began DMTCP with Linux
> kernel 2.6.3. When Address Space Layout Randomization was added, we
> were forced to add some hacks concerning VDSO location and
> end-of-data. end-of-data is the uglier part. On restart, we
> directly map each memory segment into the original address at
> checkpoint time. The issue comes in mapping heap back to its
> original location. We call sbrk() to reset the end-of-data to the
> end of the original heap. This fails if the randomized
> beginning-of-data/end-of-data given to us by the kernel for the
> restarted process is too far away from where we want to remap the
> heap. To get around this, we play games with legacy layout, other
> personality parameters, and RLIMIT_STACK (since the kernel uses
> RLIMIT_STACK in choosing the appropriate memory layout).
>
> For our wish list, we would like a way of telling the kernel, where
> to set beginning-of-data/end-of-data. Curiously enough, at the time
> at which Linux started randomizing address space, there was
> discussion of offering exactly this facility for the sake of legacy
> programs, but it turned out not to be needed.
I see. Yeah, I completely forgot that kernel keeps track of brk.
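The restart-time dance described in the quote above boils down to
something like this (a rough sketch; restore_brk and saved_brk are
invented names, and the real DMTCP code also plays the personality and
RLIMIT_STACK games mentioned above):

    #define _DEFAULT_SOURCE
    #include <unistd.h>

    /* After remapping the saved heap pages to their original addresses,
     * move the kernel's notion of end-of-data back to the saved value. */
    int restore_brk(void *saved_brk)
    {
        /* Succeeds only if the randomized brk the kernel chose for the
         * restarted process is close enough to the saved one; otherwise
         * fall back to legacy-layout and RLIMIT_STACK tricks. */
        return brk(saved_brk);
    }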
> Similarly, it would be nice to tell the kernel where we want the
> VDSO page. Currently, we get around this by keeping two VDSO pages,
> the old one which we restore and the new one specified to us by the
> kernel when the restart process is created. This works well for us,
> and so controlling the address of the VDSO page is less important
> to us.
I haven't really looked at the VDSO generation but symbol offsets
inside VDSO page can differ depending on kernel version,
configuration, toolchains used, etc... right? You would need an extra
layer of indirection no matter what in that case.
> Since /proc/*/net provides a simpler design for sockets, we started
> wondering what other simplifications may be possible. Here is one
> possibility, in the case of shared file descriptors, DMTCP goes
> through two barriers in order to decide which process will be
> responsible for checkpointing which shared-file descriptor. It works
> and the overhead is reasonable, but if you have additional
> suggestion for this case, we would be very interested.
I wrote in another mail but you can find out which fd's are shared by
flipping O_NONBLOCK and looking at the flags field of
/proc/*/fdinfo/*. Or are you talking about something else?
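For concreteness, the trick Tejun describes might look like this (a
sketch; fds_share_file and fdinfo_flags are invented names, and a
robust tool would also handle errors and races):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>

    /* Read the octal "flags:" field of /proc/<pid>/fdinfo/<fd>. */
    static long fdinfo_flags(pid_t pid, int fd)
    {
        char path[64], line[128];
        unsigned int flags;
        long ret = -1;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/fdinfo/%d", (int)pid, fd);
        f = fopen(path, "r");
        if (!f)
            return -1;
        while (fgets(line, sizeof(line), f))
            if (sscanf(line, "flags: %o", &flags) == 1) {
                ret = flags;
                break;
            }
        fclose(f);
        return ret;
    }

    /* Flip O_NONBLOCK on our fd; if the flags seen through the other
     * task's fdinfo flip too, both fds share one struct file. */
    int fds_share_file(int my_fd, pid_t other_pid, int other_fd)
    {
        int old = fcntl(my_fd, F_GETFL);
        long seen;

        fcntl(my_fd, F_SETFL, old ^ O_NONBLOCK);
        seen = fdinfo_flags(other_pid, other_fd);
        fcntl(my_fd, F_SETFL, old);        /* restore original flags */
        if (seen < 0)
            return -1;
        return ((int)seen & O_NONBLOCK) != (old & O_NONBLOCK);
    }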
> We really enjoyed this discussion. If you are interested, we would
> be happy to talk further by phone in order to take advantage of the
> higher bandwidth.
As a few others have already pointed out, I think it's better to keep
technical discussions on-line. Different people think at different
paces and the schedules don't always match. Plus, other people can
jump in and look up things later. It may take a bit more effort at
the beginning but I think it gets easier in time.
Thank you.
--
tejun
Hello,
On 11/08/2010 08:05 PM, Dan Smith wrote:
> GC> As before, Oren, let's have that phone discussion so that we can
> GC> preprocess a lot of this, instead of acting like the the three
> GC> blind men and the elephant. I will _tell you_ the strengths and
> GC> weaknesses of DMTCP on the phone, instead of you having to guess
> GC> at them here on LKML. And of course, I hope you will be similarly
> GC> frank about Linux C/R on the phone.
>
> I want to be in on that discussion too, as do a lot of other people
> here. However, I doubt we'll all be able to find a common spot on our
> collective schedules, nor would that conversation be archived for
> posterity. I think sticking to LKML is the right (and time-tested)
> approach.
Amen.
> OL> Linux-cr can do live migration - e.g. VDI, move the desktop - in
> OL> which case skype's sockets' network stacks are reconstructed,
> OL> transparently to both skype (local apps) and the peer (remote
> OL> apps). Then, at the destination host, skype continues to work.
>
> GC> That's a really cool thing to do, and it's definitely not part of
> GC> what DMTCP does. It might be possible to do userland live
> GC> migration, but it's definitely not part of our current scope.
>
> How would you go about doing that in userland? With the current
> linux-cr implementation, I can move something like sshd or sendmail
> from one machine to another without a remote (connected) client
> noticing anything more than a bit of delay during the move.
>
> I think that saving and restoring the state of a TCP connection from
> userland is probably a good example of a case where it makes sense to
> have it as part of a C/R function, but not necessarily exposed in /sys
> or /proc somewhere. Unless it can be argued that doing so is not
> useful, I think that's a good talking point for discussing the kernel
> vs. user approach, no?
Meh, just implementing a conntrack module should be good enough for
most use cases. If it ever becomes a general enough problem (which I
extremely strongly doubt), we can think about allowing processes in a
netns to change sequence number but that would be a single setsockopt
option instead of the horror show of dumping in-kernel data structures
in a binary blob.
Thanks.
--
tejun
Hello, Oren.
On 11/07/2010 10:59 PM, Oren Laadan wrote:
> We could work to add ABIs and APIs for each and every possible piece
> of state that affects userspace. And for each we'll argue forever
> about the design and some time later regret that it wasn't designed
> correctly :p
I'm sorry but in-kernel CR already looks like a major misdesign to me.
> Even if that happens (which is very unlikely and unnecessary),
> it will generate all the very same code in the kernel that Tejun
> has been complaining about, and _more_. And we will still suffer
> from issues such as lack of atomicity and being unable to do many
> simple and advanced optimizations.
It may be harder but those will be localized for specific features
which would be useful for other purposes too. With in-kernel CR,
you're adding a bunch of intrusive changes which can't be tested or
used apart from CR.
> Or we could use linux-cr for that: do the c/r in the kernel,
> keep the know-how in the kernel, expose (and commit to) a
> per-kernel-version ABI (not vow to keep countless new individual
> ABIs forever after getting them wrongly...), be able to do all
> sorts of useful optimization and provide atomicity and guarantees
> (see under "leak detection" in the OLS linux-cr paper). Also,
> once the c/r infrastructure is in the kernel, it will be easy
(and encouraged) to support newly introduced features.
And the only reason it seems easier is because you're working around
the ABI problem by declaring that these binary blobs wouldn't be kept
compatible between different kernel versions and configurations. That
simply is the wrong approach. If you want to export something, build
it properly into ABI.
> Finally, then we would use dmtcp as well as other tools on top
> of the kernel-cr - and I'm looking forward to do that !
Yeah, this part I agree. The higher level workarounds implemented in
dmtcp are quite impressive and useful no matter what happens to lower
layer.
Thanks.
--
tejun
On 11/17/2010 11:45 AM, Tejun Heo wrote:
>> Since /proc/*/net provides a simpler design for sockets, we started
>> wondering what other simplifications may be possible. Here is one
>> possibility, in the case of shared file descriptors, DMTCP goes
>> through two barriers in order to decide which process will be
>> responsible for checkpointing which shared-file descriptor. It works
>> and the overhead is reasonable, but if you have additional
>> suggestion for this case, we would be very interested.
>
> I wrote in another mail but you can find out which fd's are shared by
> flipping O_NONBLOCK and looking at the flags field of
> /proc/*/fdinfo/*. Or are you talking about something else?
Ooh, one more thing, /proc/*/net/* has tx/rx queue counts. With
those, you wouldn't need the cookie based connection draining, right?
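(For concreteness, a sketch of reading those counters from
/proc/net/tcp, where the fifth field is tx_queue:rx_queue in hex; a
real tool would match the socket of interest by its inode:)

    #include <stdio.h>

    int main(void)
    {
        char line[512];
        unsigned long tx, rx;
        FILE *f = fopen("/proc/net/tcp", "r");

        if (!f)
            return 1;
        fgets(line, sizeof(line), f);      /* skip the header row */
        while (fgets(line, sizeof(line), f))
            if (sscanf(line, "%*s %*s %*s %*s %lx:%lx", &tx, &rx) == 2)
                printf("tx_queue=%lu rx_queue=%lu\n", tx, rx);
        fclose(f);
        return 0;
    }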
Thanks.
--
tejun
Quoting Tejun Heo ([email protected]):
> Hello, Oren.
>
> On 11/07/2010 10:59 PM, Oren Laadan wrote:
> > We could work to add ABIs and APIs for each and every possible piece
> > of state that affects userspace. And for each we'll argue forever
> > about the design and some time later regret that it wasn't designed
> > correctly :p
>
> I'm sorry but in-kernel CR already looks like a major misdesign to me.
By this do you mean the very idea of having CR support in the kernel?
Or our design of it in the kernel? Let's go back to July 2008, at the
containers mini-summit, where it was unanimously agreed upon that the
kernel was the right place (Checkpoint/Restart [CR] under
http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that
we would start by supporting a single task with no resources. Was that
whole discussion effectively misguided, in your opinion? Or do you
feel that since the first steps outlined in that discussion we've
either "gone too far" or strayed in the subsequent design?
-serge
Hello, Serge.
On 11/17/2010 04:39 PM, Serge E. Hallyn wrote:
>> I'm sorry but in-kernel CR already looks like a major misdesign to me.
>
> By this do you mean the very idea of having CR support in the kernel?
> Or our design of it in the kernel?
The former, I'm afraid.
> Let's go back to July 2008, at the containers mini-summit, where it
> was unanimously agreed upon that the kernel was the right place
> (Checkpoint/Restart [CR] under
> http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that
> we would start by supporting a single task with no resources. Was
> that whole discussion effectively misguided, in your opinion? Or do
> you feel that since the first steps outlined in that discussion
> we've either "gone too far" or strayed in the subsequent design?
The conclusion doesn't seem like such a good idea, well, at least to
me for what it's worth. Conclusions at summits don't carry decisive
weight. It'll still have to prove its worthiness for mainline all the
same, and in light of an already working userland alternative and the
expanded area now covered by virtualization, the arguments in this
thread don't seem too strong.
Thanks.
--
tejun
On Wed, Nov 17, 2010 at 12:57:40PM +0100, Tejun Heo wrote:
> Hello, Oren.
>
> On 11/07/2010 10:59 PM, Oren Laadan wrote:
<snip>
>
> > Even if that happens (which is very unlikely and unnecessary),
> > it will generate all the very same code in the kernel that Tejun
> > has been complaining about, and _more_. And we will still suffer
> > from issues such as lack of atomicity and being unable to do many
> > simple and advanced optimizations.
>
> It may be harder but those will be localized for specific features
> which would be useful for other purposes too. With in-kernel CR,
> you're adding a bunch of intrusive changes which can't be tested or
> used apart from CR.
You seem to be arguing "Z is only testable/useful for doing the things Z
was made for". I couldn't agree more with that. CR is useful for:
Fault-tolerance (typical HPC)
Load-balancing (less-typical HPC)
Debugging (simple [e.g. instead of coredumps] or complex
time-reversible)
Embedded devices that need to deal with persistent low-memory
situations.
I think Oren's Kernel Summit presentation succinctly summarized these:
http://www.cs.columbia.edu/~orenl/talks/ksummit-2010.pdf
My personal favorite idea (that hasn't been implemented yet) is an
application startup cache. I've been wondering if caching bash startup
after all the shared libraries have been searched, loaded, and linked
couldn't save a bunch of time spent in shell scripts. Post-link actually
seems like a checkpoint in application startup which would be generally
useful too. Of course you'd want to flush [portions of] the cache when
packages get upgraded/removed or shell PATHs change and the caches
would have to be per-user.
I'm less confident but still curious about caching after running rc
scripts (less confident because it would depend highly on the content
of the rc scripts). A scripted boot, for example, might be able to save
some time if the same rc scripts are run and they don't vary over time.
That in turn might be useful for carefully-tuned boots on embedded devices.
That said we don't currently have code for application caching. Yet we
can't be expected to write tools for every possible use of our API in
order to show just how true your tautology is.
>
> > Or we could use linux-cr for that: do the c/r in the kernel,
> > keep the know-how in the kernel, expose (and commit to) a
> > per-kernel-version ABI (not vow to keep countless new individual
Oren, that statement might be read to imply that it's based on
something as useless as kernel version numbers. Arnd has pointed out in the
past how unsuitable that is and I tend to agree. There are at least two
possible things we can relate it to: the SHA of the compiled kernel tree
(which doesn't quite work because it assumes everybody uses git trees :( ),
or perhaps the SHA/hash of the cpp-processed checkpoint_hdr.h. We could
also stuff that header into the kernel (much like kconfigs are output from
/proc) for programs that want the kernel to describe the ABI to them.
> > ABIs forever after getting them wrongly...), be able to do all
> > sorts of useful optimization and provide atomicity and guarantees
> > (see under "leak detection" in the OLS linux-cr paper). Also,
> > once the c/r infrastructure is in the kernel, it will be easy
> > (and encouraged) to support newly introduced features.
>
> And the only reason it seems easier is because you're working around
> the ABI problem by declaring that these binary blobs wouldn't be kept
> compatible between different kernel versions and configurations. That
Not true. First of all, if you look at checkpoint_hdr.h, the contents and
layout of the structs don't vary according to kernel configurations.
Secondly, we have taken measures to increase the likelihood that the
structures will remain compatible. We've designed them to lay out the
same on 32-bit and 64-bit variants of an arch. We add to the end of the
structs. We use an explicit length field in the header of each section
to ensure that changes in the size of the structures don't necessarily
break compatibility.
That said, yes, these measures don't absolutely preclude incompatibility.
They will however make compatibility more likely.
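Concretely, the conventions just described might look roughly like the
following (an illustrative sketch with invented field names, not a
copy of the real checkpoint_hdr.h):

    #include <stdint.h>

    /* Every section starts with a header carrying an explicit length,
     * so a reader can skip over fields a newer kernel appended. */
    struct ckpt_hdr {
        uint32_t type;              /* what kind of section follows */
        uint32_t len;               /* total length, header included */
    };

    struct ckpt_task_example {
        struct ckpt_hdr h;
        /* fixed-width types only, so 32-bit and 64-bit kernels lay the
         * struct out identically; pointers are stored as 64-bit values */
        uint32_t state;
        uint32_t exit_signal;
        uint64_t clear_tid_ptr;
        /* new fields are appended at the end, never inserted */
    };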
Then there's the fact that structures like siginfo (for example) so rarely
change because they're already part of an ABI. That in turn means that the
corresponding checkpoint ABI rarely changes (we don't reuse the existing
struct because that would require compat-syscall-style code).
Most of the time, in fact, the fields we output are there only because
they reflect the 'model' of how things work that the kernel presents to
userspace. That model also rarely changes (we've never gotten rid of the
POSIX concept of process groups in one extreme example). Perhaps the
closest thing we have to wholly-kernel-internal data structures are the
signal/sighand structs which echo the way these fields are split from the
task struct and shared between tasks. Though I'd argue that gets back into
the 'model' presented to userspace (via fork/clone) anyway...
I'd estimate that the biggest 'model' changes have come via various
filesystem interfaces over the years. We don't checkpoint tasks with open
sysfs, /proc, or debugfs files (for example) so that's not part of our
ABI and we don't intend to make it so.
Nor do we output wholly kernel-internal structures and fields that are
often chosen for their performance benefits (e.g. rbtrees, linked lists,
hash tables, idrs, various caches, locks, RCU heads, refcounts, etc). So
the kernel is free to change implementations without affecting our ABI.
The compatibility has natural limits. For instance we can't ever
restart an x86_64 binary on a 32-bit kernel. If you add a new syscall
interface (e.g. fanotify) then you can't use a checkpoint of a task that
makes use of it on fanotify-disabled kernels. Yet these limitations exist
no matter where or how you choose to implement checkpoint/restart.
We've made almost every effort at making this a proper ABI (I say
'almost' because we still need to export a description of it at runtime
and we need to do something better in place of the logfd output). Still,
the essentials of a proper checkpoint/restart ABI are already there.
Cheers,
-Matt Helsley
On 11/17/2010 06:46 PM, Tejun Heo wrote:
> Hello, Serge.
>
> On 11/17/2010 04:39 PM, Serge E. Hallyn wrote:
>>> I'm sorry but in-kernel CR already looks like a major misdesign to me.
>>
>> By this do you mean the very idea of having CR support in the kernel?
>> Or our design of it in the kernel?
>
> The former, I'm afraid.
Can you elaborate on this please?
>> Let's go back to July 2008, at the containers mini-summit, where it
>> was unanimously agreed upon that the kernel was the right place
>> (Checkpoint/Restart [CR] under
>> http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that
>> we would start by supporting a single task with no resources. Was
>> that whole discussion effectively misguided, in your opinion? Or do
>> you feel that since the first steps outlined in that discussion
>> we've either "gone too far" or strayed in the subsequent design?
>
> The conclusion doesn't seem like such a good idea, well, at least to
> me for what it's worth. Conclusions at summits don't carry decisive
> weight. It'll still have to prove its worthiness for mainline all the
> same, and in light of an already working userland alternative and the
> expanded area now covered by virtualization, the arguments in this
> thread don't seem too strong.
>
> Thanks.
>
Hello, Pavel.
On 11/18/2010 10:13 AM, Pavel Emelyanov wrote:
>>> By this do you mean the very idea of having CR support in the kernel?
>>> Or our design of it in the kernel?
>>
>> The former, I'm afraid.
>
> Can you elaborate on this please?
I think I already did that several times in this thread but here's an
attempt at summary.
* It adds a bunch of pseudo ABI when most of the same information is
available via already established ABI.
* In a way which can only ever be used and tested by CR. If possible,
kernel should provide generic mechanisms which can be used to
implement features in userland. One of the reasons why we'd like to
export small basic building blocks instead of full end-to-end
solutions from the kernel is that we don't know how things will
change in the future. In-kernel CR puts too much in the kernel in
far too inflexible a manner.
* It essentially adds a separate complete set of entry/exit points for
a lot of things, which makes things more error prone and increases
maintenance overhead across the board.
* And, most of all, there are userland implementation and
virtualization, making the benefit to overhead ratio completely off.
Userland implementation _already_ achieves most of what's necessary
for the most important use case of HPC without any special help from
the kernel. The only reasonable thing to do is taking a good look
at it and finding ways to improve it.
Thanks.
--
tejun
Hello, Matt.
On 11/17/2010 11:17 PM, Matt Helsley wrote:
>> It may be harder but those will be localized for specific features
>> which would be useful for other purposes too. With in-kernel CR,
>> you're adding a bunch of intrusive changes which can't be tested or
>> used apart from CR.
>
> You seem to be arguing "Z is only testable/useful for doing the things Z
> was made for". I couldn't agree more with that. CR is useful for:
I'm saying it's far too narrowly scoped and inflexible to be a kernel
feature. Kernel features should be like the basic tools, you know,
hammers, saws, drills and stuff. In-kernel CR is more like an
overcomplicated food processor which usually sits in the top drawer
after the first several runs,
> Fault-tolerance (typical HPC)
> Load-balancing (less-typical HPC)
> Debugging (simple [e.g. instead of coredumps] or complex
> time-reversible)
> Embedded devices that need to deal with persistent low-memory
> situations.
which can do all of the above, a lot of which can be achieved in a
less messy way than putting the whole thing inside the kernel.
> My personal favorite idea (that hasn't been implemented yet) is an
> application startup cache. I've been wondering if caching bash startup
> after all the shared libraries have been searched, loaded, and linked
> couldn't save a bunch of time spent in shell scripts. Post-link actually
> seems like a checkpoint in application startup which would be generally
> useful too. Of course you'd want to flush [portions of] the cache when
> packages get upgraded/removed or shell PATHs change and the caches
> would have to be per-user.
What does that have anything to do with the kernel? If you want
post-link cache, implement it in ld.so where it belongs. That's like
using a food processor to mix cement.
> I'm less confident but still curious about caching after running rc
> scripts (less confident because it would depend highly on the content
> of the rc scripts). A scripted boot, for example, might be able to save
> some time if the same rc scripts are run and they don't vary over time.
> That in turn might be useful for carefully-tuned boots on embedded devices.
>
> That said we don't currently have code for application caching. Yet we
> can't be expected to write tools for every possible use of our API in
> order to show just how true your tautology is.
Continuing the same line of thought. It _CAN_ be used to do that in a
convoluted way but there are better ways to solve those problems.
> Most of the time, in fact, the fields we output are there only because
> they reflect the 'model' of how things work that the kernel presents to
> userspace. That model also rarely changes (we've never gotten rid of the
> POSIX concept of process groups in one extreme example). Perhaps the
> closest thing we have to wholly-kernel-internal data structures are the
> signal/sighand structs which echo the way these fields are split from the
> task struct and shared between tasks. Though I'd argue that gets back into
> the 'model' presented to userspace (via fork/clone) anyway...
Yeah, exactly, so just do it inside the established ABI, extending
where it makes sense. No reason to add a whole separate set.
Thanks.
--
tejun
On 11/17/2010 10:46 AM, Tejun Heo wrote:
> Hello, Serge.
>
> On 11/17/2010 04:39 PM, Serge E. Hallyn wrote:
>>> I'm sorry but in-kernel CR already looks like a major misdesign to me.
>>
>> By this do you mean the very idea of having CR support in the kernel?
>> Or our design of it in the kernel?
>
> The former, I'm afraid.
>
>> Let's go back to July 2008, at the containers mini-summit, where it
>> was unanimously agreed upon that the kernel was the right place
>> (Checkpoint/Restart [CR] under
>> http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that
>> we would start by supporting a single task with no resources. Was
>> that whole discussion effectively misguided, in your opinion? Or do
>> you feel that since the first steps outlined in that discussion
>> we've either "gone too far" or strayed in the subsequent design?
>
> The conclusion doesn't seem like such a good idea, well, at least to
> me for what it's worth. Conclusions at summits don't carry decisive
> weight. It'll still have to prove its worthiness for mainline all the
> same, and in light of an already working userland alternative and the
> expanded area now covered by virtualization, the arguments in this
> thread don't seem too strong.
While it's your opinion that userland alternatives "already work",
in reality they are unsuitable for several real use-cases. The
userland approach has serious restrictions - which I will cover
in a follow-up post to my discussion with Gene soon.
Note that one important point of agreement was that DMTCP's ability
to provide "glue" to restart applications without their original
context is _orthogonal_ to how the core c/r is done. IOW - the
exciting goodies from DMTCP are useful with either form of c/r.
You also argue that "virtualization" (VMs?) covers everything else,
implying that lightweight virtualization is useless. In reality it
is an important technology, already in the kernel (surely you don't
suggest pulling it out?!) and for a reason. That is already a very
good reason to provide, e.g., container c/r and live-migration to
keep it competitive and useful.
Thanks,
Oren.
On Thu, 18 Nov 2010 10:48:34 +0100
Tejun Heo <[email protected]> wrote:
> Hello, Pavel.
>
> On 11/18/2010 10:13 AM, Pavel Emelyanov wrote:
> >>> By this do you mean the very idea of having CR support in the
> >>> kernel? Or our design of it in the kernel?
> >>
> >> The former, I'm afraid.
> >
> > Can you elaborate on this please?
>
> I think I already did that several times in this thread but here's an
> attempt at summary.
Yet the arguments seem to be vague enough not to be convincing to the
people working on the code.
> * It adds a bunch of pseudo ABI when most of the same information is
> available via the already established ABI.
Can you elaborate on this? What established ABI are you proposing we
use here? Hopefully we can turn this into a more technical discussion.
> * In a way which can only ever be used and tested by CR. If possible,
So what if it can only be tested with CR, as long as we can make CR
work on a variety of environments? Scalability changes for _really_
large SMP boxes can only be reliably tested by people with such
equipment. We are not imposing any such restriction, and this code can
be tested on a very wide range of setups.
> kernel should provide generic mechanisms which can be used to
> implement features in userland. One of the reasons why we'd like to
> export small basic building blocks instead of full end-to-end
> solutions from the kernel is that we don't know how things will
> change in the future. In-kernel CR puts too much in the kernel in a
> way too inflexible manner.
>
> * It essentially adds a separate complete set of entry/exit points for
> a lot of things, which makes things more error prone and increases
> maintenance overhead across the board.
I partially agree with you here. There will be maintenance overhead
every time you add code to the kernel that _may_ make changes in the
future more complicated. This is true for _any_ code that is added to
the core kernel. Now, in my experience such maintenance burden is most
disruptive when the code being added creates a lot of new state that
needs to be tracked in multiple places unrelated to CR (in this case).
Our argument is that the CR code is not creating new state that will
cause painful future changes to the kernel. If you have specific
examples that you are concerned about, great. Let's discuss those.
Are we promising zero maintenance cost? No - but neither do most
features that make it into the kernel.
Now, if we change the argument around... what would be the maintenance
cost of keeping this outside the kernel? I would argue that it is much
higher, and would use SystemTap as the first example that comes to mind.
> * And, most of all, there are userland implementations and
> virtualization, making the benefit to overhead ratio completely off.
Can we keep virtualization out of this? Every time someone mentions
virtualization as a solution, it makes me feel like these people just
don't understand the problem we are trying to solve. It is just not
practical to create a new VM for every application you want to CR.
These are two different tools to attack two different problems.
> Userland implementation _already_ achieves most of what's necessary
> for the most important use case of HPC without any special help from
What are these _most_ important cases of HPC that you are referring to?
Can we do a lot of these cases from userspace? Sure, but why are the
ones that can't be done from userspace any less important? If nobody
cared about those, we would not be having this conversation.
> the kernel. The only reasonable thing to do is to take a good look
> at it and find ways to improve it.
The userspace vs. in-kernel discussion has been done before, as
multiple people have already said in this thread. Show me a version of
userspace CR that can correctly do all that an in-kernel implementation
is capable of.
> Thanks.
>
--
Jose R. Santos
On 11/17/2010 05:17 PM, Matt Helsley wrote:
> On Wed, Nov 17, 2010 at 12:57:40PM +0100, Tejun Heo wrote:
>> Hello, Oren.
>>
>> On 11/07/2010 10:59 PM, Oren Laadan wrote:
<snip>
>>> Or we could use linux-cr for that: do the c/r in the kernel,
>>> keep the know-how in the kernel, expose (and commit to) a
>>> per-kernel-version ABI (not vow to keep countless new individual
>
> Oren, that statement might be read to imply that it's based on
> something as useless as kernel version numbers. Arnd has pointed out in the
> past how unsuitable that is and I tend to agree. There are at least two
> possible things we can relate it to: the SHA of the compiled kernel tree
> (which doesn't quite work because it assumes everybody uses git trees :( ),
> or perhaps the SHA/hash of the cpp-processed checkpoint_hdr.h. We could
> also stuff that header into the kernel (much like kconfigs are output from
> /proc) for programs that want the kernel to describe the ABI to them.
BTW, it's the same for userspace c/r: for the same set of features,
the format (ABI) remains unchanged. Adding features breaks this, a new
version is necessary, and conversion from old to new will be needed.
Moreover, supporting a new feature in userspace means adding the
proper API/ABI in the kernel, including refactoring etc., which is
even harder than adding the support for it in linux-cr.
Oren.
On 11/07/2010 09:32 PM, [email protected] wrote:
> On Sun, 7 Nov 2010, Davide Libenzi wrote:
>
>> Please, do not compare things like single file systems, drivers, or
>> otherwise fairly isolated components, with this "thing".
>> This thing touches a freaky-large number of subsystems, effectively
>> adding a glueage between them, which might end up causing problems
>> (and/or restrict design choices) in the future.
>
> I've got a question about the ABI that would be created.
>
> I see two possible areas that could be considered an ABI
>
> 1. control of the C/R process
>
> This is very clearly a userspace ABI, to be figured out and locked
> down like any other ABI
>
> 2. the details of how things are stored and added back into a system
>
> This is not as clear. At one extreme, this could be like the module
> interface, (the checkpointed image is only guaranteed to work on a new
> system with a kernel compiled with the same config options as the system
> it was checkpointed from). At the other extreme, this could be something
> that allows you to checkpoint an image on 2.6.40 and restore it on
> 2.6.80. Or it could be something in between.
>
> I don't see any way that it is sane to make the C/R image definition and
> interface (#2) be an ABI that is guaranteed to never change without
> hurting future kernel development (exactly the type of things that
> Davide is worried about above), but what sort of guarantee are people
> interested in?
Agreed. The guarantee should be to specific kernels, in a sense (see
Matt's post in this thread 11/17).
The image format is tied to the "set of features supported" (which
boils down to something like the kernel version). The format is
constructed in a modular way such that most new features can be added
without breaking the old format. For the rare cases where they do,
conversion can be done in userspace in a straightforward manner. (All
you need is to convert from N to N+1.)
>
> is it enough to say that it must be the same kernel version compiled with
> the same options? (or at least the same options for some list of things
> that matter, most device drivers probably would not matter for example)
>
> or would you need compatibility across all compile options for a kernel
> release?
>
> would you require compatibility between 2.6.x.y and 2.6.x.z?
>
> would you require compatibility between 2.6.x and 2.6.x+n (for some
> value of n)?
>
> is this something that could go in with the weakest guarantee initially,
> and then as everyone is more comfortable with it, start extending the
> guarantee (and as-needed adding code to the kernel to maintain
> compatibility with old images)?
>
> would you require compatibility between 2.6.x and 2.6.x-n?
We don't "require" compatibility. The compatibility is defined per
object (type) in the image format. New objects need not break
compatibility. Changes to objects are very rare; and when they happen
they "bump" the version. This can help avoid issues related to kernel
configs/options. Restarting an image incompatible with a particular
kernel will fail, adjustments should be done by userspace filtering.
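(For illustration only - this is not linux-cr's actual on-disk layout -
a per-object header along these lines is what makes that modularity
possible: readers can skip object types they don't know, and a version
"bump" is confined to one object type:)

    /* illustrative sketch, not the real linux-cr image format */
    #include <linux/types.h>

    struct img_obj_hdr {
            __u32 type;     /* e.g. OBJ_TASK, OBJ_FILE, OBJ_MM, ... */
            __u16 version;  /* bumped only when this object's layout changes */
            __u16 len;      /* payload length: lets readers skip unknown types */
    };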
Thanks,
Oren.
Quoting Tejun Heo ([email protected]):
> * And, most of all, there are userland implementations and
> virtualization, making the benefit to overhead ratio completely off.
> Userland implementation _already_ achieves most of what's necessary
Guess I'll just be offensive here and say, straight-out: I don't
believe it. Can I see the userspace implementation of c/r?
If it's as good as the kernel-level c/r, then awesome - we don't
need the kernel patches.
If it's not as good, then the thing is, we're not drawing arbitrary
lines saying "is this good enough"; rather, we want completely
reliable and transparent c/r. IOW, the running task and the other
end can't tell that a migration happened, and, if checkpoint says
it worked, then restart must succeed.
-serge
Quoting Tejun Heo ([email protected]):
> Hello, Serge.
Hey Tejun :)
> On 11/17/2010 04:39 PM, Serge E. Hallyn wrote:
> >> I'm sorry but in-kernel CR already looks like a major misdesign to me.
> >
> > By this do you mean the very idea of having CR support in the kernel?
> > Or our design of it in the kernel?
>
> The former, I'm afraid.
>
> > Let's go back to July 2008, at the containers mini-summit, where it
> > was unanimously agreed upon that the kernel was the right place
> > (Checkpoint/Restart [CR] under
> > http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that
> > we would start by supporting a single task with no resources. Was
> > that whole discussion effectively misguided, in your opinion? Or do
> > you feel that since the first steps outlined in that discussion
> > we've either "gone too far" or strayed in the subsequent design?
>
> The conclusion doesn't seem like such a good idea, well, at least to
> me for what it's worth. Conclusions at summits don't carry decisive
> weight.
Of course. It allows us to present at kernel summit and look for early
rejections to save us all some time (which we did, at the container
mini-summit readout at ksummit 2008), but it would be silly to read
anything more into it than that.
> It'll still have to prove its worthiness for mainline all the
> same
100% agreed.
> and in light of an already working userland alternative and the
Here's where we disagree. If you are right about a viable userland
alternative ('already working' isn't even a prereq in my opinion,
so long as it is really viable), then I'm with you, but I'm not buying
it at this point.
Seriously. Truly. Honestly. I am *not* looking for any extra kernel
work at this moment, if we can help it in any way.
> expanded area now covered by virtualization, the arguments in this
> thread don't seem too strong.
-serge
On 11/19/2010 05:10 AM, Serge Hallyn wrote:
> Hey Tejun :)
Hey, :-)
>> and in light of an already working userland alternative and the
>
> Here's where we disagree. If you are right about a viable userland
> alternative ('already working' isn't even a prereq in my opinion,
> so long as it is really viable), then I'm with you, but I'm not buying
> it at this point.
>
> Seriously. Truly. Honestly. I am *not* looking for any extra kernel
> work at this moment, if we can help it in any way.
What's so wrong with Gene's work? Sure, it has some hacky aspects but
let's fix those up. To me, it sure looks like a much saner and more
manageable approach than in-kernel CR. We can add nested ptrace,
CLONE_SET_PID (or whatever) in pidns, integrate it with various ns
supports, add an ability to adjust brk, export inotify state via
fdinfo and so on.
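(The fdinfo item is a good example of the "small generic export" idea.
At the time of this thread /proc/<pid>/fdinfo/<fd> carried only "pos:"
and "flags:"; per-fd inotify lines of the kind proposed here were only
added to later kernels, so treat the exact format as an assumption.
A minimal reader:)

    #include <stdio.h>
    #include <sys/types.h>

    /* Dump the text the kernel exports for one fd of one task. */
    static void show_fdinfo(pid_t pid, int fd)
    {
            char path[64], line[256];
            FILE *f;

            snprintf(path, sizeof(path), "/proc/%d/fdinfo/%d", pid, fd);
            f = fopen(path, "r");
            if (!f)
                    return;
            while (fgets(line, sizeof(line), f))
                    fputs(line, stdout);    /* pos:, flags:, ... */
            fclose(f);
    }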
The thing is already working, the codebase of the core part is fairly
small, and Condor is contemplating integrating it, so at least some
people in the HPC segment think it's already viable. Maybe the HPC
cluster I'm currently sitting near is a special case but people here
really don't run very fancy stuff. In most cases, they're fairly
simple (from a system POV) C programs reading/writing data and burning
a _LOT_ of CPU cycles in between, and admins here seem to think dmtcp
integrated with Condor would work well enough for them.
Sure, in-kernel CR has better or more reliable coverage now but by how
much? The basic things are already there in userland. The tradeoff
simply doesn't make any sense. If it were a well-separated,
self-sustained feature, it probably would be able to get in, but it's
all over the place and requires a completely new concept - the
quasi-ABI'ish binary blob which would probably be portable across
different kernel versions with some massaging. I personally think the
idea is fundamentally flawed (just go through the usual ABI!) but even
if it were not, it would require a _MUCH_ stronger rationale than it
currently has to even be considered for mainline inclusion.
Maybe it's just me but most of the arguments for in-kernel CR look
very weak. They're either about remote toy use cases or along the
lines that userland CR currently doesn't do everything kernel CR does
(yet). Even if it weren't for me, I frankly can't see how it would be
included in mainline.
I think it would be best for everyone to improve userland CR. A lot
of the knowledge and experience gained through kernel CR would be
applicable and won't go wasted. Strong resistance against a direction
change certainly is understandable, but IMHO pushing the current
direction would only increase the loss. I of course could be completely
wrong and might end up getting mails filled up with megabytes of "told
you so" later, but, well, at this point, in-kernel CR already looks
half dead to me.
Thank you.
--
tejun
Tejun,
Sorry for getting into the middle of the discussion, but...
Can you imagine how many userland APIs are needed to make userspace C/R?
Do you really want APIs in user-space which allow you to:
- send signals with siginfo attached (kill() doesn't work...)
- read inotify configuration
- insert SKB's into socket buffers
- setup all TCP/IP parameters for sockets
- wait for AIO pending in other processes
- setting different statistics counters (like netdev stats etc.)
and so on...
For every small piece of functionality you will need to export an ABI and maintain it forever.
It's thousands of APIs! And why the hell are they needed in user space at all?
BTW, the HPC case you are talking about is probably the simplest one. Last time I looked into it, IBM Meiosys c/r
didn't even bother with TTY migration. In OpenVZ we really do need much more than that, like
autofs/NFS support, preserved statistics, TTYs, etc. etc. etc.
Thanks,
Kirill
Hello,
On 11/19/2010 03:36 PM, Kirill Korotaev wrote:
> Can you imagine how many userland APIs are needed to make userspace C/R?
>
> Do you really want APIs in user-space which allow you to:
> - send signals with siginfo attached (kill() doesn't work...)
Doesn't rt_sigqueueinfo() already do this?
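(A minimal sketch of using it that way at restart time - glibc has no
wrapper for rt_sigqueueinfo, so it goes through syscall(2); note the
kernel rejects si_code values that would forge kernel-generated
signals, so this covers sigqueue()-style queued signals:)

    #define _GNU_SOURCE
    #include <signal.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Re-queue a signal, together with its saved siginfo, to a
     * restarted task. */
    static int requeue_signal(pid_t pid, int signo, siginfo_t *saved)
    {
            return syscall(SYS_rt_sigqueueinfo, pid, signo, saved);
    }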
> - read inotify configuration
This would be nice even apart from CR.
> - insert SKB's into socket buffers
Can't we drain kernel buffers? I.e., stop further writing and wait for
the send-q to drop to zero.
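(A minimal sketch of that wait, using the existing SIOCOUTQ ioctl,
which reports unsent plus unacked bytes on a TCP socket:)

    #include <sys/ioctl.h>
    #include <linux/sockios.h>      /* SIOCOUTQ */
    #include <unistd.h>

    /* Hold off checkpointing until nothing of ours is in flight. */
    static int wait_sendq_empty(int sock)
    {
            int pending;

            do {
                    if (ioctl(sock, SIOCOUTQ, &pending) < 0)
                            return -1;
                    if (pending)
                            usleep(10 * 1000);
            } while (pending);
            return 0;
    }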
> - setup all TCP/IP parameters for sockets
I _think_ most can be restored by talking to the netfilter module.
Setting the outgoing sequence number might be beneficial though.
> - wait for AIO pending in other processes
I haven't looked at the aio implementation for a while now, but can't
we drain these upon checkpointing and just carry the completion status?
Also, if aio is what you're concerned about, I would say the problem
is mostly solved.
> - setting different statistics counters (like netdev stats etc.)
> and so on...
Why would this matter?
> For every small piece of functionality you will need to export an ABI
> and maintain it forever. It's thousands of APIs! And why the hell
> are they needed in user space at all?
I think it's actually quite the contrary. Most things are already
visible to userland. They _have_ to be, and that's the reason why a
userland implementation can already get most things working, without
any change to the kernel, with some amount of hackery. To me in-kernel
CR seems to approach the problem from exactly the wrong direction -
rather than dealing with specific exceptions, it creates a completely
new framework which is very foreign and not useful outside of CR.
Also, think about it. Which one is better? A kernel which can fully
show its ABI-visible states to userland, or one which dumps its
internal data structures in binary blobs? To me, the latter seems
multiple orders of magnitude uglier.
> BTW, the HPC case you are talking about is probably the simplest
> one.
Yet, it is one of the most important / relevant use cases.
> Last time I looked into it, IBM Meiosys c/r didn't even bother with
> TTY migration. In OpenVZ we really do need much more than that,
> like autofs/NFS support, preserved statistics, TTYs, etc. etc. etc.
Would it be impossible to preserve autofs/NFS and TTYs from userland?
If so, why? For statistics, I'm a bit lost. Why does it matter, and
even if it does, would it justify putting the whole CR inside the kernel?
Thank you.
--
tejun
On Fri, Nov 19, 2010 at 5:33 PM, Tejun Heo <[email protected]> wrote:
>> - insert SKB's into socket buffers
>
> Can't we drain kernel buffers? I.e., stop further writing and wait for
> the send-q to drop to zero.
On send:
if network dies right after freeze, you lose.
On receive:
packets arrive after process freeze, but before network device freeze.
>> - setting different statistics counters (like netdev stats etc.)
>> and so on...
>
> Why would this matter?
Because you'll introduce a million stupid interfaces not interesting to
anyone but C/R.
On Fri, Nov 19, 2010 at 6:00 PM, Alexey Dobriyan <[email protected]> wrote:
>>> - setting different statistics counters (like netdev stats etc.)
>>> and so on...
>>
>> Why would this matter?
>
> Because you'll introduce a million stupid interfaces not interesting to
> anyone but C/R.
Just like CLONE_SET_PID.
Hello,
On 11/19/2010 05:00 PM, Alexey Dobriyan wrote:
> On Fri, Nov 19, 2010 at 5:33 PM, Tejun Heo <[email protected]> wrote:
>>> - insert SKB's into socket buffers
>>
>> Can't we drain kernel buffers? I.e., stop further writing and wait for
>> the send-q to drop to zero.
>
> On send:
> if network dies right after freeze, you lose.
Gosh, if you're really worried about that, put in a netfilter module
which would buffer and simulate acks to extract the packets before
initiating the freeze. These are fringe problems. Use fringe solutions.
> On receive:
> packets arrive after process freeze, but before network device freeze.
Just store the data somewhere. The checkpointer can drain the socket,
right?
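(A sketch of such a drain; it assumes the code runs with access to the
frozen task's socket fd, as a DMTCP-style in-process checkpoint thread
would have. The drained bytes go into the image and are handed back to
the application after restart:)

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <errno.h>

    /* Pull whatever is queued on the receive side, without blocking. */
    static ssize_t drain_recvq(int sock, char *buf, size_t len)
    {
            ssize_t n = recv(sock, buf, len, MSG_DONTWAIT);

            return (n < 0 && errno == EAGAIN) ? 0 : n;
    }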
>>> - setting different statistics counters (like netdev stats etc.)
>>> and so on...
>>
>> Why would this matter?
>
> Because you'll introduce a million stupid interfaces not interesting to
> anyone but C/R.
In this thread, how many have you guys come up with? Not even a dozen,
and most can be solved almost trivially. Seriously, what the hell..
Thanks.
--
tejun
On 11/19/2010 05:01 PM, Alexey Dobriyan wrote:
> On Fri, Nov 19, 2010 at 6:00 PM, Alexey Dobriyan <[email protected]> wrote:
>>>> - setting different statistics counters (like netdev stats etc.)
>>>> and so on...
>>>
>>> Why would this matter?
>>
>> Because you'll introduce a million stupid interfaces not interesting to
>> anyone but C/R.
>
> Just like CLONE_SET_PID.
Well, if you ask me, having pidns w/o a way to reinstate a PID from
userland is pretty silly, and, while you and I might not know them yet,
it's quite imaginable that there will be other use cases for the
capability, unlike in-kernel CR. The kernel provides building blocks,
not the whole frigging package, and for very good reasons.
--
tejun
On Fri, Nov 19, 2010 at 6:06 PM, Tejun Heo <[email protected]> wrote:
>>>> - setting different statistics counters (like netdev stats etc.)
>>>> and so on...
>>>
>>> Why would this matter?
>>
>> Because you'll introduce a million stupid interfaces not interesting to
>> anyone but C/R.
>
> In this thread, how many have you guys come up with? Not even a dozen,
> and most can be solved almost trivially. Seriously, what the hell..
I do not count them.
The paragon of absurdity is struct task_struct::did_exec .
On 11/19/2010 05:16 PM, Alexey Dobriyan wrote:
> On Fri, Nov 19, 2010 at 6:06 PM, Tejun Heo <[email protected]> wrote:
>>>>> - setting different statistics counters (like netdev stats etc.)
>>>>> and so on...
>>>>
>>>> Why would this matter?
>>>
>>> Because you'll introduce a million stupid interfaces not interesting to
>>> anyone but C/R.
>>
>> In this thread, how many have you guys come up with? Not even a dozen,
>> and most can be solved almost trivially. Seriously, what the hell..
>
> I do not count them.
>
> The paragon of absurdity is struct task_struct::did_exec .
Yeah, then go and figure out how to do that in a way which would be
useful for other purposes too, instead of trying to shove the whole
checkpointer inside the kernel. It sure would be harder, but hey,
that's the way it is.
--
tejun
On Fri, Nov 19, 2010 at 6:10 PM, Tejun Heo <[email protected]> wrote:
> Well, if you ask me, having pidns w/o a way to reinstate a PID from
> userland is pretty silly
No.
Chrome uses CLONE_NEWPID so that an exploit can't attach to processes
in the parent pidns.
> and, while you and I might not know them yet, it's
> quite imaginable that there will be other use cases for the
> capability, unlike in-kernel CR. The kernel provides building blocks,
> not the whole frigging package, and for very good reasons.
Speaking of pids, a pid's value itself is never interesting (except
maybe pid 1). It's a cookie.
CLONE_SET_PID came up only now because only C/R wants it.
On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <[email protected]> wrote:
>> The paragon of absurdity is struct task_struct::did_exec .
>
> Yeah, then go and figure out how to do that in a way which would be
> useful for other purposes too, instead of trying to shove the whole
> checkpointer inside the kernel. It sure would be harder, but hey,
> that's the way it is.
System call for one bit? This is ridiculous.
Doing execve(2) for userspace C/R is ridiculous too (and likely doesn't work).
On 11/19/2010 05:27 PM, Alexey Dobriyan wrote:
> On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <[email protected]> wrote:
>>> The paragon of absurdity is struct task_struct::did_exec .
>>
>> Yeah, then go and figure out how to do that in a way which would be
>> useful for other purposes too, instead of trying to shove the whole
>> checkpointer inside the kernel. It sure would be harder, but hey,
>> that's the way it is.
>
> System call for one bit? This is ridiculous.
Why not just a flag in a proc entry? It's a frigging single bit.
> Doing execve(2) for userspace C/R is ridiculous too (and likely
> doesn't work).
Really, whatever. Just keep doing what you're doing. Hey, if it
makes you happy, it can't be too wrong.
--
tejun
On Fri, Nov 19, 2010 at 6:32 PM, Tejun Heo <[email protected]> wrote:
> On 11/19/2010 05:27 PM, Alexey Dobriyan wrote:
>> On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <[email protected]> wrote:
>>>> The paragon of absurdity is struct task_struct::did_exec .
>>>
>>> Yeah, then go and figure out how to do that in a way which would be
>>> useful for other purposes too, instead of trying to shove the whole
>>> checkpointer inside the kernel. It sure would be harder, but hey,
>>> that's the way it is.
>>
>> System call for one bit? This is ridiculous.
>
> Why not just a flag in a proc entry? It's a frigging single bit.
Because /proc/*/did_exec is useless to anyone but C/R (even for reading!).
Because code is much simpler:
tsk->did_exec = !!tsk_img->did_exec;
+
__u8 did_exec;
On 11/19/2010 05:38 PM, Alexey Dobriyan wrote:
> On Fri, Nov 19, 2010 at 6:32 PM, Tejun Heo <[email protected]> wrote:
>> On 11/19/2010 05:27 PM, Alexey Dobriyan wrote:
>>> On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <[email protected]> wrote:
>>>>> The paragon of absurdity is struct task_struct::did_exec .
>>>>
>>>> Yeah, then go and figure how to do that in a way which would be useful
>>>> for other purposes too instead of trying to shove the whole
>>>> checkpointer inside the kernel. It sure would be harder but hey
>>>> that's the way it is.
>>>
>>> System call for one bit? This is ridiculous.
>>
>> Why not just a flag in a proc entry? It's a frigging single bit.
>
> Because /proc/*/did_exec useless to anyone but C/R (even for reading!).
I don't think you'll need a full file. Just shove it into status or
somewhere. Your argument is completely absurd. So, because exporting
a single bit is so horrible to everyone else, you want to shove the
whole frigging checkpointer inside the kernel?
> Because code is much simpler:
>
> tsk->did_exec = !!tsk_img->did_exec;
> +
> __u8 did_exec;
Sigh, yeah, except for the horror show needed to create tsk_img. Your
"paragon of absurdity" is did_exec, which is only ever used to decide
whether setpgid() should fail with -EACCES - seriously? Here's a
thought: ignore it for now and concentrate on more relevant problems.
I'm fairly sure a CR'd program malfunctioning over did_exec wouldn't
mark the beginning of the end of our civilization. You gotta be
kidding me.
--
tejun
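(For what the "shove it in status" option would look like, a minimal
sketch of the userland side. The "DidExec:" field is hypothetical -
nothing like it exists in /proc/<pid>/status - it only illustrates how
small such an export would be:)

    #include <stdio.h>
    #include <sys/types.h>

    /* HYPOTHETICAL: parse a one-bit "DidExec:" field from
     * /proc/<pid>/status; no such field exists today. */
    static int read_did_exec(pid_t pid)
    {
            char path[64], line[256];
            int val = -1;
            FILE *f;

            snprintf(path, sizeof(path), "/proc/%d/status", pid);
            f = fopen(path, "r");
            if (!f)
                    return -1;
            while (fgets(line, sizeof(line), f))
                    if (sscanf(line, "DidExec: %d", &val) == 1)
                            break;
            fclose(f);
            return val;
    }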
On Fri, Nov 19, 2010 at 6:50 PM, Tejun Heo <[email protected]> wrote:
> On 11/19/2010 05:38 PM, Alexey Dobriyan wrote:
>> On Fri, Nov 19, 2010 at 6:32 PM, Tejun Heo <[email protected]> wrote:
>>> On 11/19/2010 05:27 PM, Alexey Dobriyan wrote:
>>>> On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <[email protected]> wrote:
>>>>>> The paragon of absurdity is struct task_struct::did_exec .
>>>>>
>>>>> Yeah, then go and figure out how to do that in a way which would be
>>>>> useful for other purposes too, instead of trying to shove the whole
>>>>> checkpointer inside the kernel. It sure would be harder, but hey,
>>>>> that's the way it is.
>>>>
>>>> System call for one bit? This is ridiculous.
>>>
>>> Why not just a flag in a proc entry? It's a frigging single bit.
>>
>> Because /proc/*/did_exec is useless to anyone but C/R (even for reading!).
>
> I don't think you'll need a full file. Just shove it into status or
> somewhere. Your argument is completely absurd. So, because exporting
> a single bit is so horrible to everyone else, you want to shove the
> whole frigging checkpointer inside the kernel?
>
>> Because code is much simpler:
>>
>>     tsk->did_exec = !!tsk_img->did_exec;
>> +
>>     __u8 did_exec;
>
> Sigh, yeah, except for the horror show needed to create tsk_img.
The task_struct image work is common to both userspace and in-kernel
C/R. You _have_ to define it either way.
The simpler code is only the first line.
> Your "paragon of absurdity" is did_exec which is only ever used
> to decide whether setpgid() should fail with -EACCES, seriously?
> Here's a thought. ?Ignore it for now and concentrate on more
> relevant problems.
You're so newjerseyly now.
On Fri, 19 Nov 2010, Tejun Heo wrote:
> Hello,
>
> On 11/19/2010 03:36 PM, Kirill Korotaev wrote:
> > Can you imagine how many userland APIs are needed to make userspace C/R?
> >
> > Do you really want APIs in user-space which allow you to:
> > - send signals with siginfo attached (kill() doesn't work...)
>
> Doesn't rt_sigqueueinfo() already do this?
>
You assume that c/r is done by the checkpointed processes _themselves_,
that is, that to checkpoint a process, that process needs to be made
runnable and it will save its own state (which is the model of dmtcp,
but not of using ptrace).
This model is restrictive: it requires that you hijack the execution of
that process somehow and make it run. What if the process isn't runnable
(e.g. in vfork waiting for completion, or ptraced deep in the kernel)?
Letting it run even just a bit may modify its state. It also means that
if you have many processes in the checkpointed session, e.g. 1000, then
_all_ of them will have to be scheduled to run!
With kernel c/r this is unnecessary: you can use an auxiliary process
to checkpoint other processes without scheduling the other processes.
I.e. it's _transparent_ and _preemptive_.
Another advantage is that if anything fails during checkpoint (for
whatever reason), there are no side effects (which is not the case with
the other method).
> > For every small piece of functionality you will need to export an ABI
> > and maintain it forever. It's thousands of APIs! And why the hell
> > are they needed in user space at all?
>
> I think it's actually quite the contrary. Most things are already
> visible to userland. They _have_ to be, and that's the reason why a
> userland implementation can already get most things working, without
> any change to the kernel, with some amount of hackery. To me in-kernel
> CR seems to approach the problem from exactly the wrong direction -
> rather than dealing with specific exceptions, it creates a completely
> new framework which is very foreign and not useful outside of CR.
>
> Also, think about it. Which one is better? A kernel which can fully
> show its ABI-visible states to userland, or one which dumps its
> internal data structures in binary blobs? To me, the latter seems
> multiple orders of magnitude uglier.
Are we judging aesthetics? To me the former looks uglier...
The amount of fragile hacks you need to go through to make it work
in userspace for the generic cases (including userspace trickery
and new crazy APIs from the kernel for state that was never even an
ABI, like skb's), and the restrictions it poses, simply suggest that
userspace is not the right place to do it.
Thanks,
Oren.
Hi,
Based on my discussion with Gene, I'd like to clarify key points and
differences between the kernel and userspace approaches (specifically
linux-cr and dmtcp): three parts to break up the long post...
part I: perspective on the types and scopes of c/r under discussion
part II: linux-cr design and objectives
part III: comparison of the kernel and userspace approaches
[now relax, grab (another) cup of coffee and read on...]
PART I: ==PERSPECTIVE==
A rough classification of c/r categories:
* container-c/r: an important use-case, e.g. c/r and migration of
application containers like a VPS (virtual private server), VDI
(desktop) or another self-contained application (e.g. an Oracle server).
Here _all_ the relevant processes are included in the checkpoint.
* standalone-c/r: another use-case is standalone-c/r, where a set of
processes is checkpointed, but not the entire environment, and then
those processes are restarted in a different "eco-system".
* distributed-c/r: meaning several sets of processes, each running
on a different host. (Each set may be a separate container there.)
In container-c/r, the main challenge is to be _reliable_ in the sense
that a restart from a successful checkpoint should always succeed.
In standalone-c/r, the main challenge is that an application resumes
execution after a restart in a possibly _different_ eco-system. Some
applications don't care (e.g. 'bc'). Other applications do care, and to
different degrees; for these we need "glue" to pacify the application.
There are generally three types of "glue":
(1) Modify the application or selected libraries to be c/r-aware, and
notify it when restart completes. (e.g. CoCheck MPI library).
(2) Add a userspace helper that will run post-restart to do the
necessary trickery (e.g. send a SIGWINCH to 'screen'; mount the proper
filesystem at the new host after migration; reconnect a socket to a peer).
(3) Use interposition on selected library calls and add wrapper code
that will glue in what's missing (e.g. dbus or nscd calls to
reconnect an application to those services).
IMPORTANT: the gluing method is _orthogonal_ to how the c/r is done!
We are strictly discussing the core c/r functionality.
(next part: linux-cr philosophy...)
Thanks,
Oren.
Hi,
[continuation of posting regarding kernel vs userspace approach]
part I: perspective on the types and scopes of c/r under discussion
part II: linux-cr design and objectives
part III: comparison of the kernel and userspace approaches
PART II: ==PHILOSOPHY==
Linux-cr is a _generic_ c/r-engine with multiple capabilities. It can
checkpoint a full container, a process hierarchy, or a single process.
For containers, it provides guarantees like restart-ability; for the
others, it provides the flexibility so that c/r-aware applications,
libraries, helpers, and wrappers can glue what they wish to glue.
1) Transparent - completely transparent for container-c/r, and largely
so for standalone-cr ("largely" - as in except for the glue which is
needed due to loss of eco-system, not due to restarting).
2) Reliable - if a checkpoint succeeds, then restart from it is
guaranteed to succeed too (for container-c/r).
3) Preemptive - works without requiring that checkpointed processes
be scheduled to run (and thus "collaborate")
4) Complete - covers all visible and hidden state in the kernel
about processes (even if not directly visible to userspace)
5) Efficient - can be optimized along multiple axes: _zero_ impact on
runtime, low downtime during checkpoint, partial and incremental
checkpoint, live-migration, etc.
6) Flexible - can integrate nicely with different userspace "gluing"
methods.
7) Maintainable - a small part of the code refactors kernel code
so that it can be reused in restart; the rest is new code that in
our experience rarely changes. The same holds for the image format.
What linux-cr _does not_ do in the kernel, nor plans to support, is:
1) Hardware devices: their state is per-device/vendor. Instead one
should use virtual devices (VNC for display, pulseaudio for sound,
screen for ttys), or have userspace glue to restore the state of
the device. That said, in the future vendors may opt to provide
logic for c/r in drivers, e.g. ->checkpoint, ->restart methods.
2) Userspace glue: (as defined for standalone-c/r above) the kernel
knows about processes and their state, not about their intentions.
We leave that for userspace.
3) External dependencies: (outside of the local host) the kernel does
not control what's outside the host. That is the responsibility of
userspace. (Even with live-migration, linux-cr only restores
the local state of the TCP connections.)
Oren.
Hi,
[continuation of discussion of kernel vs userspace c/r approach]
part I: perspective on the types and scopes of c/r under discussion
part II: linux-cr design and objectives
part III: comparison of the kernel and userspace approaches
PART III: ==SOME TECHNICAL ASPECTS==
Some important background on the userspace approach (DMTCP as the
example) before presenting a comparison between the kernel and
userspace approaches:
DMTCP has two components: 1) a c/r-engine to save/restore process
state, and 2) glue to restart processes out of their original context.
They are _orthogonal_: the glue can be used with other c/r-engines,
like linux-cr. This discussion refers to the c/r-engine _only_.
Focusing on the c/r-engine of DMTCP - it uses syscall interposition
for three reasons:
1) To take control of processes at checkpoint
2) To always track the state of resources not visible to userspace
3) To virtualize identifiers after restart
#1 is needed because processes save their own state (and need to run
the checkpoint code for that).
#2 is needed because the kernel does not expose all state, and #3 is
needed because the kernel does not give ways to restore all state. So
these two pieces of logic mirror, in userspace, functionality that
already exists in the kernel.
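(For readers unfamiliar with the mechanism, a minimal sketch of such
interposition - illustrative only, not DMTCP's actual code; the
record_watch() bookkeeping helper is hypothetical. Built as a shared
object and activated via LD_PRELOAD, the wrapper records state the
kernel won't re-expose, then calls the real libc function:)

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdint.h>
    #include <sys/inotify.h>

    typedef int (*iaw_fn)(int, const char *, uint32_t);

    /* Hypothetical bookkeeping helper kept by the c/r engine. */
    extern void record_watch(int fd, int wd, const char *path, uint32_t mask);

    int inotify_add_watch(int fd, const char *path, uint32_t mask)
    {
            static iaw_fn real;
            int wd;

            if (!real)
                    real = (iaw_fn)dlsym(RTLD_NEXT, "inotify_add_watch");
            wd = real(fd, path, mask);
            if (wd >= 0)
                    record_watch(fd, wd, path, mask);  /* remember for restart */
            return wd;
    }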
The main advantages of the approach: (a) portability to other systems
(like BSD), though with considerable effort; (b) it's "good enough" for
several use-cases, without kernel changes.
Putting the c/r-engine in the kernel provides many advantages, which I
summarize in the following table:
category        linux-cr                          userspace
------------------------------------------------------------------------------
PERFORMANCE     has _zero_ runtime overhead       visible overhead due to syscall
                                                  interposition and state
                                                  tracking, even w/o checkpoints

OPTIMIZATIONS   many optimizations possible       limited, less effective,
                only in kernel, for downtime,     w/ much larger overhead
                image size, live-migration

OPERATION       applications run unmodified       to do c/r, needs a 'controller'
                                                  task (launches and manages the
                                                  _entire_ execution) - a point
                                                  of failure; restricts how a
                                                  system is used

PREEMPTIVE      checkpoint at any time, using     processes must be runnable and
                an auxiliary task to save         "collaborate" for checkpoint;
                state; non-intrusive: failure     long task coordination time
                does not impact checkpointees     with many tasks/threads;
                                                  alters state of checkpointee
                                                  if it fails; e.g. cannot
                                                  checkpoint when in vfork(),
                                                  ptrace states, etc.

COVERAGE        save/restore _all_ task state;    needs a new ABI for everything:
                identify shared resources;        to expose state and provide
                extend easily for new kernel      means to restore it (e.g. TCP
                features                          options negotiated with peers)

RELIABILITY     checkpoint w/ a single syscall;   non-atomic; cannot find leaks
                atomic operation; guaranteed      to determine restartability
                restartability for containers

USERSPACE GLUE  possible                          possible

SECURITY        root and non-root modes;          root and non-root modes
                native support for LSM

MAINTENANCE     changes mainly for features       changes mainly for features;
                                                  must create a new ABI for
                                                  each feature
I'm not saying Gene's work isn't good - on the contrary, it's a fine
piece of engineering. However, the part of it that does c/r poses many
constraints that limit the generality, mode of use, and performance of
the whole. That may be enough for Tejun, for your cluster - but not
for other users of the technology.
And by all means, I intend to cooperate with Gene to see how to
make the other part of DMTCP, namely the userspace "glue", work on
top of linux-cr to have the benefits of all worlds!
All in all, kernel c/r is far more generic and less restrictive than
userspace, can provide nice guarantees, and has superior performance.
It can do everything a userspace c/r can do, and much more - and
that "much more" is crucial for important use cases.
Last word about maintenance - once the core code is in mainline (which
means a code "spike"), experience (both kernel and userspace) shows
that both the code and the image format hardly change. The format is
tied to a specific set of features supported (i.e. kernel versions) so
that the kernel does not need to maintain backward compatibility.
Thanks,
Oren
[[apologies for the silly prefix on the last two posts - a combination
of windows, putty, pine and a slow connection is not helping me :( ]]
Hello,
On 11/20/2010 07:15 PM, Oren Laadan wrote:
>
> [[apologies for the silly prefix on last two posts - a combination
> of windows, putty, pine andslow connection is not helping me :( ]]
Maybe it's a good idea to post a clean concatenated version for later
reference?
Thanks.
--
tejun
In this post, Kapil and I will provide our own summary of how we
see the issues for discussion so far. In the next post, we'll reply
specifically to comment on Oren's table of comparison between
linux-cr and userspace.
In general, we'd like to add that the conversation with Oren was very
useful for us, and I think Oren will also agree that we were able to
converge on the purely technical questions.
Concerning opinions, we want to be cautious, since we're still
learning the context of this ongoing discussion on LKML. There is
probably still some context that we're missing.
Below, we'll summarize the four major questions that we've understood from
this discussion so far. But before doing so, I want to point out that a single
process or process tree will always have many possible interactions with
the rest of the world. Within our own group, we have an internal slogan:
"You can't checkpoint the world."
A virtual machine can have a relatively closed world, which makes it more
robust, but checkpointing will always have some fragile parts.
We give four examples below:
a. time virtualization
b. external database
c. NSCD daemon
d. screen and other full-screen text programs
These are not the only examples of difficult interactions with the
rest of the world.
Anyway, in my opinion, the conversation with Oren seemed to converge
on two larger questions:
1. In a pure userland C/R like DMTCP, how many corner cases are not handled,
or could not be handled, in a pure userland approach?
Also, how important are those corner cases? Do some
have important use cases that rise above just a corner case?
[ inotify is one of those examples. For DMTCP to support this,
it would have to put wrappers around inotify_add_watch,
inotify_rm_watch, read, etc., and maybe even tracking inodes in case
the file had been renamed after the inotify_add_watch. Something
could be made to work for the common cases, but it would
still be a hack --- to be done only if a use case demands it. ]
2. In a Linux C/R approach, it's already recognized that one needs
a userland component (for example, for convenience of recreating
the process tree on restart). How many other cases are there
that require a userland component?
[ One example here is the shared memory segment of NSCD, which
has to be re-initialized on restart. Another example is
a screen process that talks to an ANSI terminal emulator
(e.g. gnome-terminal), which talks to an X server or VNC server.
Below, we discuss these examples in more detail. ]
One can add a third and fourth question here:
3. [Originally posed by Oren] Given Linux C/R, how much work would
it be to add the higher layers of DMTCP on top of Linux C/R?
[ This is a non-trivial question. As just one example, DMTCP
handles sockets uniformly, regardless of whether they
are intra-host or inter-host. Linux C/R handles certain
types of intra-host sockets. So, merging the two would
require some thought. ]
4. [Originally posed by Tejun, e.g. Fri Nov 19 2010 - 09:04:42 EST]
Given that DMTCP checkpoints many common applications, how much work
would it be to add a small number of restricted kernel interfaces
to enable one to remove some of the hacks in DMTCP, and to cover
the more important corner cases that DMTCP might be missing?
I'd also like to add some points of my own here. First, there are certain
cases where I believe that a checkpoint-restart system (in-kernel
or userland or hybrid) can never be completely transparent. It's because you
can't completely cut the connection with the rest of the world. In these
examples, I'm thinking primarily of the Linux C/R mode used to checkpoint
a tree of processes.
To the extent that Linux C/R is used with containers, it seems
to me to be closer to lightweight virtualization. From there, I've
seen that the conversation goes to comparing lightweight virtualization
versus traditional virtual machines, but that discussion goes beyond my
own personal expertise.
Here are some examples where I believe every checkpointing system
would suffer from the syndrome of trying to "checkpoint the world".
1. Time virtualization --- Right now, neither system does time virtualization.
Both systems could do it. But what is the right policy?
For example, one process may set a deadline for a task an hour
in the future, and then periodically poll the kernel for the current time
to see if one hour has passed. This use case seems to require time
virtualization.
A second process wants to know the current day and time, because a certain
web service updates its information at midnight each day. This use
case seems to argue that time virtualization is bad.
2. External database file on another host --- It's not possible to
checkpoint the remote database file. In our work with the Condor developers,
they asked us to add a "Condor mode", which says that if there are any
external socket connections, then delay the checkpoint until the external
socket connections are closed. In a different joint project with CERN (Geneva),
we considered a checkpointing application in which an application
saves much of the database, and then on restart, discovers how much
of its data is stale, and re-loads only the stale portion.
3. NSCD (Network Services Caching Daemon) --- Glibc arranges for
certain information to be cached in the NSCD. The information is
in a memory segment shared between the NSCD and the application.
Upon restart, the application doesn't know that the memory segment
is no longer shared with the NSCD, or that the information is stale.
The DMTCP "hack" is to zero out this memory page on restart. Then glibc
recognizes that it needs to create a new shared memory segment.
4. screen --- The screen application sets the scrolling region of
its ANSI terminal emulator, in order to create a status line
at the bottom, while scrolling the remaining lines of the terminal.
Upon restart, screen assumes that the scrolling region
has already been set up, and doesn't have to be re-initialized.
So, on restart, DMTCP uses SIGWINCH to fool screen (or any
full-screen text-based application) into believing that its
window size has been changed. So, screen (or vim, or emacs)
then re-initializes the state of its ANSI terminal, including
scrolling regions and so on.
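(The NSCD "hack" from example 3, sketched. It assumes the address and
size of the segment that used to be shared with nscd were recorded from
/proc/self/maps at checkpoint time; zeroing it makes glibc see an
uninitialized map and re-create it:)

    #include <string.h>

    /* At restart: make the stale nscd cache look uninitialized. */
    static void unshare_nscd(void *nscd_map, size_t len)
    {
            memset(nscd_map, 0, len);
    }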
So, a userland component is helpful in doing the kind of hacks above.
I recognize that the Linux C/R team agrees that some userland component
can be useful. I just want to show why some userland hacks will always be
needed. Let's consider a pure in-kernel approach to checkpointing 'screen'
(or almost any full-screen application that uses a status bar at the bottom).
Screen sets the scrolling region of an ANSI terminal emulator,
which might be a gnome-terminal. So, a pure in-kernel approach
needs to also checkpoint the gnome-terminal. But the gnome-terminal
needs to talk to an X server. So, now one also needs to start
up inside a VNC server to emulate the X server. So, either
one adds a "hack" in userland to force screen to re-initialize
its ANSI terminal emulator, or else one is forced to include
an entire VNC server just to checkpoint a screen process.
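(And the SIGWINCH trick from example 4, sketched: restore the saved
window size on the new tty, then tell the restarted full-screen program
that its window "changed" so it repaints and re-initializes its
scrolling regions itself:)

    #include <signal.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <termios.h>

    /* Fool a restarted full-screen app into redrawing its terminal. */
    static void fake_winch(pid_t pid, int tty_fd, const struct winsize *saved)
    {
            ioctl(tty_fd, TIOCSWINSZ, saved);   /* put back rows/cols */
            kill(pid, SIGWINCH);                /* "your window changed" */
    }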
Finally, the excerpt below from Tejun's post sums up our views too. We
don't have the kernel expertise of the people on this list, but we've
had to do a bit of reading of the kernel code where the documentation
was sparse, and in teaching O/S courses. We would certainly be very
happy to work closely with the kernel developers, if there was interest
in extending DMTCP to directly use more kernel support.
- Gene and Kapil
Tejun Heo wrote Fri Nov 19 2010 - 09:04:42 EST
> What's so wrong with Gene's work? Sure, it has some hacky aspects but
> let's fix those up. To me, it sure looks like a much saner and more
> manageable approach than in-kernel CR. We can add nested ptrace,
> CLONE_SET_PID (or whatever) in pidns, integrate it with various ns
> supports, add an ability to adjust brk, export inotify state via
> fdinfo and so on.
>
> The thing is already working, the codebase of the core part is fairly
> small, and Condor is contemplating integrating it, so at least some
> people in the HPC segment think it's already viable. Maybe the HPC
> cluster I'm currently sitting near is a special case but people here
> really don't run very fancy stuff. In most cases, they're fairly
> simple (from a system POV) C programs reading/writing data and burning
> a _LOT_ of CPU cycles in between, and admins here seem to think dmtcp
> integrated with Condor would work well enough for them.
>
> Sure, in-kernel CR has better or more reliable coverage now but by how
> much? The basic things are already there in userland.
As Kapil and I wrote before, we benefited greatly from having talked with Oren,
and learning some more about the context of the discussion. We were able
to understand better the good technical points that Oren was making.
Since the comparison table below concerns DMTCP, we'd like to
state some additional technical points that could affect the conclusions.
> category linux-cr userspace
> --------------------------------------------------------------------------------
> PERFORMANCE has _zero_ runtime overhead visible overhead due to syscalls
> interposition and state tracking
> even w/o checkpoints;
In our experiments so far, the overhead of system calls has been
unmeasurable. We never wrap read() or write(), in order to keep overhead low.
We also never wrap pthread synchronization primitives such as locks,
for the same reason. The other system calls are used much less often, and so
the overhead has been too small to measure in our experiments.
> OPTIMIZATIONS many optimizations possible limited, less effective
> only in kernel, for downtime, w/ much larger overhead.
> image size, live-migration
As above, we believe that the overhead while running is negligible. I'm
assuming that image size refers to in-kernel advantages for incremental
checkpointing. This is useful for apps where the modified pages tend
not to dominate. We agree with this point. As an orthogonal point,
by default DMTCP compresses all checkpoint images using gzip on the fly.
This is useful even when most pages are modified between checkpoints.
Still, as Oren writes, Linux C/R could also add a userland component
to compress checkpoint images on the fly.
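(A minimal sketch of that kind of on-the-fly compression, under the
assumption that the engine exposes a plain fd to write image bytes to;
error handling omitted. A gzip child compresses everything written to
the returned pipe end into 'outfile':)

    #include <fcntl.h>
    #include <unistd.h>

    static int gzip_image_fd(const char *outfile)
    {
            int p[2];

            if (pipe(p) < 0)
                    return -1;
            if (fork() == 0) {
                    int out = open(outfile, O_WRONLY | O_CREAT | O_TRUNC, 0600);

                    dup2(p[0], 0);          /* stdin:  raw image bytes */
                    dup2(out, 1);           /* stdout: compressed file */
                    close(p[0]);
                    close(p[1]);
                    execlp("gzip", "gzip", "-c", (char *)0);
                    _exit(127);
            }
            close(p[0]);
            return p[1];    /* the checkpointer writes the image here */
    }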
Next, live migration is a question that we simply haven't thought much
about. If it's important, we could think about what userland approaches might
exist, but we have no near-term plans to tackle live migration.
> OPERATION applications run unmodified to do c/r, needs 'controller'
> task (launch and manage _entire_
> execution) - point of failure.
> restricts how a system is used.
We'd like to clarify what may be some misconceptions. The DMTCP
controller does not launch or manage any tasks. The DMTCP controller
is stateless, and is only there to provide a barrier, namespace server,
and a single point of contact to relay ckpt/restart commands. Recall
that the DMTCP controller handles processes across hosts --- not just
on a single host.
Also, in any computation involving multiple processes, _every_ process
of the computation is a point of failure. If any process of the computation
dies, then the simple application strategy is to give up and revert to an
earlier checkpoint. There are techniques by which an app or DMTCP can
recreate certain failed processes. DMTCP doesn't currently recreate
a dead controller (no demand for it), but it's not hard to do technically.
> PREEMPTIVE checkpoint at any time, use processes must be runnable and
> auxiliary task to save state; "collaborate" for checkpoint;
> non-intrusive: failure does long task coordination time
> not impact checkpointees. with many tasks/threads. alters
> state of checkpointee if fails.
> e.g. cannot checkpoint when in
> vfork(), ptrace states, etc.
Our current support of vfork and ptrace has some of the issues that Oren points
out. One example occurs if a process is in the kernel, and a ptrace state has
changed. If it was important for some application, we would either have
to think of some "hack", or follow Tejun's alternative suggestion to work
with the developers to add further kernel support. The kernel developers
on this list can estimate the difficulties of kernel support better than I can.
> COVERAGE save/restore _all_ task state; needs new ABI for everything:
> identify shared resources; can expose state, provide means to
> extend for new kernel features restore state (e.g. TCP protocol
> easily options negotiated with peers)
Currently, the only kernel support used by DMTCP is system calls (wrappers),
/proc/*/fd, /proc/*/maps, /proc/*/cmdline, /proc/*/exe, /proc/*/stat. (I think
I've named them all now.) The kernel developers will know better
than us what other kernel state one might want to support for C/R, and what
types of applications would need that.
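As an illustration of how far those interfaces already go, here is a
sketch (illustrative only) of enumerating a process's memory regions from
/proc/*/maps, which is essentially the first step a userland checkpointer
takes before dumping memory:

    /* list the memory mappings of the current process via /proc */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/self/maps", "r");
        char line[4352];

        if (!f) {
            perror("fopen");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            unsigned long start, end;
            char perms[5], path[4096];

            path[0] = '\0';
            /* format: start-end perms offset dev inode [path] */
            if (sscanf(line, "%lx-%lx %4s %*s %*s %*s %4095[^\n]",
                       &start, &end, perms, path) >= 3)
                /* a checkpointer dumps private writable regions
                 * and records file-backed ones by path */
                printf("%lx-%lx %s %s\n", start, end, perms, path);
        }
        fclose(f);
        return 0;
    }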
> RELIABILITY checkpoint w/ single syscall; non-atomic, cannot find leaks
> atomic operation. guaranteed to determine restartability
> restartability for containers
My understanding is that the guarantees apply for Linux containers, but not
for a tree of processes. Does this imply that linux-cr would have some
of the same reliability issues as DMTCP for a tree of processes? (I mean
the question sincerely, and am not intending to be rude.) In any case,
won't DMTCP and Linux C/R have to handle orthogonal reliability issues
such as external database, time virtualization, and other examples
from our previous post?
> USERSPACE GLUE possible possible
>
> SECURITY root and non-root modes root and non-root modes
> native support for LSM
>
> MAINTENANCE changes mainly for features changes mainly for features;
> create new ABI for features
> And by all means, I intend to cooperate with Gene to see how to
> make the other part of DMTCP, namely the userspace "glue", work on
> top of linux-cr to have the benefits of all worlds !
This is true, and we strongly welcome the cooperation. We don't know how
this experiment will turn out, but the only way to find out is to sincerely
try it. Whether we succeed or fail, we will learn something either way!
- Gene and Kapil
On Sun, Nov 21, 2010 at 03:18:53AM -0500, Gene Cooperman wrote:
> In this post, Kapil and I will provide our own summary of how we
> see the issues for discussion so far. In the next post, we'll reply
> specifically to comment on Oren's table of comparison between
> linux-cr and userspace.
>
> In general, we'd like to add that the conversation with Oren was very
> useful for us, and I think Oren will also agree that we were able to
> converge on the purely technical questions.
Hi Gene,
Thanks for the good summary, it helps. Some random comments below...
>
> Concerning opinions, we want to be cautious, since we're
> still learning the context of this ongoing discussion on LKML. There is
> probably still some context that we're missing.
>
> Below, we'll summarize the four major questions that we've understood from
> this discussion so far. But before doing so, I want to point out that a single
> process or process tree will always have many possible interactions with
> the rest of the world. Within our own group, we have an internal slogan:
> "You can't checkpoint the world."
> A virtual machine can have a relatively closed world, which makes it more
> robust, but checkpointing will always have some fragile parts.
> We give four examples below:
> a. time virtualization
> b. external database
> c. NSCD daemon
> d. screen and other full-screen text programs
> These are not the only examples of difficult interactions with the
> rest of the world.
>
> Anyway, in my opinion, the conversation with Oren seemed to converge
> into two larger questions:
> 1. In a pure userland C/R like DMTCP, how many corner cases are not handled,
> or could not be handled, in a pure userland approach?
> Also, how important are those corner cases? Do some
> have important use cases that rise above just a corner case?
> [ inotify is one of those examples. For DMTCP to support this,
> it would have to put wrappers around inotify_add_watch,
> inotify_rm_watch, read, etc., and maybe even tracking inodes in case
> the file had been renamed after the inotify_add_watch. Something
> could be made to work for the common cases, but it would
> still be a hack --- to be done only if a use case demands it. ]
> 2. In a Linux C/R approach, it's already recognized that one needs
> a userland component (for example, for convenience of recreating
> the process tree on restart). How many other cases are there
> that require a userland component?
> [ One example here is the shared memory segment of NSCD, which
> has to be re-initialized on restart. Another example is
> a screen process that talks to an ANSI terminal emulator
> (e.g. gnome-terminal), which talks to an X server or VNC server.
> Below, we discuss these examples in more detail. ]
>
> One can add a third and fourth question here:
>
> 3. [Originally posed by Oren] Given Linux C/R, how much work would
> it be to add the higher layers of DMTCP on top of Linux C/R?
> [ This is a non-trivial question. As just one example, DMTCP
> handles sockets uniformly, regardless of whether they
> are intra-host or inter-host. Linux C/R handles certain
> types of intra-host sockets. So, merging the two would
> require some thought. ]
> 4. [Originally posed by Tejun, e.g. Fri Nov 19 2010 - 09:04:42 EST]
> Given that DMTCP checkpoints many common applications, how much work
> would it be to add a small number of restricted kernel interfaces
> to enable one to remove some of the hacks in DMTCP, and to cover
> the more important corner cases that DMTCP might be missing?
>
>
> I'd also like to add some points of my own here. First, there are certain
> cases where I believe that a checkpoint-restart system (in-kernel
> or userland or hybrid) can never be completely transparent. It's because you
> can't completely cut the connection with the rest of the world. In these
> examples, I'm thinking primarily of the Linux C/R mode used to checkpoint
> a tree of processes.
> To the extent that Linux C/R is used with containers, it seems
> to me to be closer to lightweight virtualization. From there, I've
> seen that the conversation goes to comparing lightweight virtualization
> versus traditional virtual machines, but that discussion goes beyond my
> own personal expertise.
At the risk of restating already applied arguments, and as a c/r
outsider, this touches on the real crux of the issue for me. What is
the complete set of boundaries between a c/r group of processes and
the outside world? Is it bounded and is it understandable by mere
kernel engineers? Does it change the assumptions about what a Linux
process /is/, and how to handle it? How much? The broad strokes seem
to be straightforward, but as already pointed out, the devil is in
the details.
> Here are some examples where I believe every checkpointing system
> would suffer from the syndrome of trying to "checkpoint the world".
>
> 1. Time virtualization --- Right now, neither system does time virtualization.
> Both systems could do it. But what is the right policy?
> For example, one process may set a deadline for a task an hour
> in the future, and then periodically poll the kernel for the current time
> to see if one hour has passed. This use case seems to require time
> virtualization.
> A second process wants to know the current day and time, because a certain
> web service updates its information at midnight each day. This use case
> seems to argue that time virtualization is bad.
Temporal issues need to be (are being?) addressed regardless. In
certain respects, I'm sure c/r can be seen as a *really long*
scheduler latency, and would have the same effect as a system going
into suspend, or a vm-level checkpoint. I would think the same
behaviour would be desirable in all cases, including c/r.
> 2. External database file on another host --- It's not possible to
> checkpoint the remote database file. In our work with the Condor developers,
> they asked us to add a "Condor mode", which says that if there are any
> external socket connections, then delay the checkpoint until the external
> socket connections are closed. In a different joint project with CERN (Geneva),
> we considered a checkpointing application in which an application
> saves much of the database, and then on restart, discovers how much
> of its data is stale, and re-loads only the stale portion.
>
> 3. NSCD (Network Services Caching Daemon) --- Glibc arranges for
> certain information to be cached in the NSCD. The information is
> in a memory segment shared between the NSCD and the application.
> Upon restart, the application doesn't know that the memory segment
> is no longer shared with the NSCD, or that the information is stale.
> The DMTCP "hack" is to zero out this memory page on restart. Then glibc
> recognizes that it needs to create a new shared memory segment.
Right here is exactly the example of a boundary that needs explicit
rules. When a pair of processes have a shared region, and only one of
them is checkpointed, then what is the behaviour on restore? In this
specific example, a context-specific hack is used to achieve the
desired result, but that doesn't work (as I believe you agree) in the
general case. What behaviour will in-kernel support need to enforce?
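(For readers following along: the hack being described presumably boils
down to something like the sketch below - hypothetical code, not DMTCP's
actual implementation, and the path match is purely illustrative. Run at
restart time, it finds the nscd shared mapping and zeroes it so glibc
re-creates the segment.)

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    static void zero_nscd_segments(void)
    {
        FILE *f = fopen("/proc/self/maps", "r");
        char line[4352];

        if (!f)
            return;
        while (fgets(line, sizeof(line), f)) {
            unsigned long start, end;

            if (!strstr(line, "/nscd/"))
                continue;
            if (sscanf(line, "%lx-%lx", &start, &end) != 2)
                continue;
            /* the nscd map is normally read-only; make it writable,
             * then zero it so glibc allocates a fresh segment */
            if (mprotect((void *)start, end - start,
                         PROT_READ | PROT_WRITE) == 0)
                memset((void *)start, 0, end - start);
        }
        fclose(f);
    }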
> 4. screen --- The screen application sets the scrolling region of
> its ANSI terminal emulator, in order to create a status line
> at the bottom, while scrolling the remaining lines of the terminal.
> Upon restart, screen assumes that the scrolling region
> has already been set up, and doesn't have to be re-initialized.
> So, on restart, DMTCP uses SIGWINCH to fool screen (or any
> full-screen text-based application) into believing that its
> window size has been changed. So, screen (or vim, or emacs)
> then re-initializes the state of its ANSI terminal, including
> scrolling regions and so on.
> So, a userland component is helpful in doing the kind of hacks above.
> I recognize that the Linux C/R team agrees that some userland component
> can be useful. I just want to show why some userland hacks will always be
> needed. Let's consider a pure in-kernel approach to checkpointing 'screen'
> (or almost any full-screen application that uses a status bar at the bottom).
> Screen sets the scrolling region of an ANSI terminal emulator,
> which might be a gnome-terminal. So, a pure in-kernel approach
> needs to also checkpoint the gnome-terminal. But the gnome-terminal
> needs to talk to an X server. So, now one also needs to start
> up inside a VNC server to emulate the X server. So, either
> one adds a "hack" in userland to force screen to re-initialize
> its ANSI terminal emulator, or else one is forced to include
> an entire VNC server just to checkpoint a screen process. ]
>
> Finally, this excerpt below from Tejun's post sums up our views too. We don't
> have the kernel expertise of the people on this list, but we've had
> to do a little bit of reading of the kernel code where the documentation
> was sparse, and in teaching O/S. We would certainly be very happy to work
> closely with the kernel developers, if there was interest in extending
> DMTCP to directly use more kernel support.
>
> - Gene and Kapil
>
> Tejun Heo wrote Fri Nov 19 2010 - 09:04:42 EST
> > What's so wrong with Gene's work? Sure, it has some hacky aspects but
> > let's fix those up. To me, it sure looks like much saner and
> > manageable approach than in-kernel CR. We can add nested ptrace,
> > CLONE_SET_PID (or whatever) in pidns, integrate it with various ns
> > supports, add an ability to adjust brk, export inotify state via
> > fdinfo and so on.
> >
> > The thing is already working, the codebase of core part is fairly
> > small and condor is contemplating integrating it, so at least some
> > people in HPC segment think it's already viable. Maybe the HPC
> > cluster I'm currently sitting near is a special case but people here
> > really don't run very fancy stuff. In most cases, they're fairly
> > simple (from system POV) C programs reading/writing data and burning a
> > _LOT_ of CPU cycles in between, and admins here seem to think dmtcp
> > integrated with condor would work well enough for them.
> >
> > Sure, in-kernel CR has better or more reliable coverage now but by how
> > much? The basic things are already there in userland.
On Sat, 20 Nov 2010, Tejun Heo wrote:
> Hello,
>
> On 11/20/2010 07:15 PM, Oren Laadan wrote:
> >
> > [[apologies for the silly prefix on last two posts - a combination
> > of windows, putty, pine and slow connection is not helping me :( ]]
>
> Maybe it's a good idea to post a clean concatenated version for later
> reference?
>
Sure, as soon as I am back on a sane connection (~1 week)
(I cut it in three to make it easier for people to digest ...)
Oren.
On Sun, 21 Nov 2010, Gene Cooperman wrote:
> Below, we'll summarize the four major questions that we've understood from
> this discussion so far. But before doing so, I want to point out that a single
> process or process tree will always have many possible interactions with
> the rest of the world. Within our own group, we have an internal slogan:
> "You can't checkpoint the world."
> A virtual machine can have a relatively closed world, which makes it more
> robust, but checkpointing will always have some fragile parts.
That depends on your definition of "world". One definition
is "world := VM", as you state above. Another is "world := container",
which I stated in my post(s). You can checkpoint both.
For those cases where the "world" cannot be fully checkpointed,
I explicitly pointed that we should focus on the core c/r
functionality, because the "glue" can be done either way.
> We give four examples below:
> a. time virtualization
IMHO, irrelevant to the current discussion. And btw, this is done in
linux-cr for live migration of tcp connections.
> b. external database
> c. NSCD daemon
This falls within the category of "glue", and is - as I try once
again to remind - entirely orthogonal to the topic of where
to do c/r.
> d. screen and other full-screen text programs
> These are not the only examples of difficult interactions with the
> rest of the world.
This actually never required a userspace "component" with Zap
or linux-cr (to the best of my knowledge).
Even if it did - the question is not how to deal with "glue"
(you demonstrated quite well how to do that with DMTCP), but
how should the basic, core c/r functionality work - which is
below, and orthogonal to the "glue".
Let us please focus on the base c/r engine functionality...
(gotta disconnect now .. more later)
Oren.
Gene Cooperman [[email protected]] wrote:
| > RELIABILITY checkpoint w/ single syscall; non-atomic, cannot find leaks
| > atomic operation. guaranteed to determine restartability
| > restartability for containers
|
| My understanding is that the guarantees apply for Linux containers, but not
| for a tree of processes. Does this imply that linux-cr would have some
| of the same reliability issues as DMTCP for a tree of processes? (I mean
| the question sincerely, and am not intending to be rude.) In any case,
| won't DMTCP and Linux C/R have to handle orthogonal reliability issues
| such as external database, time virtualization, and other examples
| from our previous post?
Yes, if the user attempts to checkpoint a partial container (what we refer
to as a process subtree), or fails to snapshot/restore the filesystem, there
could be leaks that we cannot detect.
But one guarantee we are trying to provide is that if the user checkpoints
a _complete_ container, then we will detect a leak if one exists.
Is there a way to establish a set of constraints (e.g. run the application in
a container, snapshot/restore the filesystem) and then provide leak detection
with a pure userspace implementation ?
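(Concretely, the closest pure-userspace approximation I can imagine is
something like the sketch below - hypothetical code - which walks
/proc/<pid>/fd for each task in the set and records what each fd resolves
to. The fundamental gap is that it can only compare references _within_
the set; it cannot see a reference held by a task outside the set, which
is exactly what the in-kernel check can rule out.)

    #include <dirent.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    static void scan_task_fds(pid_t pid)
    {
        char dir[64], path[128];
        struct dirent *de;
        DIR *d;

        snprintf(dir, sizeof(dir), "/proc/%d/fd", (int)pid);
        d = opendir(dir);
        if (!d)
            return;
        while ((de = readdir(d)) != NULL) {
            struct stat st;

            if (de->d_name[0] == '.')
                continue;
            snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
            /* stat() follows the fd symlink to the real object */
            if (stat(path, &st) == 0)
                printf("pid %d fd %s -> dev %lu ino %lu\n",
                       (int)pid, de->d_name,
                       (unsigned long)st.st_dev,
                       (unsigned long)st.st_ino);
        }
        closedir(d);
    }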
Sukadev
On Sun, 21 Nov 2010, Gene Cooperman wrote:
> As Kapil and I wrote before, we benefited greatly from having talked with Oren,
> and from learning some more about the context of the discussion. We were able
> to understand better the good technical points that Oren was making.
> Since the comparison table below concerns DMTCP, we'd like to
> state some additional technical points that could affect the conclusions.
>
> > category linux-cr userspace
> > --------------------------------------------------------------------------------
> > PERFORMANCE has _zero_ runtime overhead visible overhead due to syscalls
> > interposition and state tracking
> > even w/o checkpoints;
>
> In our experiments so far, the overhead of system calls has been
> unmeasurable. We never wrap read() or write(), in order to keep overhead low.
> We also never wrap pthread synchronization primitives such as locks,
> for the same reason. The other system calls are used much less often, and so
> the overhead has been too small to measure in our experiments.
Syscall interception will have a visible effect on applications that
use those syscalls. You may not observe overhead with HPC ones,
but do you have numbers on server apps ? Apps that use fork/clone
and pipes extensively ? Thread benchmarks, etc. ? Compare that
to the absolute zero overhead of linux-cr.
>
> > OPTIMIZATIONS many optimizations possible limited, less effective
> > only in kernel, for downtime, w/ much larger overhead.
> > image size, live-migration
>
> As above, we believe that the overhead while running is negligible. I'm
For the HPC apps that you use.
> assuming that image size refers to in-kernel advantages for incremental
> checkpointing. This is useful for apps where the modified pages tend
> not to dominate. We agree with this point. As an orthogonal point,
> by default DMTCP compresses all checkpoint images using gzip on the fly.
> This is useful even when most pages are modified between checkpoints.
> Still, as Oren writes, Linux C/R could also add a userland component
> to compress checkpoint images on the fly.
This is not "userland component", it's "checkpoint | gzip > image.out"...
> Next, live migration is a question that we simply haven't thought much
> about. If it's important, we could think about what userland approaches might
> exist, but we have no near-term plans to tackle live migration.
As it is, live-migration _is_ a very important use case.
>
> > OPERATION applications run unmodified to do c/r, needs 'controller'
> > task (launch and manage _entire_
> > execution) - point of failure.
> > restricts how a system is used.
>
> We'd like to clarify what may be some misconceptions. The DMTCP
> controller does not launch or manage any tasks. The DMTCP controller
> is stateless, and is only there to provide a barrier, namespace server,
> and single point of contact to relay ckpt/restart commands. Recall that
> the DMTCP controller handles processes across hosts --- not just on a
> single host.
The controller is another point of failure. I already pointed out that
the (controlled) application crashes when your controller dies, and
you mentioned it's a bug that should be fixed. But then there will always
be a risk for another, and another ... You also mentioned that if the
controller dies, then the app should continue to run, but will not be
checkpointable anymore (IIUC).
The point is that the controller is another point of failure, and makes
the execution/checkpoint intrusive. It also adds security and
user-management issues as you'll need one (or more ?) controller per user
(right now, it's one for all, no ?), and so on.
Plus, because the restarted apps get their virtualized IDs from the
controller, they can't "see" existing/new processes that
may get the "same" pids (virtualization is not in the kernel).
> Also, in any computation involving multiple processes, _every_ process
> of the computation is a point of failure. If any process of the computation
> dies, then the simple application strategy is to give up and revert to an
> earlier checkpoint. There are techniques by which an app or DMTCP can
> recreate certain failed processes. DMTCP doesn't currently recreate
> a dead controller (no demand for it), but it's not hard to do technically.
The point is that you _add_ a point of failure: you make the "checkpoint"
operation a possible reason for the application to crash. In contrast, in
linux-cr the checkpoint is idempotent - harmless because it does not
make the applications execute. Instead, it merely observes their state.
> > PREEMPTIVE checkpoint at any time, use processes must be runnable and
> > auxiliary task to save state; "collaborate" for checkpoint;
> > non-intrusive: failure does long task coordination time
> > not impact checkpointees. with many tasks/threads. alters
> > state of checkpointee if fails.
> > e.g. cannot checkpoint when in
> > vfork(), ptrace states, etc.
>
> Our current support of vfork and ptrace has some of the issues that Oren points
> out. One example occurs if a process is in the kernel, and a ptrace state has
> changed. If it was important for some application, we would either have
> to think of some "hack", or follow Tejun's alternative suggestion to work
> with the developers to add further kernel support. The kernel developers
> on this list can estimate the difficulties of kernel support better than I can.
>
> > COVERAGE save/restore _all_ task state; needs new ABI for everything:
> > identify shared resources; can expose state, provide means to
> > extend for new kernel features restore state (e.g. TCP protocol
> > easily options negotiated with peers)
>
> Currently, the only kernel support used by DMTCP is system calls (wrappers),
> /proc/*/fd, /proc/*/maps, /proc/*/cmdline, /proc/*/exe, /proc/*/stat. (I think
> I've named them all now.) The kernel developers will know better
> than us what other kernel state one might want to support for C/R, and what
> types of applications would need that.
>
> > RELIABILITY checkpoint w/ single syscall; non-atomic, cannot find leaks
> > atomic operation. guaranteed to determine restartability
> > restartability for containers
>
> My understanding is that the guarantees apply for Linux containers, but not
> for a tree of processes. Does this imply that linux-cr would have some
> of the same reliability issues as DMTCP for a tree of processes? (I mean
> the question sincerely, and am not intending to be rude.) In any case,
> won't DMTCP and Linux C/R have to handle orthogonal reliability issues
> such as external database, time virtualization, and other examples
> from our previous post?
There are two points in the claim above:
1) linux-cr can checkpoint with a single syscall - it's atomic. This
gives you more guarantees about the consistency of the checkpointed
application(s), and less "opportunities" for the operation as a whole to
fail.
2) restartability - for full-container checkpoint only.
There is no "reliability" issue with c/r of non-containers - it's a matter
of definition: it depends on what your requirements from the userspace
application and what sort of "glue" you have for it.
And I request again - let's leave out the questions of "time
virtualization" and "external databases" - how are they different for the
VM virtualization solution ? They are completely orthogonal to the
question we are debating.
Thanks,
Oren.
>
> > USERSPACE GLUE possible possible
> >
> > SECURITY root and non-root modes root and non-root modes
> > native support for LSM
> >
> > MAINTENANCE changes mainly for features changes mainly for features;
> > create new ABI for features
>
> > And by all means, I intend to cooperate with Gene to see how to
> > make the other part of DMTCP, namely the userspace "glue", work on
> > top of linux-cr to have the benefits of all worlds !
>
> This is true, and we strongly welcome the cooperation. We don't know how
> this experiment will turn out, but the only way to find out is to sincerely
> try it. Whether we succeed or fail, we will learn something either way!
>
> - Gene and Kapil
>
>
(Our first comment below actually replies to an earlier post by Oren. It seemed
simpler to combine our comments.)
> > d. screen and other full-screen text programs. These are not the only
> > examples of difficult interactions with the rest of the world.
>
> This actually never required a userspace "component" with Zap or linux-cr (to
> the best of my knowledge).
We would guess that Zap would not be able to support screen without a user
space component. The bug occurs when screen is configured to have a status line
at the bottom. We would be interested if you want to try it and let us know the
results.
=============================================
> > > category        linux-cr                        userspace
> > > --------------------------------------------------------------------------------
> > > PERFORMANCE     has _zero_ runtime overhead     visible overhead due to syscalls
> > >                                                 interposition and state tracking
> > >                                                 even w/o checkpoints;
> >
> > In our experiments so far, the overhead of system calls has been
> > unmeasurable. We never wrap read() or write(), in order to keep overhead
> > low. We also never wrap pthread synchronization primitives such as locks,
> > for the same reason. The other system calls are used much less often, and
> > so the overhead has been too small to measure in our experiments.
>
> Syscall interception will have a visible effect on applications that use those
> syscalls. You may not observe overhead with HPC ones, but do you have
> numbers on server apps ? Apps that use fork/clone and pipes extensively ?
> Thread benchmarks, etc. ? Compare that to the absolute zero overhead of linux-cr.
It's true that we haven't taken serious data on overhead with server apps. Is
there a particular server app that you are thinking of as an example? I would
expect fork/clone and pipes to be invoked infrequently in server apps, and so
not to add measurably to CPU time. In most server apps, such as MySQL, it is
common to maintain a pool of threads for reuse rather than to repeatedly call
clone for a new thread, precisely to ensure that the overhead of the clone
calls is not significant. I would expect a similar policy for fork and pipes.
<snip>
> > > OPERATION       applications run unmodified     to do c/r, needs 'controller'
> > >                                                 task (launch and manage _entire_
> > >                                                 execution) - point of failure.
> > >                                                 restricts how a system is used.
> >
> > We'd like to clarify what may be some misconceptions. The DMTCP controller
> > does not launch or manage any tasks. The DMTCP controller is stateless,
> > and is only there to provide a barrier, namespace server, and single point
> > of contact to relay ckpt/restart commands. Recall that the DMTCP
> > controller handles processes across hosts --- not just on a single host.
>
> The controller is another point of failure. I already pointed out that the
> (controlled) application crashes when your controller dies, and you mentioned
> it's a bug that should be fixed. But then there will always be a risk for
> another, and another ... You also mentioned that if the controller dies,
> then the app should continue to run, but will not be checkpointable anymore
> (IIUC).
>
> The point is that the controller is another point of failure, and makes the
> execution/checkpoint intrusive. It also adds security and user-management
> issues as you'll need one (or more ?) controller per user (right now, it's
> one for all, no ?), and so on.
Just to clarify, DMTCP uses one coordinator for each checkpointable
computation. A single user may be running multiple computations with one
coordinator for each computation. We don't actually use the word controller
in DMTCP terminology because the coordinator is stateless, and so it is
coordinating, but not controlling, the other processes.
> Plus, because the restarted apps get their virtualized IDs from the
> controller, they can't "see" existing/new processes that may get the
> "same" pids (virtualization is not in the kernel).
This appears to be a misconception. The wrappers within the user process
maintain the pid-translation table for that process. The table maps the
original pid assigned by the kernel to the current pid assigned by the
kernel on restart. This is handled locally and does not involve the
coordinator.
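To make the mechanism concrete, here is a deliberately simplified sketch
of the idea (hypothetical code; DMTCP's real table and wrappers are more
involved, and the real wrappers interpose the libc symbols themselves).
Each process carries its own table, and wrappers translate pids on the
way into the kernel:

    #include <signal.h>
    #include <sys/types.h>

    #define MAX_PIDS 1024

    struct pid_map { pid_t orig; pid_t curr; };
    static struct pid_map pid_table[MAX_PIDS];
    static int pid_table_len;

    static pid_t orig_to_curr(pid_t orig)
    {
        int i;

        for (i = 0; i < pid_table_len; i++)
            if (pid_table[i].orig == orig)
                return pid_table[i].curr;
        return orig;          /* not one of ours; pass through */
    }

    /* e.g. the kill() wrapper consults the local table --
     * no coordinator round-trip is involved */
    int kill_wrapper(pid_t app_pid, int sig)
    {
        return kill(orig_to_curr(app_pid), sig);
    }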
In the case of a fork, there could be a pid clash (the original pid generated
for a new process conflicting with someone else's original pid). However,
DMTCP handles this by checking within the fork wrapper for a pid clash. In
the rare case of a clash, the child process exits and the parent forks again.
The same applies to clone, and to any pid clash at restart time.
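Again as a sketch only (hypothetical, simplified), the fork-wrapper logic
amounts to the following; pid_is_taken() stands in for a lookup against
the translation table, which the child inherits from the parent, so both
sides compute the same answer:

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern int pid_is_taken(pid_t pid);   /* assumed table lookup */

    pid_t fork_with_clash_check(void)
    {
        for (;;) {
            pid_t child = fork();

            if (child < 0)
                return -1;                /* real error: give up */
            if (child == 0) {
                /* child: if our new pid collides with someone's
                 * original pid, vanish so the parent retries */
                if (pid_is_taken(getpid()))
                    _exit(99);
                return 0;                 /* child continues normally */
            }
            /* parent: same deterministic check on the same pid */
            if (pid_is_taken(child)) {
                int status;
                waitpid(child, &status, 0);   /* reap clashing child */
                continue;
            }
            return child;
        }
    }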
> > Also, in any computation involving multiple processes, _every_ process
> > of the computation is a point of failure. If any process of the
> > computation dies, then the simple application strategy is to give up
> > and revert to an earlier checkpoint. There are techniques by which an
> > app or DMTCP can recreate certain failed processes. DMTCP doesn't
> > currently recreate a dead controller (no demand for it), but it's not
> > hard to do technically.
>
> The point is that you _add_ a point of failure: you make the "checkpoint"
> operation a possible reason for the application to crash. In contrast, in
> linux-cr the checkpoint is idempotent - harmless because it does not make
> the applications execute. Instead, it merely observes their state.
We were speaking above of the case when the process dies during a
computation. We were not referring to checkpoint time.
<snip>
We would like to add our own comment/question. To set the context we quote an
earlier post:
OL> Even if it did - the question is not how to deal with "glue"
OL> (you demonstrated quite well how to do that with DMTCP), but
OL> how should the basic, core c/r functionality work - which is
OL> below, and orthogonal to the "glue".
There seems to be an implicit assumption that it is easy to separate the DMTCP
"glue code" and the DMTCP C/R engine into separate modules. DMTCP is modular,
but it splits the problem into modules along a different line than Linux C/R
does. We look forward to the joint experiment in which we would try to combine
DMTCP with Linux C/R. This will help answer the question in our minds.
In order to explore the issue, let's imagine that we have a successful merge of
DMTCP and Linux C/R. The following are some user-space glue issues. It's not
obvious to us how the merged software will handle these issues.
1. Sockets -- DMTCP handles all sockets in a common manner through a single
module. Sockets are checkpointed independently of whether they are local or
remote. In a merger of DMTCP and Linux C/R, what does Linux C/R do when it sees
remote sockets? Or should DMTCP take down all remote sockets before
checkpointing? If DMTCP has to do this, it would be less efficient than the
current design, which keeps the remote socket connections alive during
checkpoint.
2. XLib and X11-server -- Consider checkpointing a single X11 app without the
X11-server and without VNC. This is something we intend to add to DMTCP in the
next few months. We have already mapped out the design in our minds. An X11
application includes the Xlib library. The data of an X11 window is, by
default, contained in the X11 library -- not in the X11-server. The application
communicates with the X11-server using socket connections, which would be
considered a leak by Linux C/R. At restart time, DMTCP will ask the
X11-server to create a bare window and then make the appropriate Xlib call to
repaint the window based on the data stored in the Xlib library.
For checkpoint/resume, the window stays up and does not have to be repainted.
How will the combined DMTCP/Linux C/R work? Will DMTCP have to take
down the window prior to Linux C/R and paint a new window at resume time?
Doesn't this add inefficiency?
3. Checkpointing a single process (e.g. a bash shell) talking to an xterm via
a pty -- We assume that from the viewpoint of Linux C/R a pty is a leak, since
there is a second process operating the master end of the pty. In this case we
are guessing that Linux C/R would checkpoint and restart without the guarantees
of reliability. We are also guessing that Linux C/R would not save and restore
the pty, and that instead it would be the responsibility of DMTCP to restore
the current settings of the pty (e.g. packet mode vs. regular mode). Is our
understanding correct? Would this work?
Thanks,
Gene and Kapil
On Tue, 23 Nov 2010, Kapil Arya wrote:
> OL> Even if it did - the question is not how to deal with "glue"
> OL> (you demonstrated quite well how to do that with DMTCP), but
> OL> how should the basic, core c/r functionality work - which is
> OL> below, and orthogonal to the "glue".
>
> There seems to be an implicit assumption that it is easy to separate the DMTCP
> "glue code" and the DMTCP C/R engine into separate modules. DMTCP is modular,
> but it splits the problem into modules along a different line than Linux C/R
> does. We look forward to the joint experiment in which we would try to combine
> DMTCP with Linux C/R. This will help answer the question in our minds.
I apologize for being blunt - but this is probably an issue specific to
DMTCP's engineering...
> In order to explore the issue, let's imagine that we have a successful merge of
> DMTCP and Linux C/R. The following are some user-space glue issues. It's not
> obvious to us how the merged software will handle these issues.
>
> 1. Sockets -- DMTCP handles all sockets in a common manner through a single
> module. Sockets are checkpointed independently of whether they are local or
> remote. In a merger of DMTCP and Linux C/R, what does Linux C/R do when it sees
> remote sockets? Or should DMTCP take down all remote sockets before
> checkpointing? If DMTCP has to do this, it would be less efficient than the
> current design, which keeps the remote socket connections alive during
> checkpoint.
What is a "local" socket ? af_unix, or locally connected af_inet ?
Anyway, with linux-cr you'd do what's needed after the restarted tasks are
created, but before their state is restored. For each such "old" socket
that you want to replace, you'd create (in userspace, with arbitrary "glue"
code!) a new socket, and use this socket when restoring the state of the
task. Similarly, you could replace any other resource, not only sockets.
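As a sketch of such glue (hypothetical code - the address, port and fd
number are all placeholders), the restart-time hook would look roughly
like this: create a fresh connection and dup2() it onto the fd number the
restored task expects:

    #include <arpa/inet.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int replace_socket_fd(int old_fd, const char *ip,
                                 unsigned short port)
    {
        struct sockaddr_in addr;
        int s = socket(AF_INET, SOCK_STREAM, 0);

        if (s < 0)
            return -1;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1 ||
            connect(s, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
            close(s);
            return -1;
        }
        /* land the new socket on the fd the restored task will use */
        if (s != old_fd) {
            if (dup2(s, old_fd) < 0) {
                close(s);
                return -1;
            }
            close(s);
        }
        return old_fd;
    }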
>
> 2. XLib and X11-server -- Consider checkpointing a single X11 app without the
> X11-server and without VNC. This is something we intend to add to DMTCP in the
> next few months. We have already mapped out the design in our minds. An X11
> application includes the Xlib library. The data of an X11 window is, by
> default, contained in the X11 library -- not in the X11-server. The application
> communicates with the X11-server using socket connections, which would be
> considered a leak by Linux C/R. At restart time, DMTCP will ask the
> X11-server to create a bare window and then make the appropriate Xlib call to
> repaint the window based on the data stored in the Xlib library.
> For checkpoint/resume, the window stays up and does not have to be repainted.
> How will the combined DMTCP/Linux C/R work? Will DMTCP have to take
> down the window prior to Linux C/R and paint a new window at resume time?
> Doesn't this add inefficiency?
Repainting during restart is the least of your problems.
Leak detection is not a problem:
If the socket connects out of the container (like af_inet) - then it is
not a leak, and you treat it as described above.
If the socket connects within the container but you don't checkpoint the
"peer" process - then it is not a container-c/r (in which case you don't
look for leaks).
Also, the application could mark resources to not be checkpointed (e.g.
scratch memory to save storage, or sockets to not count as leaks).
I don't see any problem with X11 or any other library and "glue".
>
> 3. Checkpointing a single process (e.g. a bash shell) talking to an xterm via
> a pty -- We assume that from the viewpoint of Linux C/R a pty is a leak, since
> there is a second process operating the master end of the pty. In this case we
> are guessing that Linux C/R would checkpoint and restart without the guarantees
> of reliability. We are also guessing that Linux C/R would not save and restore
> the pty, and that instead it would be the responsibility of DMTCP to restore
> the current settings of the pty (e.g. packet mode vs. regular mode). Is our
> understanding correct? Would this work?
I explain again - in case it wasn't clear from my 3-part post: leak
detection is relevant _only_ for full container-c/r. It doesn't make
sense otherwise.
If you want to checkpoint individual components of an application,
then it's up to userspace to produce/provide the relevant "glue" to
make it "make sense" when those components restart without their
original ecosystem.
Thanks,
Oren.
Hi Oren,
On Thu, Nov 25, 2010 at 11:04:16AM -0500, Oren Laadan wrote:
> On Tue, 23 Nov 2010, Kapil Arya wrote:
>
> > OL> Even if it did - the question is not how to deal with "glue"
> > OL> (you demonstrated quite well how to do that with DMTCP), but
> > OL> how should the basic, core c/r functionality work - which is
> > OL> below, and orthogonal to the "glue".
> >
> > There seems to be an implicit assumption that it is easy to separate the DMTCP
> > "glue code" and the DMTCP C/R engine into separate modules. DMTCP is modular,
> > but it splits the problem into modules along a different line than Linux C/R
> > does. We look forward to the joint experiment in which we would try to combine
> > DMTCP with Linux C/R. This will help answer the question in our minds.
>
> I apologize for being blunt - but this is probably an issue specific to
> DMTCP's engineering...
>
I completely agree with you, Oren. DMTCP was never designed to be split
into a userland layer on top of an in-kernel C/R engine. We will want to
re-factor DMTCP to make this happen.
I'm sorry if my e-mail came off as confrontational. That was not my
intention. I was just looking forward to an interesting intellectual
experiment --- how to go about combining DMTCP and Linux C/R. I was
trying to guess ahead of time where there are interesting challenges, and
my hope is that we will find a way to solve them together.
Best wishes,
- Gene