2010-11-08 16:56:05

by Grant Likely

Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

On Tue, Nov 2, 2010 at 3:30 PM, Oren Laadan <[email protected]> wrote:
> Hi,
>
> Following the discussion yesterday, here is a linux-cr diff that
> is limited to changes to existing code.
>
> The diff doesn't include the eclone() patches. I also tried to strip
> off the new c/r code (either code in new files, or new code within
> #ifdef CONFIG_CHECKPOINT in existing files).
>
> I left a few such snippets in, e.g. c/r syscalls templates and
> declaration of c/r specific methods in, e.g. file_operations.
>
> The remaining changes in this patch include new freezer state
> ("CHECKPOINTING"), mostly refactoring of existing code, and a few
> new helpers.
>
> Disclaimer: don't try to compile (or apply) - this is only intended
> to give a ballpark of how the c/r patches change existing code.
[...]
> 159 files changed, 2031 insertions(+), 587 deletions(-)

FWIW...

This patch has far reaching changes which quite frankly scare me;
primarily because c/r changes many long-held assumptions about how
Linux processes work. It needs to track a large amount of state with
lots of corner cases, and the Linux process model is already quite
complex. I know this is a fluffy hand-waving critique, but without
being convinced of a strong general-purpose use-case, it is hard to
get excited about a solution that touches large amounts of common
code.

c/r of desktop processes doesn't seem interesting other than as a test
case. I can possibly be convinced about HPC, embedded, industrial,
or telecom use-cases, but for custom/specific-purpose applications the
question must be asked whether a fully user-space or joint user/kernel
method would better solve the problem.

You mentioned in a reply that this overview diff includes both
cleanups and required changes. I suggest posting the cleanup patches
as soon as possible so that this diff becomes simpler.

Also:

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 9458685..335a4b3 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -93,6 +93,10 @@ config STACKTRACE_SUPPORT
> config HAVE_LATENCYTOP_SUPPORT
> def_bool y
>
> +config CHECKPOINT_SUPPORT
> + bool
> + default y
> +

Definitely should not default to 'y', and needs to be user-selectable.

g.


2010-11-08 21:01:18

by Nathan Lynch

Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

Hi Grant,

On Mon, 2010-11-08 at 11:55 -0500, Grant Likely wrote:
> Also:
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 9458685..335a4b3 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -93,6 +93,10 @@ config STACKTRACE_SUPPORT
> > config HAVE_LATENCYTOP_SUPPORT
> > def_bool y
> >
> > +config CHECKPOINT_SUPPORT
> > + bool
> > + default y
> > +
>
> Definitely should not default to 'y', and needs to be user-selectable.

CHECKPOINT_SUPPORT is what an arch sets to indicate that it has support
for C/R -- the user-selectable option is in a generic location (and
defaults to n).
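
The split Nathan describes is the standard arch-capability idiom in
Kconfig. A sketch of how the two options would relate (only
CHECKPOINT_SUPPORT appears in the diff above; the generic option's name
and location here are my assumption):

```kconfig
# arch/x86/Kconfig -- the architecture advertises that it supports C/R
config CHECKPOINT_SUPPORT
	bool
	default y

# generic location, e.g. init/Kconfig -- the user-visible switch,
# offered only on architectures that set CHECKPOINT_SUPPORT
config CHECKPOINT
	bool "Checkpoint/restart support (EXPERIMENTAL)"
	depends on CHECKPOINT_SUPPORT
	default n
	help
	  Enable kernel support for checkpointing and restarting
	  process trees. If unsure, say N.
```

This keeps the arch symbol invisible to users while the prompt users
actually see still defaults to n, which answers Grant's objection.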

2010-11-11 06:27:59

by Nathan Lynch

Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

On Mon, 2010-11-08 at 11:55 -0500, Grant Likely wrote:
> On Tue, Nov 2, 2010 at 3:30 PM, Oren Laadan <[email protected]> wrote:
> > Hi,
> >
> > Following the discussion yesterday, here is a linux-cr diff that
> > is limited to changes to existing code.
> >
> > The diff doesn't include the eclone() patches. I also tried to strip
> > off the new c/r code (either code in new files, or new code within
> > #ifdef CONFIG_CHECKPOINT in existing files).
> >
> > I left a few such snippets in, e.g. c/r syscalls templates and
> > declaration of c/r specific methods in, e.g. file_operations.
> >
> > The remaining changes in this patch include new freezer state
> > ("CHECKPOINTING"), mostly refactoring of existing code, and a few
> > new helpers.
> >
> > Disclaimer: don't try to compile (or apply) - this is only intended
> > to give a ballpark of how the c/r patches change existing code.
> [...]
> > 159 files changed, 2031 insertions(+), 587 deletions(-)
>
> FWIW...
>
> This patch has far reaching changes which quite frankly scare me;
> primarily because c/r changes many long-held assumptions about how
> Linux processes work. It needs to track a large amount of state with
> lots of corner cases, and the Linux process model is already quite
> complex. I know this is a fluffy hand-waving critique, but without
> being convinced of a strong general-purpose use-case, it is hard to
> get excited about a solution that touches large amounts of common
> code.

For the most part the c/r patch set is "merely" adding code and not
changing the way existing code works -- I'm pretty sure we haven't had
to alter anything hairy like locking or object lifetime rules. Maybe
I've had my head in this code for too long, but I'm not seeing how
assumptions about the process model are changed significantly. The
process-related APIs like fork, clone, exec, wait, and exit all work
as they always have, and if you're not actively using C/R you'd never
know the capability is there.

As for the lack of a general-purpose use-case... well, it's not terribly
unusual for Linux to sustain significant changes to satisfy what some
may consider a niche need. Things like NUMA support, CPU and memory
hotplug - these were not "generally" useful features when they were
introduced. So I don't think we're trying to break new ground in that
respect.


> c/r of desktop processes doesn't seem interesting other than as a test
> case. I can possibly be convinced about HPC, embedded, industrial,
> or telecom use-cases, but for custom/specific-purpose applications the
> question must be asked whether a fully user-space or joint user/kernel
> method would better solve the problem.

This is in fact a joint approach -- the process tree is recreated in
user space at restart (not to mention that the user is responsible for
providing the restarted job a coherent view of the filesystem).

In any case, with HPC, C/R isn't necessarily just about fault
tolerance; it's for load-balancing and migration too. So the checkpoint operation
needs to be as fast and efficient as possible, and ideally the image
should be readable/writable as a stream e.g. over a socket. User space
really isn't up to this - for example, a user space implementation
generally cannot know which user pages are safe to omit from the image
(at least not without faulting them all in).
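
For a sense of what userspace has to work with here: since 2.6.25 Linux
does expose per-page residency through /proc/self/pagemap, so a
checkpointer can at least see which pages are resident without faulting
them in. A minimal sketch (the helper name is mine; bit 63 of each
64-bit entry is the kernel's "page present" flag):

```python
# Sketch of the per-page state visible from userspace: /proc/self/pagemap
# (Linux >= 2.6.25) holds one little-endian 64-bit entry per virtual
# page; bit 63 is the "page present" flag.
import ctypes
import mmap
import struct

PAGE = mmap.PAGESIZE

def page_present(addr):
    """True if the page containing virtual address `addr` is resident."""
    with open("/proc/self/pagemap", "rb") as f:
        f.seek((addr // PAGE) * 8)      # one 8-byte entry per page
        entry = struct.unpack("<Q", f.read(8))[0]
    return bool(entry & (1 << 63))

buf = mmap.mmap(-1, PAGE)               # fresh anonymous mapping
buf[0] = 1                              # touch it: fault the page in
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
print(page_present(addr))               # resident now that it was touched
```

Even so, residency alone doesn't tell you whether a resident file-backed
page still matches its backing store, so deciding what is truly safe to
omit from the image remains guesswork from userspace.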

Users who need C/R on Linux today are resorting to LD_PRELOAD hacks and
moribund out-of-tree kernel patches, and I'm afraid they're going to
keep doing that until Linux provides a better alternative built-in.

2010-11-17 05:29:31

by Anton Blanchard

Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch


Hi Grant,

> This patch has far reaching changes which quite frankly scare me;
> primarily because c/r changes many long-held assumptions about how
> Linux processes work. It needs to track a large amount of state with
> lots of corner cases, and the Linux process model is already quite
> complex. I know this is a fluffy hand-waving critique, but without
> being convinced of a strong general-purpose use-case, it is hard to
> get excited about a solution that touches large amounts of common
> code.
>
> c/r of desktop processes doesn't seem interesting other than as a test
> case. I can possibly be convinced about HPC, embedded, industrial,
> or telecom use-cases, but for custom/specific-purpose applications the
> question must be asked whether a fully user-space or joint user/kernel
> method would better solve the problem.

It seems like there are a number of questions around the utility of
C/R so I'd like to take a step back from the technical discussion
around implementation and hopefully convince you, Tejun (and anyone
else interested) that C/R is something we want to solve in Linux.

Here at IBM we are working on the next generation of HPC systems. One
example of this will be the NCSA Bluewaters supercomputer:

http://www.ncsa.illinois.edu/BlueWaters/

The aim is not to build yet another linpack special, but a supercomputer
that achieves more than 1 petaflop sustained on a wide range of
applications. There is also a strong focus on improving the
productivity and reliability of the cluster.

There are two usage scenarios for C/R in this environment:

1. Resource management. Any large HPC cluster should be 100% busy and
as such you will often fill in the gaps with low priority jobs which
may need to be preempted. These low priority jobs need to give up their
resources (memory, interconnect resources etc) whenever something
important comes in.

2. Fault tolerance. Failures are a fact of life for any decent sized
cluster. As the cluster gets larger these failures become very common.
Speaking from an industry perspective, MTBF rates measured in the order
of several hours for large commodity clusters are not surprising. We at
IBM improve on that with hardware and system design, but there is only
so much you can do. The failures also happen at the Linux kernel level
so even if we had 100% reliable systems we would still have issues.

Now this is the pointy end of HPC, but similar issues are happening in
the meat of the HPC market. One area we are seeing a lot of C/R
interest is the EDA space. As ICs become more and more complex the
amount of cluster compute power it takes to route, check, create masks
etc grows so large that system reliability becomes an issue. Some tool
vendors write their own application C/R, but there are a multitude of
in house applications that have no C/R capability today.

You could argue that we should just add C/R capability to every HPC
application and library people care about or rework them to be
fault tolerant in software. Unfortunately I don't see either as being
viable. There are so many applications, libraries and even programming
languages in use for HPC that it would be a losing battle. If we
did go down this route we would also be unable to leverage C/R for
anything else. I can understand the concern around finding a general
purpose case, but I do believe many other solid uses for C/R outside of
HPC will emerge. For example, there was interest from the embedded guys
during the KS discussion and I can easily imagine using C/R to bring up
firefox faster on a TV.

The problems found in HPC often turn into more general problems down
the track. I think back to the heated discussions we had around SMP back
in the early 2000s when we had 32 core POWER4s and SGI had similar sized
machines. Now a 24 core machine fits in 1U and can be purchased for
under $5k. NUMA support, CPU affinity and multi queue scheduling are
other areas that initially had a very small user base but have since
become important features for many users.

Anton

2010-11-17 11:08:58

by Tejun Heo

Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

Hello,

On 11/17/2010 06:29 AM, Anton Blanchard wrote:
> It seems like there are a number of questions around the utility of
> C/R so I'd like to take a step back from the technical discussion
> around implementation and hopefully convince you, Tejun (and anyone
> else interested) that C/R is something we want to solve in Linux.

I'm not arguing CR isn't that useful. My argument was that it's a
solution for fairly niche problems and that the implementation isn't
transparent at all for general use cases.

> Here at IBM we are working on the next generation of HPC systems. One
> example of this will be the NCSA Bluewaters supercomputer:

And, yeah, I agree that it is a very useful thing for HPC.

> You could argue that we should just add C/R capability to every HPC
> application and library people care about or rework them to be
> fault tolerant in software. Unfortunately I don't see either as being
> viable. There are so many applications, libraries and even programming
> languages in use for HPC that it would be a losing battle. If we
> did go down this route we would also be unable to leverage C/R for
> anything else. I can understand the concern around finding a general
> purpose case, but I do believe many other solid uses for C/R outside of
> HPC will emerge. For example, there was interest from the embedded guys
> during the KS discussion and I can easily imagine using C/R to bring up
> firefox faster on a TV.

Thanks for pointing out the use cases although for the last one it
would be much wiser to just use webkit.

> The problems found in HPC often turn into more general problems down
> the track. I think back to the heated discussions we had around SMP back
> in the early 2000s when we had 32 core POWER4s and SGI had similar sized
> machines. Now a 24 core machine fits in 1U and can be purchased for
> under $5k. NUMA support, CPU affinity and multi queue scheduling are
> other areas that initially had a very small user base but have since
> become important features for many users.

Sure, the pointy edges can discover general problems of the future early.
At the same time, they also encounter problems which no one else would
care about ever, so in itself it isn't much of an argument. I'm no
analyst and it is very difficult to foretell the future but comparing
CR to SMP and NUMA doesn't seem too valid to me. In-kernel CR is
sandwiched between userland CR and virtualization. Its problem space
is shrinking, not expanding.

Having a generally accepted standard CR implementation would certainly
be very nice and I understand that CR would be a much better fit for
HPC than virtualization, but I fail to see why it should be
implemented in kernel when userland implementation which doesn't
extend the kernel in any way already achieves most of what HPC
workload requires. In this already sizeable thread, the only benefits
presented seem to be the ability to cover some more corner cases and
remote use cases in a slightly more transparent manner. Those are very
weak arguments for something as intrusive and backwards (in that it
dumps kernel states in binary blobs unrestrained by ABI) as in-kernel
CR and, as such, I don't really see the in-kernel CR surviving as a
mainline feature.

So, I think the best recourse would be identifying the specific
features which would help userland CR and improve them. The in-kernel
CR people have been working on the problem for a long time now and
gotta know which parts are tricky and how to solve them. In fact, I
don't think the work would be that widely different. It would be
harder, but those changes would benefit other use cases too instead of
being useful only for in-kernel CR.

Thanks.

--
tejun

2010-11-18 09:54:52

by Alan

Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

> The problems found in HPC often turn into more general problems down
> the track. I think back to the heated discussions we had around SMP back
> in the early 2000s when we had 32 core POWER4s and SGI had similar sized
> machines. Now a 24 core machine fits in 1U and can be purchased for
> under $5k. NUMA support, CPU affinity and multi queue scheduling are
> other areas that initially had a very small user base but have since
> become important features for many users.

"I'd prefer the trees to be separate for testing purposes: it
doesn't make much sense to have SMP support as a normal
kernel feature when most people won't have SMP anyway"
-- Linus Torvalds


Only in this case I can't help feeling that the virtualisation work
already bypassed C/R, solved the problem space that a lot of people care
about and then moved on.

Alan

2010-11-18 12:27:09

by Alexey Dobriyan

Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

On Thu, Nov 18, 2010 at 11:53 AM, Alan Cox <[email protected]> wrote:
> Only in this case I can't help feeling that the virtualisation work
> already bypassed C/R,

This discussion should have happened like 10 years ago. :-\

> solved the problem space that a lot of people care about and then moved on.

2010-11-19 06:33:44

by Gene Cooperman

Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

> 1. Resource management. Any large HPC cluster should be 100% busy and
> as such you will often fill in the gaps with low priority jobs which
> may need to be preempted. These low priority jobs need to give up their
> resources (memory, interconnect resources etc) whenever something
> important comes in.
>
> 2. Fault tolerance. Failures are a fact of life for any decent sized
> cluster. As the cluster gets larger these failures become very common.
> Speaking from an industry perspective, MTBF rates measured in the order
> of several hours for large commodity clusters are not surprising. We at
> IBM improve on that with hardware and system design, but there is only
> so much you can do. The failures also happen at the Linux kernel level
> so even if we had 100% reliable systems we would still have issues.

We have also been somewhat involved in HPC. Anton provides a nice
summary of the two usage scenarios for checkpoint-restart that we have
also observed.

Since there is continuing discussion of HPC, I was a little surprised that
there has not been more discussion of BLCR (Berkeley Lab Checkpoint/Restart).
A brief introduction to BLCR follows, in case it's of interest.

https://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml

In the HPC space, we have observed that many sites use BLCR for
checkpoint-restart. BLCR is based on a kernel module, and so represents a third
approach. As mentioned in the FAQ, BLCR can checkpoint/restart a
process tree/group/session but has certain limitations: it does not
support sockets or ptys, and it restores original pids on restart only
if there is no collision with current pids. Nevertheless, BLCR has
achieved wide usage in the
HPC community. Quoting from the BLCR FAQ:

Q: Does BLCR support checkpointing parallel/distributed applications?

Not by itself. But by using checkpoint callbacks (see previous FAQ), some
MPI implementations have made themselves checkpointable by BLCR. You can
checkpoint/restart an MPI application running across an entire cluster of
machines with BLCR, without any application code modifications, if you use
one of these MPI implementations (listed alphabetically):
* LAM/MPI 7.x or later
* MPICH-V 1.0.x
* MVAPICH2 0.9.8 or later
* Open MPI 1.3 or later

Q: Is BLCR integrated with any batch systems?

We are aware of the following, but we are not always informed of new
efforts to integrate with BLCR. For the most up-to-date information you
should consult the support channels of your favorite batch system.
* TORQUE version 2.3 and later
Support for serial and parallel jobs, including periodic checkpoints and
qhold/qrls.
* SLURM version 2.0 and later
Support for automatic (periodic) and manually requested checkpoints.
* SGE (aka Sun Grid Engine)
Information on configuring SGE to use BLCR can be found here. There is
also a thread on the [email protected] list about modifications to those
instructions. The thread begins with this posting.
* LSF
Information on configuring LSF to use BLCR can be found in this posting
on the [email protected] list.
* Condor
Information on configuring Condor to use BLCR to checkpoint "Vanilla
Universe" jobs with the help of Parrot can be found here.

- Gene

2010-11-21 23:20:54

by Grant Likely

Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

On Tue, Nov 16, 2010 at 10:29 PM, Anton Blanchard <[email protected]> wrote:
> Hi Grant,
[...]
> There are two usage scenarios for C/R in this environment:
>
> 1. Resource management. Any large HPC cluster should be 100% busy and
> as such you will often fill in the gaps with low priority jobs which
> may need to be preempted. These low priority jobs need to give up their
> resources (memory, interconnect resources etc) whenever something
> important comes in.
>
> 2. Fault tolerance. Failures are a fact of life for any decent sized
> cluster. As the cluster gets larger these failures become very common.
> Speaking from an industry perspective, MTBF rates measured in the order
> of several hours for large commodity clusters are not surprising. We at
> IBM improve on that with hardware and system design, but there is only
> so much you can do. The failures also happen at the Linux kernel level
> so even if we had 100% reliable systems we would still have issues.
>
> Now this is the pointy end of HPC, but similar issues are happening in
> the meat of the HPC market. One area we are seeing a lot of C/R
> interest is the EDA space. As ICs become more and more complex the
> amount of cluster compute power it takes to route, check, create masks
> etc grows so large that system reliability becomes an issue. Some tool
> vendors write their own application C/R, but there are a multitude of
> in house applications that have no C/R capability today.

I agree, and I think this is exactly the place where the discussions
about c/r need to be focused (the pointy end). I don't tend to swoon
at the idea of c/r'ing my desktop session because it doesn't represent
a real or interesting problem for me. However, I do see the value in
the scenarios described above. I have another for you; I peripherally
worked on a telephone switch system that used a form of C/R for the
call processing task to synchronise with a hot-standby node for
uninterrupted cut-over in the event of failure. /my/ concerns are
more of the "what is the impact on the kernel?" type.

> You could argue that we should just add C/R capability to every HPC
> application and library people care about or rework them to be
> fault tolerant in software. Unfortunately I don't see either as being
> viable. There are so many applications, libraries and even programming
> languages in use for HPC that it would be a losing battle. If we
> did go down this route we would also be unable to leverage C/R for
> anything else.

Fair enough, and I do somewhat agree with this. However the question
remains, what are the constraints? What are the limitations and
boundaries? Oden describes the constrains on the current c/r patches.
How well do those match up with the use cases discussed above? How
does DMTCP match up with those use cases?

> I can understand the concern around finding a general
> purpose case, but I do believe many other solid uses for C/R outside of
> HPC will emerge. For example, there was interest from the embedded guys
> during the KS discussion and I can easily imagine using C/R to bring up
> firefox faster on a TV.

Heh, sounds like doing the initial-program-load (IPL) stage like I
used to do on telephone switch firmware. :-)

>
> The problems found in HPC often turn into more general problems down
> the track. I think back to the heated discussions we had around SMP back
> in the early 2000s when we had 32 core POWER4s and SGI had similar sized
> machines. Now a 24 core machine fits in 1U and can be purchased for
> under $5k. NUMA support, CPU affinity and multi queue scheduling are
> other areas that initially had a very small user base but have since
> become important features for many users.
>
> Anton
>



--
Grant Likely, B.Sc., P.Eng.
Secret Lab Technologies Ltd.