2004-04-28 19:44:50

by Neal D. Becker

Subject: State of linux checkpointing?

I wonder if there is a checkpointing tool that will work with 2.6 kernels?

I only need relatively basic checkpointing. No sockets or fancy stuff.



2004-04-28 20:28:57

by Jeff Garzik

Subject: Re: State of linux checkpointing?

Neal D. Becker wrote:
> I wonder if there is a checkpointing tool that will work with 2.6 kernels?
>
> I only need relatively basic checkpointing. No sockets or fancy stuff.


You only need checkpointing when your application programmers are lazy
and don't care about data integrity. :)

Jeff



2004-04-28 23:18:02

by Tim Connors

Subject: Re: State of linux checkpointing?

Jeff Garzik said on Wed, 28 Apr 2004 16:23:00 -0400:
> Neal D. Becker wrote:
> > I wonder if there is a checkpointing tool that will work with 2.6 kernels?
> >
> > I only need relatively basic checkpointing. No sockets or fancy stuff.
>
> You only need checkpointing when your application programmers are lazy
> and don't care about data integrity. :)

Or you are running some kind of cluster where you want the
applications to be checkpointed transparently without the application
knowing the details of how or when they will be swapped out (but this
will need sockets anyway, so won't happen anytime soon).

'Tis a pain that the Alpha cluster here can suspend long-running jobs
for a pile of smaller jobs, and then resume, but the Linux cluster can
do no such fanciness (yes, we do manual checkpointing, but it's prone
to bugs - and finding such a bug after 30 days of compute time really
sucks balls).


--
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
Beware of Programmers who carry screwdrivers.

2004-04-29 09:42:20

by Ihar 'Philips' Filipau

Subject: Re: State of linux checkpointing?

Neal Becker wrote:
>
> I want checkpointing for:
>
> 1) Protect against job interruption due to system crash, operator error,
> power loss, whatever
>
> 2) Job migration. Even manual job migration would be nice.
>

Several months ago some guys from Ukraine announced a kernel patch
that lets you save a task to a file and later load it back from that
file.
Saving a task to disk, say, every 5 seconds is a sort of
checkpointing, so later you will be able to go back in time :-)
You might want to check the l-k archives; I cannot recall the name of
this patch.
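
For comparison, the same save-and-reload idea can be done inside the
application itself, which is the manual checkpointing mentioned earlier
in the thread. Below is a minimal Python sketch of that approach. It is
not the kernel patch and says nothing about how that patch works. All
names here (save_checkpoint, load_checkpoint, run_job, the stop_after
parameter, the toy "acc" state) are made up for illustration, and it
checkpoints every N steps rather than every N seconds.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state to path via a temp file plus rename, so a crash
    mid-write never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run_job(path, total_steps, checkpoint_every, stop_after=None):
    """Toy long-running job: sums the step numbers 0..total_steps-1.

    stop_after is a hypothetical knob that simulates an interruption
    after that many steps in this run.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    done = 0
    while state["step"] < total_steps:
        if stop_after is not None and done >= stop_after:
            return state  # simulated crash: unsaved progress is lost
        state["acc"] += state["step"]
        state["step"] += 1
        done += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state
```

Calling run_job again with the same path after an interruption resumes
from the last saved step instead of from zero. Work done since that
checkpoint is repeated, which is the usual trade-off when choosing the
checkpoint interval.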