2002-12-27 00:43:03

by Anomalous Force

[permalink] [raw]
Subject: holy grail

a hot swap kernel would be something like the holy grail of kernel
hacking. logically, it would go something like this:

void kexec_hot_swap(void)
{
        void *kern = load_kernel_into_mem();

        syscall_queue(ENABLE);  /* queue all sys calls */
        irq_queue(ENABLE);      /* queue all irqs */

        /* bring the new kernel's state in line with the current one's.
         * this includes all data structures, module hooks, etc.
         * this needs to be very fast, as irqs will be pending... */
        synch_kernel(kern);

        kernel_start(kern);     /* fire in the hole... */
}

at this point the new kernel would know it is being started as a
hot swap through a flag or something, and dequeue the irqs
that are pending, followed by the sys calls that are waiting.
if this goes how i think it should, a user running on the system
won't even know the kernel was swapped.
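
roughly, the new kernel's side might look like the sketch below. to
be clear, every name in it is made up for illustration; none of
these helpers exist anywhere:

/* all of this is hypothetical pseudocode, like the function above */
void hot_swap_resume(void)
{
        extern int hot_swap_flag;       /* set by the old kernel */
        struct pending_irq *irq;
        struct pending_syscall *sc;

        if (!hot_swap_flag)
                return;         /* normal cold boot, nothing queued */

        /* replay hardware events first, they are the most urgent */
        while ((irq = irq_queue_pop()) != NULL)
                handle_irq(irq->nr);

        /* then let the queued sys calls proceed */
        while ((sc = syscall_queue_pop()) != NULL)
                dispatch_syscall(sc);
}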

what do you think? is it do-able?

=====
Main Entry: anom·a·lous
1 : inconsistent with or deviating from what is usual, normal, or expected: IRREGULAR, UNUSUAL
2 (a) : of uncertain nature or classification (b) : marked by incongruity or contradiction : PARADOXICAL
synonym see IRREGULAR

2002-12-27 03:55:40

by Werner Almesberger

[permalink] [raw]
Subject: Re: holy grail

Anomalous Force wrote:
> a hot swap kernel would be something like the holy grail of kernel
> hacking.

:-) This comes up every once in a while. The closest approximation
you have for this is swsusp. But you'd of course want to start a
non-identical kernel. And that's where the hard problems lie.

An older or newer kernel would have different data structures, and
possibly even data structures which are arranged in a different way
(e.g. a hash becomes some kind of tree, etc.). So you'd need some
translation mechanism that "upgrades" or "downgrades" all kernel
data, including all pointers and offsets that may be sitting
around somewhere. Good luck :-)

Your best bet would be to use a system that already implements some
form of checkpointing or process migration, and use this to
preserve user space state across kexec reboots. openMosix may be
relatively close to being able to do this for general user space.
(I don't know what openMosix currently can do, but many of the
problems they need to solve are similar in nature.)

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2002-12-27 07:13:27

by Anomalous Force

[permalink] [raw]
Subject: Re: holy grail


--- Werner Almesberger <[email protected]> wrote:

[snip]

> An older or newer kernel would have different data structures, and
> possibly even data structures which are arranged in a different way
> (e.g. a hash becomes some kind of tree, etc.). So you'd need some
> translation mechanism that "upgrades" or "downgrades" all kernel
> data, including all pointers and offsets that may be sitting
> around somewhere. Good luck :-)

what if the new kernel asked the old kernel to hand over the data in
a form that was understood universally beginning at some kernel
version X (the earliest supported kernel, in other words)? the data
would not have to remain in the optimized form that it would reside
in while under normal operations. it could be serialized into
a form that simply contains its content and context. i'm thinking of
something along the lines of a data packet (tcp/ip comes to mind)
that contains data about its data: a structure similar to that, which
conveys information describing the data it contains. any mechanisms
the newer kernel may institute would get set to a default state,
similar to booting just that portion of the kernel.
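
for illustration, a self-describing record could look something like
this. the tags and field names are made up; the only point is the
tag-length-value idea:

#include <stdint.h>

enum kstate_tag {
        KSTATE_TASK = 1,        /* a process and its credentials */
        KSTATE_FD   = 2,        /* an open file description */
        KSTATE_SOCK = 3,        /* a socket endpoint */
};

struct kstate_record {
        uint32_t tag;           /* what kind of data follows */
        uint32_t version;       /* layout version for this tag */
        uint32_t len;           /* byte length of data[] */
        uint8_t  data[];        /* the serialized content itself */
};

the old kernel would emit a stream of these records, and the new one
would walk the stream, dispatching on tag and version.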

>
> Your best bet would be to use a system that already implements some
> form of checkpointing or process migration, and use this to
> preserve user space state across kexec reboots. openMosix may be

[snip]

preserving user state would not be so much the problem as would
the various internal kernel data structures (vm stuff, dcache, etc.).
the issue here is to freeze the system state, sys calls, irqs, and
all, and restart the same state where it left off after the switch.
the kernel would not need to boot, as an initial boot has already
been done by the previous kernel.

yes, it would be extremely difficult. but, as with all fields of
endeavour, a holy grail is only such because it is. the question
remains: is this do-able? perhaps not now, or in two years, but
what about five? say, kernel 3.x.x or even 4.x.x?


2002-12-27 11:22:40

by Werner Almesberger

[permalink] [raw]
Subject: Re: holy grail

Anomalous Force wrote:
> what if the new kernel asked the old kernel to hand over the data in
> a form that was understood universally beginning at some kernel
> version X (the earliest supported kernel, in other words).

Yes, and that information would ideally just be what is visible
from user space. This gives you a well-defined abstraction, and
limits the dependency on kernel internals.

> im thinking of something along the lines of a data packet (tcp/ip
> comes to mind) that contains data about its data.

I guess you never looked at how much state TCP really carries
around :-) For a rough idea, you may want to have a look at
tcpcp (TCP Connection Passing), which does pretty much what you'd
have to do for this kind of checkpointing:
http://www.almesberger.net/tcpcp/

Now, there are a few things to consider:

- tcpcp ignores several rare conditions, such as urgent mode
- tcpcp doesn't even try (yet) to preserve congestion control
  information, which is about twice the current amount of
  information again
- even with all those constraints, there are almost certainly
  some things I've overlooked
- that's only TCP, i.e. one of several networking protocols. And
  networking is just one of many subsystems. And what tcpcp does
  is not even transparent to applications.
- while TCP is certainly not trivial, there is a reasonably well
  defined abstraction of its state, which simplifies this kind of
  checkpointing

And remember, this is still only about what can be seen from user
space. No attempt is made to transplant timers, memory allocations,
cloned skbs, etc.
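
To give a flavor, here is a rough and very incomplete inventory of
the per-connection state such a checkpoint needs to carry; the
struct is purely illustrative, not tcpcp's actual format:

#include <stdint.h>

struct tcp_ckpt {
        uint32_t snd_una, snd_nxt;      /* send window edges */
        uint32_t rcv_nxt;               /* next sequence expected */
        uint32_t snd_wnd, rcv_wnd;      /* advertised windows */
        uint16_t mss;                   /* negotiated segment size */
        /* plus the retransmit queue contents, timers and their
         * deadlines, out-of-order segments, negotiated options
         * (SACK, timestamps, window scaling), and the congestion
         * control variables mentioned above */
};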

> yes, it would be extremely difficult. but, as with all fields of
> endevour, a holy grail is only such because it is. the question
> remains, is this do-able? perhaps not now, or in two years, but
> what about five? say, kernel 3.x.x or even 4.x.x?

For full direct kernel-to-kernel migration, I'm fairly confident
the answer is "never", simply because it doesn't make sense, and
because it would be completely unmaintainable (1,2). (I expect to
see some information passing for things like IDE or SCSI bus scan
results, though.)

(1) Okay, I'll reverse my prognosis when we've had, say, ten new
    kernels in a row, without any obvious major bugs or build
    problems :-)
(2) If you dig out IFS, you'll see a nice example of why you
    don't want to create too many dependencies on kernel
    internals :-) http://www.almesberger.net/epfl/ifs.html

For userspace-to-userspace, we can probably do some things already
today (e.g. "classical" batch jobs), and I guess we might be able
to migrate reasonably complete systems in maybe one or two years,
if somebody starts working on the corner cases that aren't of much
interest for process migration (e.g. video, audio, etc.).

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2002-12-27 13:05:44

by Ingo Oeser

[permalink] [raw]
Subject: Re: holy grail

Hi,

On Thu, Dec 26, 2002 at 11:21:42PM -0800, Anomalous Force wrote:
[hot swapping the kernel]
> yes, it would be extremely difficult. but, as with all fields of
> endevour, a holy grail is only such because it is. the question
> remains, is this do-able? perhaps not now, or in two years, but
> what about five? say, kernel 3.x.x or even 4.x.x?

I would just say: Start it yourself and see how hard it is.

You might find contributors and even sponsors while you actually
work on it. There certainly is a market for this feature, because
there are OSes that support this. It would be nice if Linux could
crack into this market, too.

So happy hacking and good luck!

Regards

Ingo Oeser
--
Science is what we can tell a computer. Art is everything else. --- D.E.Knuth

2002-12-27 13:17:35

by Pavel Machek

[permalink] [raw]
Subject: Re: holy grail


Hi!

> > Your best bet would be to use a system that already implements some
> > form of checkpointing or process migration, and use this to
> > preserve user space state across kexec reboots. openMosix may be
>
> [snip]
>
> preserving user state would not be so much the problem as would
> the various internal kernel data structures (vm stuff, dcache, etc.)

Actually, you want to kill vm structures, dcache etc. You only want
userspace-visible state to be carried forward, to minimize the
possibility of carrying bugs over into the new kernel.
Pavel
--
Worst form of spam? Adding advertisement signatures ala sourceforge.net.
What goes next? Inserting advertisements *into* email?

2002-12-28 16:26:59

by Anomalous Force

[permalink] [raw]
Subject: Re: holy grail

--- Werner Almesberger <[email protected]> wrote:

> Anomalous Force wrote:
> > what if the new kernel asked the old kernel to hand over the data
> > in a form that was understood universally beginning at some kernel
> > version X (the earliest supported kernel, in other words).
>
> Yes, and that information would ideally just be what is visible
> from user space. This gives you a well-defined abstraction, and
> limits the dependency on kernel internals.
>
> > im thinking of something along the lines of a data packet (tcp/ip
> > comes to mind) that contains data about its data.
>
> I guess you never looked at how much state TCP really carries
> around :-) For a rough idea, you may want to have a look at
> tcpcp (TCP Connection Passing), which does pretty much what you'd
> have to do for this kind of checkpointing:
> http://www.almesberger.net/tcpcp/

you miss my point. i'm not saying to model it after tcp/ip. that
was just a reference to a method of data exchange wherein the
data has metadata to describe it.

> > yes, it would be extremely difficult. but, as with all fields of
> > endeavour, a holy grail is only such because it is. the question
> > remains: is this do-able? perhaps not now, or in two years, but
> > what about five? say, kernel 3.x.x or even 4.x.x?
>
> For full direct kernel-to-kernel migration, I'm fairly confident
> the answer is "never", simply because it doesn't make sense, and

it makes full sense in an enterprise with 3000+ users that operates
24/7/365. no scheduled down-time for kernel upgrades.

> because it would be completely unmaintainable (1,2). (I expect to

this is not true. if the system were an integral part of the overall
design, then programming would include it a priori.

> see some information passing for things like IDE or SCSI bus scan
> results, though.)

there is a fine distinction between kernel migration and hot-swap.
in a hot-swap setup, there will be signals pending from devices
that are contextually needed to continue operations that were being
performed. in a migration, the system is for all practical purposes
in a rebooted state after the switch, and thus no context is
conveyed to the new kernel. keeping only the user-space context
would not allow hot-swap, unless all device activity was guaranteed
to be complete while the user-space context was queued. _this_, i do
not believe is possible, as some devices will continue to need
context-specific updating, and such will not be possible while the
servicing of the device depends on context information that is
stuck in a queue. the situation would be something like a person
stuck in mid-sentence trying to tell someone on a telephone how to
disarm a time-bomb. the clock still ticks even though the
instructor is silent.

> (2) If you dig out IFS, you'll see a nice example of why you
> don't want to create too many dependencies on kernel
> internals :-) http://www.almesberger.net/epfl/ifs.html

no comparison. the data transfer mechanism would be integral to
the kernel, and thus the only dependencies would be internal. it
would be an integrated api (from an external viewpoint).

>
> - Werner
>
> --


2002-12-28 20:35:26

by Rik van Riel

[permalink] [raw]
Subject: Re: holy grail

On Sat, 28 Dec 2002, Anomalous Force wrote:
> --- Werner Almesberger <[email protected]> wrote:
>
> > because it would be completely unmaintainable (1,2). (I expect to
>
> this is not true. if the system were an integral part of the overall
> design, then programming would include it a priori.

This has been said before, but "for some reason" everybody
who said it went quiet the moment they started working on
a patch and have never been heard from again.

Either they're still working on the problem (after four
years) or they've moved on to an easier/realistic project.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://guru.conectiva.com/
Current spamtrap: [email protected]

2002-12-29 15:47:40

by Anomalous Force

[permalink] [raw]
Subject: Re: holy grail


--- Rik van Riel <[email protected]> wrote:
> On Sat, 28 Dec 2002, Anomalous Force wrote:
> > --- Werner Almesberger <[email protected]> wrote:
> >
> > > because it would be completely unmaintainable (1,2). (I expect to
> >
> > this is not true. if the system were an integral part of the overall
> > design, then programming would include it a priori.
>
> This has been said before, but "for some reason" everybody
> who said it went quiet the moment they started working on
> a patch and have never been heard from again.
>
> Either they're still working on the problem (after four
> years) or they've moved on to an easier/realistic project.

i have stated this would be extremely difficult. no single person
could attempt this without the support of the other developers, as
the effort must include all aspects of the kernel to some extent.
the original discussion for this was to show that kexec() _could_
become something that is a holy grail among kernel developers:
hot-swap. if all of the kernel developers think this can not be done,
then it is not worth discussing any further. for a single person to
make this happen, it would require that a single kernel version
become frozen, and all aspects of it altered to support the
operations of hot swapping. in such a scenario, the development of
the mainstream kernel would have progressed to the point that
anyone attempting to apply a patch would find it futile, as the code
base for the patch has become obsolete. it is for this reason that i
say it would take the awareness of all the developers moving forward.

>
> regards,
>
> Rik
> --
> Bravely reimplemented by the knights who say "NIH".
> http://www.surriel.com/ http://guru.conectiva.com/
> Current spamtrap: [email protected]


2002-12-29 16:36:58

by John Bradford

[permalink] [raw]
Subject: Re: holy grail

> > This has been said before, but "for some reason" everybody
> > who said it went quiet the moment they started working on
> > a patch and have never been heard from again.
> >
> > Either they're still working on the problem (after four
> > years) or they've moved on to an easier/realistic project.
>
> i have stated this would be extremely difficult. no single person
> could attempt this without the support of the other developers as
> the effort must include all aspects of the kernel to some extent.
> the original discussion for this was to show that kexec() _could_
> become something that is a holy grail among kernel developers:
> hot-swap.

Why take the easy road, ( :-) ), and merely make the kernel
hot-swappable? You could use the code from the User Mode Linux
project as a starting point for creating a Meta Kernel Mode Linux
project, run several kernel images concurrently as user mode
processes of the top-level kernel, and then add the code necessary
to connect any particular physical hardware to the MKML virtual
machines.

Then, you could migrate your applications from kernel to kernel
without ever having to re-boot.

Mainframe power on the desktop :-)

John.

2002-12-29 23:45:16

by Werner Almesberger

[permalink] [raw]
Subject: Re: holy grail

Anomalous Force wrote:
> you miss my point. im not saying to model it after tcp/ip. that
> was just a reference to a method of data exchange wherein the
> data has metadata to describe it.

I understood that. What I was saying is that metadata in a TCP
connection is usually not sufficient for restoring the endpoint
state.

> it makes full sense in an enterprise with 3000+ users that operates
> 24/7/365. no scheduled down-time for kernel upgrades.

I don't disagree with the usefulness of such functionality, but
I disagree with the level at which you suggest to implement this.

The approach of trying to migrate low-level kernel state has the
following problems/disadvantages:

- complexity
- does not allow recovery from corrupt kernel state, as Pavel has
  suggested
- does not support recovery from corrupt hardware state
- does not support substitution of infrastructure (e.g. what if
  I want to fail over to a different machine, maybe quickly
  replace some non-hotpluggable hardware (*), or even swap that
  old disk with a new one that has completely different
  characteristics?)

So, compared to an approach that implements this at the kernel to
user space API level, you get a lot of extra complexity, but miss
several very desirable features.

(*) While the "big iron" in your data center may have hot-swappable
    CPUs and everything, it would be nice if such things could also
    be done with commodity hardware that doesn't provide such
    luxury.

> this is not true. if the system were an integral part of the overall
> design, then programming would include it a priori.

Making something part of the design alone doesn't guarantee that
this is a good approach, nor that it will actually work :-)

> there is a fine distinction between kernel migration, and hot-swap.
> in a hot-swap setup, there will be signals pending from devices
[...]

Err, yes, but what does your "hot-swap" do that kernel migration
doesn't?

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2002-12-30 00:15:36

by Alan

[permalink] [raw]
Subject: Re: holy grail

On Sat, 2002-12-28 at 20:43, Rik van Riel wrote:
> Either they're still working on the problem (after a four
> years) or they've moved on to an easier/realistic project.

Or they read a book on clusters and figured it out

Roughly speaking

If you care about uptime to the point of live kernel updates
Additional systems are acceptable costs
Hardware failure is also unacceptable
Clustering is cheaper than solving the kernel on the fly problem

2002-12-30 01:25:06

by Werner Almesberger

[permalink] [raw]
Subject: Re: holy grail

Alan Cox wrote:
> If you care about uptime to the point of live kernel updates

Yes, but there are more applications than improving overall uptime.
E.g. during development or other testing, it would be convenient to
be able to switch back and forth between distinct kernels, without
necessarily taking down the entire machine. Likewise for trivial
hardware changes.

Also, I don't think the instrumentation required would be all that
horrible: things can be done incrementally, and I'd expect a lot
of the functionality to be useful for other purposes, too.

I see a certain trend towards mechanisms that can be useful for
process migration. E.g. the address space manipulations discussed
for UML seem to allow almost perfect reconstruction of processes.
PIDs, signals, anything with externally visible changes in kernel
state that aren't immediately seen by the application (networking,
tty editing, etc.), and such, would need extra instrumentation, of
course.

With this in place, we'd need a set of mechanisms that allow us to
find out what the process state actually is like, e.g. determining
what hangs off a certain fd, and what its state is. A lot of this
is already available via /proc, so that may be a starting point.
Programs that talk directly to hardware (e.g. X11) would need a
bit more work.
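
For the fd part, a lot is indeed visible today. A quick userspace
sketch (the pid 1234 is arbitrary) that walks /proc/<pid>/fd and
reads the symlink targets to see what each descriptor refers to:

#include <stdio.h>
#include <unistd.h>
#include <dirent.h>

int main(void)
{
        char path[64], target[256];
        struct dirent *de;
        DIR *dir = opendir("/proc/1234/fd");

        if (!dir) {
                perror("opendir");
                return 1;
        }
        while ((de = readdir(dir)) != NULL) {
                ssize_t n;

                if (de->d_name[0] == '.')
                        continue;
                snprintf(path, sizeof(path), "/proc/1234/fd/%s",
                         de->d_name);
                n = readlink(path, target, sizeof(target) - 1);
                if (n < 0)
                        continue;
                target[n] = '\0';
                /* prints e.g. "socket:[1234]", "pipe:[42]", or a path */
                printf("fd %s -> %s\n", de->d_name, target);
        }
        closedir(dir);
        return 0;
}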

Then add a bit of synchronization, and we can migrate individual
processes. Add more synchronization, and we can migrate full user
space. Add some really fast disks, and this will be quick enough
for "on the fly" kernel swapping. Add a means for preserving user
memory and swap, and you may not even need fast disks.

Uh, sounds almost too easy ;-)

Of course, as a first step, it would make sense to have a good
look at what projects like (open)Mosix have already done in this
area. After all, they've already solved most aspects of process
migration.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2002-12-30 02:45:16

by Jeff Dike

[permalink] [raw]
Subject: Re: holy grail

[email protected] said:
> I see a certain trend towards mechanisms that can be useful for
> process migration. E.g. the address space manipulations discussed for
> UML seem to allow almost perfect reconstruction of processes. PIDs,
> signals, anything with externally visible changes in kernel state

With a UML running in skas mode, the process address space is identical to
what it would be on the host. Migrating one to the host would be a
matter of:
        Sticking a process in it
        Releasing that process from ptrace
        Recreating the required kernel state in the host kernel
        Kicking the process out of the UML kernel and into userspace somehow
        Letting it run

Step 3 is obviously where the meat of the problem is. The process needs
to have available on it all the resources it had in UML -
        the same files
        network connections (puntable on a first pass)
        process relationships (I have no idea what to do about a parent
        process on the host, nor what to do with children whose parent
        has been migrated, or ipc mechanisms, except to do the Mosix
        thing and have little proxies sitting around passing information
        between UML and the host).

And since I've brought up Mosix, as did Werner, the fastest way to get
this working is probably to finish off the OpenMosix/UML port (which was
close from what I heard), and cluster a UML and its host. You should get
process migration for free.

Just remember to prevent the host from trying to migrate a UML to itself.
That would be very bad.

Jeff

2002-12-30 03:59:31

by David Lang

[permalink] [raw]
Subject: Re: holy grail

On Sun, 29 Dec 2002, Jeff Dike wrote:

> Step 3 is obviously where the meat of the problem is. The process needs
> to have available on it all the resources it had in UML -
> the same files
> network connections (puntable on a first pass)
> process relationships (I have no idea what to do about a parent
> process on the host, nor what to do with children whose parent has been
> migrated, or ipc mechanisms, except to do the Mosix thing and have little
> proxies sitting around passing information between UML and the host).

I think people are at the point of working on this because it sounds
like a worthwhile feature, not because it's actually anything that
would be used.

what possible application needs to be able to do a seamless kernel
upgrade that wouldn't be using a network?

if it's a batch processing task, it can checkpoint itself and restart
after a reboot.

if it's a controller of specialized equipment then you can either
have the process checkpoint itself, or you can't afford to pause
long enough to do the kernel swap (i.e. the device keeps operating
regardless and so may generate signals to the program during the
time when you are swapping kernels)

As Alan Cox said, anyone really needing this will have redundant
systems anyway (to cover the case of hardware failure), so they will
already be dealing with things on a cluster level, and rebooting a
machine to complete the upgrade will not be that bad (they upgrade
the backup, reboot it, sync things up, fail over to the backup,
upgrade and reboot the primary, and keep running on the backup until
the next upgrade cycle)

David Lang

2002-12-30 04:30:47

by Anomalous Force

[permalink] [raw]
Subject: Re: holy grail


--- David Lang <[email protected]> wrote:
>
> I think people are at the point of working on this because it
> sounds like a worthwhile feature, not because it's actually
> anything that would be used.

UML sounds like a worthwhile feature; turns out it's actually pretty
useful too. kexec() is supported in its current incarnation. why
not simply extend it the one step further?

>
> what possible application needs to be able to do a seamless kernel
> upgrade that wouldn't be using a network?

"programs will never use more than 640K of memory." - bill gates

let's talk clusters... the teragrid system being built out of 2024
redhat 7.2 installs (ncsa alone, not counting the 3 other cluster
sites). imagine a simple system on the network to push a copy of the
new kernel and then tell each node to hot-swap. 0 downtime.
__super__ easy to maintain. how easy would that become? how about
this... an nfs mount point in the grid for /boot, such that each node
gets the kernel from a central point and hot swaps when a flag
is set, or a change is detected in the /boot directory. no push even
needed then. the cost savings from that alone would be worth the
effort to them.
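
something like this little watcher on each node, say. to be clear,
this is only a sketch: kexec_hot_swap_load() is hypothetical, and
the path and poll period are arbitrary:

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

#define KERNEL_IMAGE "/boot/vmlinuz"    /* shared via nfs in this idea */

extern int kexec_hot_swap_load(const char *image);      /* hypothetical */

int main(void)
{
        struct stat st;
        time_t last = 0;

        for (;;) {
                if (stat(KERNEL_IMAGE, &st) == 0 && st.st_mtime != last) {
                        if (last != 0)  /* skip the stat at startup */
                                kexec_hot_swap_load(KERNEL_IMAGE);
                        last = st.st_mtime;
                }
                sleep(30);              /* arbitrary poll period */
        }
        return 0;
}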

>
> if it's a batch processing task, it can checkpoint itself and
> restart after a reboot.
>

2024 nodes rebooting, how much time needed while the system is in a
degraded state?

> if it's a controller of specialized equipment then you can either
> have the process checkpoint itself, or you can't afford to pause
> long enough to do the kernel swap (i.e. the device keeps operating
> regardless and so may generate signals to the program during the
> time when you are swapping kernels)
>

hence a queue to catch pending irqs while the system swaps over.
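
as a sketch, the catch queue could be as dumb as a ring buffer:
while the swap is in progress, a stub handler records the irq number
instead of servicing it. all names here are made up:

#define HOT_SWAP_QLEN 256

static int pending_irqs[HOT_SWAP_QLEN];
static int pending_head, pending_tail;

/* installed in place of the real handlers during the swap */
void hot_swap_stub_handler(int irq)
{
        int next = (pending_head + 1) % HOT_SWAP_QLEN;

        if (next != pending_tail) {     /* drop on overflow */
                pending_irqs[pending_head] = irq;
                pending_head = next;
        }
}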


2002-12-30 04:51:52

by Anomalous Force

[permalink] [raw]
Subject: Re: holy grail


--- Anomalous Force <[email protected]> wrote:

> lets talk clusters... the teragrid system being built out of 2024
> redhat 7.2 installs (ncsa alone, not counting the 3 other cluster
> sites).

i stand corrected, 512 installs (2024 total itanium procs) at ncsa.
96 more in argonne, illinois, 192 at the san diego supercomputing
center, and another 96 in pasadena.


2002-12-30 06:39:02

by Ed Sweetman

[permalink] [raw]
Subject: Re: holy grail

Anomalous Force wrote:
> --- Anomalous Force <[email protected]> wrote:
>
>
>>lets talk clusters... the teragrid system being built out of 2024
>>redhat 7.2 installs (ncsa alone, not counting the 3 other cluster
>>sites).
>
>
> i stand corrected, 512 installs (2024 total itanium procs) at ncsa.
> 96 more in argonne, illinois, 192 at the san diego supercomputing
> center, and another 96 in pasadena.

How is avoiding a hardware reboot the holy grail?

Interesting despite the complexity required to implement, but hardly
a very important feature, much less something everyone has been
trying to do.


2002-12-30 12:40:44

by Alan

[permalink] [raw]
Subject: Re: holy grail

On Mon, 2002-12-30 at 01:32, Werner Almesberger wrote:
> Yes, but there are more applications than improving overall uptime.
> E.g. during development or other testing, it would be convenient to
> be able to switch back and forth between distinct kernels, without
> necessarily taking down the entire machine. Likewise for trivial
> hardware changes.

Suspend to disk
Change hardware
Resume

That much sort of works. An Intel guy wrote a very simple piece of
code, which I need to clean up into Linux format, that rescans the
pci bus and generates insert/remove events for any device that took
a walk.
Alan

2002-12-30 17:49:51

by Eric W. Biederman

[permalink] [raw]
Subject: Re: holy grail

Anomalous Force <[email protected]> writes:

> --- David Lang <[email protected]> wrote:
> >
> > I think people are at the point of working on this because it
> > sounds like a worthwhile feature, not because it's actually
> > anything that would be used.
>
> UML sounds like a worthwhile feature; turns out it's actually pretty
> useful too. kexec() is supported in its current incarnation. why
> not simply extend it the one step further?

kexec() still has not quite made it into the kernel yet...
Can we at least finish one piece before starting on the next?

>
> >
> > what possible application needs to be able to do a seamless kernel
> > upgrade that wouldn't be using a network?
>
> "programs will never use more than 640K of memory." - bill gates
>
> let's talk clusters... the teragrid system being built out of 2024
> redhat 7.2 installs (ncsa alone, not counting the 3 other cluster
> sites). imagine a simple system on the network to push a copy of the
> new kernel and then tell each node to hot-swap. 0 downtime.
> __super__ easy to maintain. how easy would that become?

In this case you stagger the reboots, then if you have failover you
get 0 downtime.

> how about
> this... an nfs mount point in the grid for /boot such that each node
> then gets the kernel from a central point and hot swaps when a flag
> is set, or a change is detected in the /boot directory. no push even
> needed then. the cost savings from that alone would be worth the
> effort to them.

KABOOM... you just saturated the network with NFS traffic.

>
> >
> > if it's a batch processing task, it can checkpoint itself and
> > restart
> > after a reboot.
> >
>
> 2024 nodes rebooting, how much time needed while the system is in a
> degraded state?

On MCR (960 nodes at the time) I have rebooted the entire cluster,
including downloading the kernel over the network, in a minute. And
a complete reinstall of all compute nodes in the cluster took about 5
minutes. With a little care, most of the extra management complexity
of a large cluster is due to hardware problems.

There are two very different problems being considered here: high
availability clustering, and high performance clustering. In high
availability clustering you throw hardware at the problem so that
your application continues to run. For high performance computing
you throw even more hardware at the problem so your program runs
fast.

At some point high performance clustering needs the high
reliability techniques, because with enough hardware the failure
rate becomes noticeable. Mean time between failure becomes something
you experience and can easily measure, instead.

Once the hardware has been made as redundant and as reliable as
possible, job check-pointing becomes the only way to run longer
jobs on the system. Given that one MPI job may span the entire
cluster, this is a very interesting problem. Long term, at least,
that is something that needs to be completed.

> hence a queue to catch pending irqs while the system swaps over.

And back to the heart of kexec territory. Here you simply drop
irqs until the system comes back up, and then drivers should poll
their hardware to see what state it is in. Additionally, it is part
of the kexec design to place hardware in a quiescent state while
the kernels are being swapped.

So there should be no special kernel state that needs to be saved
across kernels. Just enough state to recreate the user space
abstractions.

Additionally we need a scalable filesystem for the clusters. Lustre
shows some promise, but it is not done yet. Things like GFS are
o.k., but I believe they rely on all of the disks being on a single
storage area network, which is a bit of a scalability and
reliability problem.

Eric