This posting was sent yesterday but had a forbidden subject line,
borrowed from Jonathan Swift's satirical essay "A Modest Proposal". Here it
is again, with apologies to those on the cc list for the duplicate.
The subject of loading a new kernel without restarting surfaces from
time to time and has come up again on linux-kernel,
http://marc.theaimsgroup.com/?l=linux-kernel&m=105198997207784&w=2
so here are my thoughts on the matter. Comments, alternatives, and
reasons why this can't or shouldn't be done are all welcome, especially
comments about how this could or should be done differently.
Here is yet another road-map for changing the kernel on a machine while
minimizing the disruption to user processes. This method has the
advantage that all of the major pieces here have either been proposed or
are in various stages of development. However, the glue that holds it
all together does not yet exist in any form, as far as I am aware. At
least the major pieces all have merits of their own, regardless of
whether they are used together as described here.
Some disclaimers up front: This may be over-engineered, not possible,
or just a horrible way to accomplish something no one really needs.
This method would only work on a two-way or greater SMP box, and may not
be feasible on 32-bit arches due to the difficulty (or impossibility) of
squeezing more than one kernel into ZONE_NORMAL at the same time,
although it's possible that techniques relevant to using very large
amounts of memory (like page clustering on NUMA) could be adapted to
come to the rescue here. Piece C) may be much easier on 64-bit arches
for that reason, and hopefully 32-bit systems will take the place of
16-bit systems in our vague memories before the decade is out. Better
to plan ahead now.
The major pieces are:
A) Kexec, now in 2.5.68-mm4. Kexec provides a way of Linux loading
Linux. The relieving hot-swap kernel might be given command line
arguments to come up not tweaking/probing any hardware and to not run
init. Information normally gained from hardware probes would be made
available from the still running old kernel (or designated leader in the
case of multiple old kernels on multiple nodes). The new kernel might
be told which CPU or set of CPUs to boot on, or there might be
some way for the appropriate CPU or set of CPUs to be reliably detected.
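To make A) a bit more concrete before moving on, here is a rough
userspace sketch of staging a new image with the kexec_load() system
call. It is only a sketch: the single flat segment, the destination
address, and the handling of the "relieving mode" command line are my
assumptions, and a real loader (kexec-tools) does considerably more
work (purgatory, boot parameters, and so on).

#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Mirrors the definitions in <linux/kexec.h>. */
struct kexec_segment {
        const void *buf;        /* image bytes in our address space */
        size_t bufsz;
        const void *mem;        /* physical destination address */
        size_t memsz;
};
#define KEXEC_ARCH_DEFAULT 0UL

/* Destination physical address: an arbitrary choice for this sketch. */
#define NEW_KERNEL_DEST ((void *)0x100000UL)

int main(int argc, char **argv)
{
        struct kexec_segment seg;
        FILE *f;
        long len;
        void *buf;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <kernel-image>\n", argv[0]);
                return 1;
        }
        f = fopen(argv[1], "rb");
        if (!f) {
                perror("fopen");
                return 1;
        }
        fseek(f, 0, SEEK_END);
        len = ftell(f);
        rewind(f);
        if (len <= 0) {
                fprintf(stderr, "could not size %s\n", argv[1]);
                return 1;
        }
        buf = malloc(len);
        if (!buf || fread(buf, 1, len, f) != (size_t)len) {
                fprintf(stderr, "could not read %s\n", argv[1]);
                return 1;
        }
        fclose(f);

        /* One flat segment; a real loader also stages purgatory code,
         * boot parameters and the command line that would carry the
         * "don't probe, don't run init" options described above. */
        seg.buf = buf;
        seg.bufsz = len;
        seg.mem = NEW_KERNEL_DEST;
        seg.memsz = (len + 4095) & ~4095UL;     /* page-align the target */

        if (syscall(__NR_kexec_load, (unsigned long)NEW_KERNEL_DEST, 1UL,
                    &seg, (unsigned long)KEXEC_ARCH_DEFAULT) != 0) {
                perror("kexec_load");
                return 1;
        }
        printf("new kernel staged; a kexec reboot would jump into it\n");
        return 0;
}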
B) BProc, implemented for 2.4 but not in mainline and not yet ported to
2.5. Beowulf Distributed Process Space (BProc) is described here:
http://bproc.sourceforge.net/ and is used to manage this 1024-node
machine: http://www.lanl.gov/projects/pink/ which is located a few miles
down the road from where I work on much more pedestrian projects. In
addition to managing user processes across machines in a traditional
cluster, perhaps this could be developed to manage processes across
nodes in a CC-cluster (and to transfer the functionality of the Master
BProc Node to another), which brings us to C).
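Before moving on to C), a sketch of what B) looks like from user space.
The bproc_move()/bproc_currnode() names and signatures below are my
reading of the BProc documentation, so treat them as assumptions and
check <sys/bproc.h> on a real BProc system.

#include <stdio.h>

/* Declared here so the sketch stands alone; on a BProc box these come
 * from <sys/bproc.h> and libbproc (names/signatures assumed). */
extern int bproc_currnode(void);
extern int bproc_move(int node);

/* Push the calling process to another node in the cluster-wide process
 * space; on success, execution continues on the target node. */
int migrate_self(int target_node)
{
        fprintf(stderr, "moving from node %d to node %d\n",
                bproc_currnode(), target_node);
        if (bproc_move(target_node) != 0) {
                perror("bproc_move");
                return -1;
        }
        /* Still the same process (and PID) as far as the master's
         * process space is concerned, now running on target_node. */
        return 0;
}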
C) Cache coherent clustering proposed by Larry McVoy, described here:
http://www.bitmover.com/ml/slide01.html and rather long threads
on linux-kernel start here:
http://marc.theaimsgroup.com/?l=linux-kernel&m=100751282125562&w=2 and
http://marc.theaimsgroup.com/?l=linux-kernel&m=100752000911911&w=2
This was proposed as a way to scale Linux to machines with large numbers
of CPUs. With advances in multiple cores per die and more extreme
hyper-threading, Linux may some day have to deal with, for example, a
512-CPU system. Think of a CC-cluster of 32 nodes of 16 CPUs each, with
a separate kernel running on each node. Obviously, there are many very
difficult issues (like how the kernels interact and don't interfere with
one another over i/o buses, etc) to be solved, so this piece is nowhere
near being implemented, at least as far as I know. Many major problems
with this have been pointed out before, so this could turn out to be
infeasible. I hope that is not the case. The degenerate case is a
two-way box with separate kernels on each CPU.
Putting these three pieces together, we could hot-swap the kernel, with
user processes being minimally affected and external connections perhaps
not even noticing.
For the simplest case of a dual-CPU box:
1) One of the CPUs is halted and declared unavailable. The user
processes now have only one CPU on which to run, but this disruption
will be temporary.
2) Using Kexec, the new kernel is booted by the old kernel on the halted
CPU, with command line arguments to come up in a relieving mode, not
probing hardware, not running init, getting necessary hardware
configuration details from the old kernel. The new kernel also has to
come up shoe-horned into the same space as the old kernel in a
CC-cluster mode (this is the acknowledged really hard part again).
3) Once up as a separate and autonomous kernel, the new kernel checks to
see that it is properly configured for the hardware which it has just
been told about and presents its qualifications to the old kernel. If
it passes these tests, the old kernel uses BProc to transfer all user
processes to the new kernel. A human analog for this exists in the
formal transfer of authority between the on-coming and off-going officer
of the deck on a naval vessel:
New kernel: "I am ready to relieve you"
(After assessing the situation. This would include
determining which file systems and drivers were needed, which
modules need to be loaded, etc. and perhaps which daemons
need to be running prior to user process transfer)
Old kernel: "I am ready to be relieved"
(After assessing the relief's ability to take over. Same
as above, but a double check on the new kernel's configuration.
If either of these two steps fails, the kernel-swap is aborted:
the old kernel tells the new kernel to shut down and takes back
control of the halted CPU.)
New kernel: "I relieve you"
(user processes and daemons are now transferred with BProc)
Old kernel: "I stand relieved"
(all user processes are verified to be successfully transferred)
Now the old kernel can exit (or be told to shut down properly).
4) The new kernel assimilates the CPU on which the old kernel was
running (resistance is futile) and marks it as available; user processes
and the new kernel can now be scheduled on it. The hot-swap of kernels
is now complete and for this simple case, the system is back to a
regular single SMP kernel. The CC-cluster configuration was just an
intermediate step.
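To pin down the step 3) handshake a little, here is a toy,
single-process sketch of the exchange. Every name in it (the message
constants, send_msg(), the config check and migration hooks) is an
illustrative stand-in for whatever old/new-kernel channel and BProc glue
would really exist.

#include <stdio.h>

enum relief_msg {
        MSG_READY_TO_RELIEVE,           /* new: "I am ready to relieve you" */
        MSG_READY_TO_BE_RELIEVED,       /* old: "I am ready to be relieved" */
        MSG_I_RELIEVE_YOU,              /* new: "I relieve you" */
        MSG_I_STAND_RELIEVED,           /* old: "I stand relieved" */
        MSG_ABORT                       /* either side: swap aborted */
};

static enum relief_msg channel;         /* stand-in for a real old<->new channel */
static void send_msg(enum relief_msg m) { channel = m; }
static enum relief_msg last_msg(void) { return channel; }

/* Stubs for the real checks: filesystems, drivers, modules, daemons on
 * the new side, then BProc moving every user process. */
static int new_kernel_config_ok(void) { return 1; }
static int migrate_all_processes(void) { return 0; }

int main(void)
{
        /* New kernel, after assessing the hardware it was told about. */
        send_msg(MSG_READY_TO_RELIEVE);

        /* Old kernel double-checks the relief before agreeing. */
        if (last_msg() != MSG_READY_TO_RELIEVE || !new_kernel_config_ok()) {
                send_msg(MSG_ABORT);    /* take back the halted CPU */
                return 1;
        }
        send_msg(MSG_READY_TO_BE_RELIEVED);

        /* New kernel: "I relieve you" -- user processes move via BProc. */
        send_msg(MSG_I_RELIEVE_YOU);
        if (migrate_all_processes() != 0) {
                send_msg(MSG_ABORT);
                return 1;
        }

        /* Old kernel verifies every process arrived and stands relieved;
         * it can now exit and give up its CPU. */
        send_msg(MSG_I_STAND_RELIEVED);
        printf("handshake complete: old kernel relieved\n");
        return 0;
}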
Since the new kernel doesn't have to probe any hardware, this hot-swap
could in principle be very fast, so the time that the system's
capability is degraded could be very short, on the order of a few
seconds or less. For a 2-CPU system, this temporary degradation would
be at least 50%. For the extreme case of the 512 CPU system with 32
nodes, the new kernels could be brought up one node at a time, so
the degradation might be as little as 1/32, or about 3%.
The interfaces for doing all this should probably remain stable during a
major release cycle, so that any properly configured 2.8.x kernel would
be able to hot-swap with any other properly configured 2.8.x kernel.
The earliest that this could be done is 2.7.something, but perhaps even
later, judging from all the problems identified in the cc/smp cluster
threads.
Perhaps a brief note about why a kernel hot-swap is even desired might
be in order here. As systems become more and more complex (and
therefore important), their boot times seem to increase. My experience
with production systems is that taking them down for even a short time
can be hard to schedule. Rebooting for a needed upgrade with even a
well-tested vendor kernel is sometimes a hard sell. Yes, I know these
are issues orthogonal to this discussion, but the ability to install a
new kernel with almost no disruption could be worth it for some
customers.
Simpler and easier strategies for almost accomplishing a hot-swap
involving user-process check-pointing have been suggested, but those
involve briefly halting the whole system. If the gain of a
no-halt hot-swap is worth the considerable pain, then perhaps this
road-map is worth investigating further.
If you made it this far, thanks in advance for reading this to the end.
Steven
You may want to read this too:
http://marc.theaimsgroup.com/?l=linux-kernel&m=102734704625524&w=2
Karim
===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================
So, to summarize:
1) Run multiple kernels (minimally kernels A and B)
2) Migrate processes from kernel A to kernel B
3) Use kexec to replace kernel A once all processes have left.
4) Repeat for all other kernels.
On two simple machines working in tandem (the most common variation
used for high availability), this should be easy to do. And it is
preferable to a reboot because of the additional control and speed.
Thank you for the perspective. This looks like a line I can
sell to get some official time to work on kexec and its friends
more actively.
From what I have seen, process migration/process check-pointing is
currently the very rough area.
The interesting thing becomes how do you measure system uptime.
Eric
On Mon, 2003-05-05 at 11:34, Eric W. Biederman wrote:
> So, to summarize:
> 1) Run multiple kernels (minimally kernels A and B)
> 2) Migrate processes from kernel A to kernel B
> 3) Use kexec to replace kernel A once all processes have left.
> 4) Repeat for all other kernels.
>
> On two simple machines working in tandem (the most common variation
> used for high availability), this should be easy to do. And it is
> preferable to a reboot because of the additional control and speed.
>
> Thank you for the perspective. This looks like a line I can
> sell to get some official time to work on kexec and its friends
> more actively.
Cutting boot time in half is pretty good as it is right now.
>
> From what I have seen, process migration/process check-pointing is
> currently the very rough area.
>
> The interesting thing becomes how do you measure system uptime.
>
> Eric
Perhaps two uptimes could be kept. The current concept of uptime would
remain as is, analogous to the reign of a king (the current kernel), and
a new integrated uptime would be analogous to the life of a dynasty. The
dynasty uptime would be one of the many things the new kernel learned
about on booting. This new dynasty uptime could become quite long if
everything keeps on ticking.
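A tiny sketch of what the handed-over record might look like; the
structure, field names, and handover mechanism are all hypothetical.

#include <stdio.h>
#include <time.h>

struct dynasty_record {
        time_t dynasty_start;   /* wall-clock time the first kernel booted */
        unsigned int reigns;    /* how many kernels have held the throne */
};

/* On handover the relief keeps dynasty_start, bumps the reign count and
 * restarts its own per-reign uptime clock as usual. */
static void take_throne(struct dynasty_record *d, time_t now)
{
        if (d->dynasty_start == 0)
                d->dynasty_start = now;         /* first boot founds the dynasty */
        d->reigns++;
}

int main(void)
{
        struct dynasty_record d = { 0, 0 };
        time_t now = time(NULL);

        take_throne(&d, now);           /* original kernel boots */
        take_throne(&d, now + 3600);    /* hot-swap one hour later */

        printf("reign uptime resets at each swap; dynasty uptime is now "
               "%ld s across %u reigns\n",
               (long)(now + 3600 - d.dynasty_start), d.reigns);
        return 0;
}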
Steven
On Mon, 05 May 2003 12:00:15 MDT, Steven Cole said:
> Perhaps two uptimes could be kept. The current concept of uptime would
> remain as is, analogous to the reign of a king (the current kernel), and
> a new integrated uptime would be analogous to the life of a dynasty. The
> dynasty uptime would be one of the many things the new kernel learned
> about on booting. This new dynasty uptime could become quite long if
> everything keeps on ticking.
Make sure you handle the case of a dynasty that starts on a 2.7.13 kernel
and is finally deposed by a power failure in 2.7.39.
Eric W. Biederman wrote:
> The interesting thing becomes how do you measure system uptime.
In telecom at least, as long as the service which you are providing is
available, you're "up". The assumption is that you're "always" up, with brief
(hopefully) interruptions for faults or upgrades.
Because of this, it may turn out that measuring service downtime is more
meaningful than system uptime.
Chris
--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]
On Mon, 2003-05-05 at 12:17, [email protected] wrote:
> On Mon, 05 May 2003 12:00:15 MDT, Steven Cole said:
>
> > Perhaps two uptimes could be kept. The current concept of uptime would
> > remain as is, analogous to the reign of a king (the current kernel), and
> > a new integrated uptime would be analogous to the life of a dynasty. The
> > dynasty uptime would be one of the many things the new kernel learned
> > about on booting. This new dynasty uptime could become quite long if
> > everything keeps on ticking.
>
> Make sure you handle the case of a dynasty that starts on a 2.7.13 kernel
> and is finally deposed by a power failure in 2.7.39.
>
2.7.13 eh? Wow, that's optimistic. I guess Karim and others better get
busy. Unless Linus throws in about 50 kernels with the -preX naming
scheme like this last time. ;)
Here's a nice long uptime:
tstad% uptime
12:58pm up 503 days, 1:30, 3 users, load average: 0.23, 0.04, 0.00
tstad% uname -a
ULTRIX tstad 4.3 1 RISC
I guess Ultrix didn't have a jiffie wraparound problem at 497 days.
That DEC 5000/200 has run almost continuously for 12 years, except for
the occasional palace revolution/forest fire fiasco.
Steven
On Mon, 5 May 2003, Steven Cole wrote:
> On Mon, 2003-05-05 at 12:17, [email protected] wrote:
> > On Mon, 05 May 2003 12:00:15 MDT, Steven Cole said:
> >
> > > Perhaps two uptimes could be kept. The current concept of uptime would
> > > remain as is, analogous to the reign of a king (the current kernel), and
> > > a new integrated uptime would be analogous to the life of a dynasty. The
> > > dynasty uptime would be one of the many things the new kernel learned
> > > about on booting. This new dynasty uptime could become quite long if
> > > everything keeps on ticking.
> >
> > Make sure you handle the case of a dynasty that starts on a 2.7.13 kernel
> > and is finally deposed by a power failure in 2.7.39.
> >
> 2.7.13 eh? Wow, that's optimistic. I guess Karim and others better get
> busy. Unless Linus throws in about 50 kernels with the -preX naming
> scheme like this last time. ;)
>
> Here's a nice long uptime:
>
> tstad% uptime
> 12:58pm up 503 days, 1:30, 3 users, load average: 0.23, 0.04, 0.00
> tstad% uname -a
> ULTRIX tstad 4.3 1 RISC
>
> I guess Ultrix didn't have a jiffie wraparound problem at 497 days.
> That DEC 5000/200 has run almost continuously for 12 years, except for
> the occasional palace revolution/forest fire fiasco.
>
> Steven
VAXen, including those running Ultrix, start a clock at zero when booted.
They set boottime from the hardware clock with the hard-to-find batteries
behind the rear door, or under the board in the VAXstation 3000. So you
don't have a time that started in 1970 like other Unix machines, although
a conversion takes place when you actually read the time.
Raw time is kept in a quadword, in microfortnights: a fortnight is 14 days
= 336 hours = 1,209,600 seconds, so a microfortnight is 1,209,600 /
1,000,000, or about 1.21 seconds.
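Worked out explicitly (just a sanity check, nothing from the VMS sources):

#include <stdio.h>

int main(void)
{
        double fortnight_s = 14.0 * 24 * 60 * 60;       /* 1,209,600 s */

        /* One millionth of a fortnight: ~1.2096 s. */
        printf("one microfortnight = %.4f s\n", fortnight_s / 1e6);
        return 0;
}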
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
On Mon, 2003-05-05 at 11:34, Eric W. Biederman wrote:
> So, to summarize:
> 1) Run multiple kernels (minimally kernels A and B)
> 2) Migrate processes from kernel A to kernel B
> 3) Use kexec to replace kernel A once all processes have left.
> 4) Repeat for all other kernels.
Just a small correction to the summary: I was not assuming that
multiple kernels are running at the beginning. So the summary is more
like:
1) Make hardware available and use kexec to boot kernel B.
2) Migrate processes from kernel A to kernel B.
3) Once all processes have left kernel A, kernel B takes over A's turf,
maybe with a really big kfree().
4) The end state is the same as the beginning, but with a new kernel.
For a machine already partitioned into clusters, your original summary
is correct.
> On two simple machines working in tandem (the most common variation
> used for high availability), this should be easy to do. And it is
> preferable to a reboot because of the additional control and speed.
Doing this on separate machines would be a good warm-up to doing it on
one machine partitioned into CC clusters.
Here is a direct link to Karim's paper on CC clusters (html version):
http://www.opersys.com/adeos/practical-smp-clusters/
I had forgotten about the nanokernel approach when I made my original
post. If doing it this way is not everyone's cup of tea, there could be
other solutions.
Also, I've been assuming that processes to be moved would have their
dirty pages written out prior to migration. This should eliminate the
need for moving a lot of data structures around, which would probably be
difficult anyway since those structures could arbitrarily change from
one kernel version to the next.
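As a sketch of that pre-migration flush for a single descriptor and a
single mapping (the function name is made up, a real pass would walk
everything the process has open and mapped, and the actual move is left
to BProc):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int flush_before_migration(int fd, void *map, size_t maplen)
{
        /* Push dirty file-backed pages to storage so the target kernel
         * can re-fault them from disk instead of receiving page data. */
        if (map && msync(map, maplen, MS_SYNC) != 0) {
                perror("msync");
                return -1;
        }
        if (fsync(fd) != 0) {           /* and flush buffered file data */
                perror("fsync");
                return -1;
        }
        return 0;
}

int main(void)
{
        int fd = open("/tmp/migrate-demo", O_RDWR | O_CREAT, 0600);
        char *map;

        if (fd < 0 || ftruncate(fd, 4096) != 0)
                return 1;
        map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
                return 1;
        map[0] = 'x';                   /* dirty a file-backed page */
        if (flush_before_migration(fd, map, 4096) != 0)
                return 1;
        /* ...hand the process to the target node here (e.g. via BProc)... */
        munmap(map, 4096);
        close(fd);
        return 0;
}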
Steven
Assuming this would work, is there any reason it should not be doable
on an HT uniprocessor rather than a "real" SMP box?
Steve
"Stephen M. Kenton" <[email protected]> writes:
> Assuming this would work, is there any reason it should not be doable
> on an HT uniprocessor rather than a "real" SMP box?
No, but currently we are a significant way from both process
migration and running multiple kernels on one box.
Eric