2002-06-18 21:21:57

by zaimi

Subject: kernel upgrade on the fly

Hi all,

has anybody worked on, or thought about, a feature to upgrade the kernel
while the system is running? I.e. with all processes waiting in their
queues while the resident older kernel gets replaced by a newer one.

I can see the advantage of such a thing: a server could have its kernel
upgraded (a major or minor upgrade) without disrupting the ongoing services
(ok, maybe a few seconds' delay). Another use would be to
switch between different kernels in the /boot/ directory (for testing
purposes, etc.) without rebooting the machine.

A web search turned up no information related to the above, so I
don't know if this issue has been raised before.

Does anybody else think this would be an interesting feature for
the Linux kernel, or care to comment on the idea?

Cheers,

Adi Zaimi
Rutgers University


2002-06-18 21:30:11

by Russell King

Subject: Re: kernel upgrade on the fly

On Tue, Jun 18, 2002 at 05:21:49PM -0400, [email protected] wrote:
> has anybody worked or thought about a property to upgrade the kernel
> while the system is running? ie. with all processes waiting in their
> queues while the resident-older kernel gets replaced by a newer one.

This has been discussed over and over and over and over and over and over
and over and over and over and over and over and over and over and over
here; typically it comes up about once every six months. Please see
the FAQ: http://www.tux.org/lkml or alternatively search the lkml
archives.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-06-19 01:35:47

by Rob Landley

Subject: Re: kernel upgrade on the fly

On Tuesday 18 June 2002 05:21 pm, [email protected] wrote:
> Hi all,
>
> has anybody worked or thought about a property to upgrade the kernel
> while the system is running? ie. with all processes waiting in their
> queues while the resident-older kernel gets replaced by a newer one.

Thought about, yes. At length. That's why it hasn't been done. :)

The closest you'll get at the moment is some variant of Two Kernel Monte, i.e. a
reboot into a new kernel with all processes offed, but at least without
involving the BIOS.

The new swsusp infrastructure from Pavel Machek theoretically lets you freeze
the state of your system to disk, so we're a heck of a lot farther ahead than
we were. If you want to re-open this can of worms, the only way to go is to
start with some combination of these two projects:

http://falcon.sch.bme.hu/~seasons/linux/swsusp.html
http://sourceforge.net/projects/monte/

That said, the fundamental problem is that when you change kernels, run-time
state structures change. Parsing your run-time state from oldvers to feed
into newvers can't really be done automatically because your tool wouldn't
know what any of the changes MEAN, so you would probably have to write a
custom frozen process converter, which would be a pain and a half to debug,
to say the least. (And by the time you've got that even half debugged you
need to do it for the NEXT kernel...)

Of course software suspend theoretically deals with at least some of the
device driver issues, so there's a certain amount of handwaving you can do on
that end. And migrating hot network connections is something people have in
fact done before, although you'll have to ask around about who. (Ask the
security nuts, they consider it a bad thing. :)

Nothing is impossible for anyone impervious to reason, and you might surprise
us (it'd make a heck of a graduate project). Hot migration isn't IMPOSSIBLE,
it's just a flipping pain in the ass. But the issue's a bit threadbare in
these parts (somewhere between "are we there yet mommy?" and "can I buy a
pony?"). Try the swsusp mailing list, they might be willing to humor you...

(And the people most likely to WANT this feature ("this system never goes
down" types) are also the least likely to want to deal with subtle bugs from
a bad conversion that don't manifest until a week after the new system comes
up, when cron goes nuts at 3 am. Of course, whether hot migration is more
dangerous to your data than the interaction between Andre's and Martin's
egos in the ATAPI layer is an open question... :) Ahem. Right...)

The SANE answer always has been to just schedule some down time for the box.
The insane answer involves giving an awful lot of money to Sun or IBM or some
such for hot-pluggable backplanes. (How do you swap out THE BACKPLANE?
That's a question nobody seems to have an answer for...)

Clusters. Migrating tasks within a cluster is a potentially similar problem. Look
at mosix and the NUMA stuff as well, if you're actually serious about this.
You have to reduce a process to its vital data, once all the resources you
can peel away from it have been peeled away, swapped out, freed, etc. If you
can suspend and save an individual running process to a disk image (just a
file in the filesystem), in such a way that it can be individually re-loaded
later (by the same kernel), you're halfway there. No, it's not as easy as it
sounds. :)

> I can see the advantage of such a thing when a server can have the kernel
> upgraded (major or minor upgrade) without disrupting the ongoing services
> (ok, maybe a small few-seconds delay). Another instance would be to
> switch between different kernels in the /boot/ directory (for testing
> purposes, etc.) without rebooting the machine.

See "belling the cat". Yeah, it's a great idea. The implementation's the
tricky bit.

> Would anybody else think this to be an interesting property to have for
> the linux kernel or care to comment on this idea?
>
> Cheers,
>
> Adi Zaimi
> Rutgers University

Don't you guys have professors you can ask about this sort of thing? (Or are
you going to the Camden campus, says the alumnus who survived the first year
of Whitman's budget cuts...)

Rob

2002-06-19 05:18:23

by Michael S. Zick

Subject: Re: kernel upgrade on the fly

On Tuesday 18 June 2002 02:37 pm, Rob Landley wrote:
> On Tuesday 18 June 2002 05:21 pm, [email protected] wrote:
> > Hi all,
> >
> > has anybody worked or thought about a property to upgrade the kernel
> > while the system is running? ie. with all processes waiting in their
> > queues while the resident-older kernel gets replaced by a newer one.
>
> > Would anybody else think this to be an interesting property to have for
> > the linux kernel or care to comment on this idea?
> >
Sure,
I know two industries that do such a thing (almost):
Spacecraft and the Telephone Company (any/all);

I did say almost...
I'll speak of the telephone industry, because I am more familiar with it...
There they use two (or more) machines, running near the same program...
The one connected to the outside world of hardware duplicates each
event in a message sent to the second...
The second, instead of listening to the outside world, listens to the
messages, duplicating all of the program logic except the hardware I/O.
The memory data structures are identical between the two.
When disaster happens...
Machine two rolls out its listening modules, rolls in the I/O modules, and
sends a signal to the hardware bus switch to give it the system bus.
Then the fun begins...
Recover the hardware (or at least the billing information).
Note the three points above:
1) Near identical programs
2) Identical data structures
3) Two sets of CPU hardware

Switching from linux-2.4.x to linux-2.6.x doesn't qualify;

The person who asked this question wants to do it on
a single machine -- the price just went way up...

Linux uses internal data structures when and wherever
they are needed. Updating them all to be consistent
would be a real b....
Probably you would have to start from scratch and
rebuild them...

Hmm, I think I just said "reboot" the machine with
the new kernel.

Mike

2002-06-19 17:23:03

by John Alvord

Subject: Re: kernel upgrade on the fly

On Tue, 18 Jun 2002 15:37:23 -0400, Rob Landley
<[email protected]> wrote:

>On Tuesday 18 June 2002 05:21 pm, [email protected] wrote:
>> Hi all,
>>
>> has anybody worked or thought about a property to upgrade the kernel
>> while the system is running? ie. with all processes waiting in their
>> queues while the resident-older kernel gets replaced by a newer one.
>
>Thought about, yes. At length. That's why it hasn't been done. :)

IMO the biggest reason it hasn't been done is the existence of
loadable modules. Most driver-type development work can be tested
without rebooting.

john

2002-06-19 22:54:27

by Rob Landley

Subject: Re: kernel upgrade on the fly

On Wednesday 19 June 2002 01:22 pm, John Alvord wrote:
> On Tue, 18 Jun 2002 15:37:23 -0400, Rob Landley
>
> <[email protected]> wrote:
> >On Tuesday 18 June 2002 05:21 pm, [email protected] wrote:
> >> Hi all,
> >>
> >> has anybody worked or thought about a property to upgrade the kernel
> >> while the system is running? ie. with all processes waiting in their
> >> queues while the resident-older kernel gets replaced by a newer one.
> >
> >Thought about, yes. At length. That's why it hasn't been done. :)
>
> IMO the biggest reason it hasn't been done is the existence of
> loadable modules. Most driver-type development work can be tested
> without rebooting.

That's part of it, sure. (And I'm sure the software suspend work is
leveraging the ability to unload modules.)

There's a dependency tree: processes need resources like mounted filesystems
and open file handles to the network stack and such, and you can't unmount
filesystems and unload devices while they're in use. Taking a running system
apart and keeping track of the pieces needed to put it back together again is
a bit of a challenge.

The software suspend work can't freeze processes individually to separate
files (that I know of), but I've heard blue-sky talk about potentially adding
it. (Dunno what the actual plans are, Pavel Machek probably would). If
processes could be frozen in a somewhat kernel-independent way (so that their
run-time state was parsed in again in a known format and flung into any
functioning kernel), then upgrading to a new kernel would just be a question
of suspending all the processes you care about preserving, doing a two kernel
monte, and restoring the processes. Migrating a process from one machine to
another in a network cluster would be possible too.

I'm sure it's not as easy as it sounds, but looking at the software suspend
work would be a necessary first step. They are, at least, serializing
processes to disk and bringing them back afterwards. I'm fairly certain it's
happening the way Microsoft Word saves *.doc files (block write the run-time
structures to disk and block read them back in verbatim later, and hope all
your compiler alignment offsets and such match if there's any version skew).
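
If the comparison isn't familiar, that scheme amounts to something like the
following toy (struct and fields invented for illustration, nothing from any
real codebase):

/* Sketch: "Word-style" serialization: block-write the live structure,
 * block-read it back verbatim.  Fast and dumb: any change to layout,
 * alignment, or compiler padding silently breaks old files. */
#include <stdio.h>
#include <stdint.h>

struct runtime_state {
    uint32_t version;      /* the only guard against version skew */
    int32_t  counter;
    char     name[32];
};

static int save(const char *path, const struct runtime_state *s)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    fwrite(s, sizeof(*s), 1, f);    /* raw bytes, padding and all */
    fclose(f);
    return 0;
}

static int load(const char *path, struct runtime_state *s)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    if (fread(s, sizeof(*s), 1, f) != 1 || s->version != 1) {
        fclose(f);
        return -1;                  /* truncated, or version skew */
    }
    fclose(f);
    return 0;
}

int main(void)
{
    struct runtime_state out = { 1, 42, "demo" }, in;

    if (save("state.bin", &out) == 0 && load("state.bin", &in) == 0)
        printf("restored counter=%d name=%s\n", in.counter, in.name);
    return 0;
}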

Then again, the StarOffice people reverse engineered that and made it
(mostly) work without even having access to the source code... :)

Hmmm, what would be involved in serializing a process to disk? Obviously you
start by sending it a suspend signal. There's the process stuff, of course.
(Priority, etc.) That's not too bad. You'd need to record all the memory
mappings (not just the contents of the physical and swapped out memory
mappings (which should be saved to the serializing file), but also the memory
protection states and memory mapped file ranges and such, so you can map it
all back in at the appropriate location later). I'd bug whoever did the
recent shared page table work (Daniel Phillips?) for information about what
that really MEANS.
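
The userspace half of that inventory is at least visible today. A minimal
sketch, assuming Linux's /proc layout (the /proc/PID/maps text format is
stable enough to parse naively), that walks a process's mappings and prints
the ranges, protections, and backing files a checkpointer would have to
record:

/* Sketch: enumerate the memory mappings of a process via /proc/PID/maps.
 * A real checkpointer would also dump the pages themselves (e.g. via
 * /proc/PID/mem) and record the offsets of file-backed mappings. */
#include <stdio.h>

int main(int argc, char **argv)
{
    char path[64], line[512];
    FILE *maps;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/maps", argv[1]);
    maps = fopen(path, "r");
    if (!maps) {
        perror(path);
        return 1;
    }
    while (fgets(line, sizeof(line), maps)) {
        unsigned long start, end, offset;
        char perms[5], file[256] = "";

        /* Line format: start-end perms offset dev inode [path] */
        if (sscanf(line, "%lx-%lx %4s %lx %*s %*s %255s",
                   &start, &end, perms, &offset, file) >= 4)
            printf("%#lx-%#lx %s offset=%#lx %s\n",
                   start, end, perms, offset, file);
    }
    fclose(maps);
    return 0;
}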

You'd need to record all the open file handles, of course. (For actual files
this includes position in file, corresponding locks, etc. For the zillions
of things that just LOOK like files, pipes and sockets and character and
block devices, expect special case code).
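
For the plain-file cases the fd table is visible from userspace too; a sketch
along these lines lists what would have to be chased down (one caveat: the
file *position* isn't in the symlink; /proc/PID/fdinfo exposes it, but only
on kernels newer than the ones discussed here, so treat that part as an
assumption):

/* Sketch: list a process's open file descriptors via /proc/PID/fd.
 * Each entry is a symlink to the open file; pipes and sockets show up
 * as "pipe:[inode]" and "socket:[inode]", which is where the special
 * case code begins. */
#include <stdio.h>
#include <dirent.h>
#include <unistd.h>
#include <limits.h>

int main(int argc, char **argv)
{
    char dirpath[64], linkpath[PATH_MAX], target[PATH_MAX];
    struct dirent *ent;
    DIR *dir;
    ssize_t len;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(dirpath, sizeof(dirpath), "/proc/%s/fd", argv[1]);
    dir = opendir(dirpath);
    if (!dir) {
        perror(dirpath);
        return 1;
    }
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;                      /* skip "." and ".." */
        snprintf(linkpath, sizeof(linkpath), "%s/%s", dirpath, ent->d_name);
        len = readlink(linkpath, target, sizeof(target) - 1);
        if (len < 0)
            continue;
        target[len] = '\0';
        printf("fd %s -> %s\n", ent->d_name, target);
    }
    closedir(dir);
    return 0;
}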

Pipes bring up a fun point: you can't always serialize just one process.
Sometimes they clump together, and if you kill one, more go down with it.
Thread groups are easy to spot, as are parent/child relationships that
share memory maps and file handles and such, but even just a simple "cat blah
| less" means there are two processes connected by a pipe which pretty much
need to be serialized together. (A common real-world case is that one of
those processes is going to be the X11 server, which brings up a WORLD of fun.
For a 1.00 release it's an obvious "Don't Do That Then", and later on it might
have special case behavior.)
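
Spotting those clumps is at least mechanical: two processes whose fd symlinks
name the same pipe inode are joined at the hip. A rough sketch under the same
/proc assumptions as above:

/* Sketch: find every process holding a given pipe inode open, by
 * scanning each /proc/<pid>/fd and matching "pipe:[inode]" symlink
 * targets.  Processes sharing a pipe have to be checkpointed as a
 * group, or not at all. */
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <dirent.h>
#include <unistd.h>
#include <limits.h>

int main(int argc, char **argv)
{
    char want[64], path[PATH_MAX], target[PATH_MAX];
    struct dirent *p, *f;
    DIR *proc, *fds;
    ssize_t len;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pipe-inode>\n", argv[0]);
        return 1;
    }
    snprintf(want, sizeof(want), "pipe:[%s]", argv[1]);

    proc = opendir("/proc");
    if (!proc) {
        perror("/proc");
        return 1;
    }
    while ((p = readdir(proc)) != NULL) {
        if (!isdigit((unsigned char)p->d_name[0]))
            continue;                      /* only PID directories */
        snprintf(path, sizeof(path), "/proc/%s/fd", p->d_name);
        fds = opendir(path);
        if (!fds)
            continue;                      /* gone, or not our process */
        while ((f = readdir(fds)) != NULL) {
            if (f->d_name[0] == '.')
                continue;
            snprintf(path, sizeof(path), "/proc/%s/fd/%s",
                     p->d_name, f->d_name);
            len = readlink(path, target, sizeof(target) - 1);
            if (len < 0)
                continue;
            target[len] = '\0';
            if (strcmp(target, want) == 0)
                printf("pid %s holds fd %s on %s\n",
                       p->d_name, f->d_name, want);
        }
        closedir(fds);
    }
    closedir(proc);
    return 0;
}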

If an actual file handle is open to an otherwise unlinked file, you need to
either make a link to that file somewhere (not too hard, that info is already
in /proc/###/fd) or maybe cache the contents of the file as part of the
serialized image...
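
Caching the contents is actually the easy option, because opening the /proc
symlink hands back a fresh descriptor on the deleted inode. A sketch:

/* Sketch: salvage an otherwise-unlinked file that some process still
 * holds open, by copying from /proc/PID/fd/N into a new file. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char path[64], buf[4096];
    int in, out;
    ssize_t n;

    if (argc != 4) {
        fprintf(stderr, "usage: %s <pid> <fd> <outfile>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/fd/%s", argv[1], argv[2]);
    in = open(path, O_RDONLY);    /* fresh handle on the deleted inode */
    if (in < 0) {
        perror(path);
        return 1;
    }
    out = open(argv[3], O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out < 0) {
        perror(argv[3]);
        return 1;
    }
    while ((n = read(in, buf, sizeof(buf))) > 0)
        if (write(out, buf, n) != n) {
            perror("write");
            return 1;
        }
    close(in);
    close(out);
    return 0;
}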

Which brings up the whole question of how portable a serialized program image
should be. Forget swapping kernels, I mean running the system for a while
before resuming the "frozen" executable. Rename a couple of files and the
resume is going to get confused. You kind of have to restore to the exact
same system you left off at, because if you have an open file handle to a file
or device driver that isn't there on the resumed system, you basically have
some variant of a "broken pipe" scenario. (Then again, forced unmount of
filesystems can sort of give you this problem anyway, so infrastructure to
deal with it is going to have to be faced at some point...)

For rebooting a running system with the same mounted partitions and hopefully
the same set of device drivers, this isn't really any worse than software
suspend. And detecting a missing file and having the resume fail with an
error would be pretty easy. It'd also be pretty darn easy to trigger, but that's
the user's problem...

What other resources attach to a process? The process info itself (user ID,
capabilities), memory mappings, file handles... Bound sockets... Signal
handlers and masks... I/O port mappings and such if you're running as root...
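
Collected in one place, the header of a hypothetical checkpoint file might
start out something like this (every field below is invented for
illustration; it's an inventory, not any existing format):

/* Sketch of a hypothetical on-disk checkpoint header: the per-process
 * state a freeze/thaw tool would have to capture.  Variable-length
 * sections (memory contents, fd records, handler table) would follow,
 * located via the counts and the section table offset below. */
#include <stdint.h>

#define CKPT_MAGIC 0x434b5054u     /* "CKPT" */

struct ckpt_header {
    uint32_t magic;                /* CKPT_MAGIC */
    uint32_t format_version;       /* bump on any layout change */

    /* process identity */
    uint32_t pid, ppid;
    uint32_t uid, gid;
    uint32_t capabilities;         /* capability bits held */
    int32_t  priority;             /* scheduling priority / nice */

    /* signal state */
    uint64_t sig_blocked_mask;     /* blocked-signal mask */
    uint32_t n_sig_handlers;       /* entries in the handler section */

    /* resources, as counts of records in later sections */
    uint32_t n_memory_maps;        /* range + protection + backing file */
    uint32_t n_open_fds;           /* path, offset, lock records */
    uint32_t n_bound_sockets;
    uint32_t n_ioport_ranges;      /* nonzero only if it ran as root */

    uint64_t section_table_offset; /* where the section index lives */
};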

It's not an unsolvable problem, but it IS a can of worms. Just plain
reparenting a process turned out to be complicated enough that they made
reparent_to_init() (see kernel/sched.c).

> john

Rob

2002-06-20 20:20:08

by zaimi

Subject: Re: kernel upgrade on the fly

Thanks for the responses, especially Rob. I was trying to find previous
threads about this and could not find them. Agreed, swsusp is a step
toward that goal; the way that memory is saved, though, may not necessarily
make this easier, at least in the current state of swsusp.

As you were mentioning, the process information needs
to be summarised and saved in such a way that the new kernel can pick it up
and construct its own queues of processes, independent of the differences
between the kernels being swapped.

Well, this does touch on the idea of migrating processes from one
machine to others in a network. In fact, I don't understand why it is so
hard to reparent a process. If a process can be reparented within a machine, then
it can migrate to other machines as well, no?

Rob, I am going to the Newark campus FYI, and have interests in some AI
stuff.
Thanks again,

Adi

2002-06-20 20:40:57

by Jesse Pollard

Subject: Re: kernel upgrade on the fly

[email protected]:
>
> Thanks for the responses especially Rob. I was trying to find previous
> threads about this and could not find them. Agreed, swsusp is a step
> further to that goal; the way that memory is saved though may not make it
> necessarily easier, at least in the current state of swsusp.
>
> As you were mentioning, the processes information needs
> to be summarised and saved in such a way that the new kernel can pick up
> and construct its own queues of processes independent on the differences
> between the kernels being swapped.
>
> Well, this does touch the idea of having migrating processes from one
> machine to others in a network. In fact, I don't understand why is it so
> hard to reparent a process. If it can be reparented within a machine, then
> it can migrate to other machines as well, no?

No.

Reparenting a process only changes the identity of the parent reference of a
process.

Migrating to another machine has to handle (at a minimum):

1. open files - file id, file pointer values must be moved.
2. network connections must be redirected, existing queues must be transferred
3. shared memory segment references must be transferred, possibly even the
file referenced by mmap operations (see item 5)
4. semaphores must be transferred
5. disk files may have to be transferred (currently open files in /tmp ?)
6. pipes to other processes must be re-established, as must the current contents
of any pipe buffers, and even the other process(es) attached to the pipe
7. process kernel stack must be preserved (current syscall activity?) and
process control block state
8. the current process data must be transferred (memory image, shared
library references)
9. recreating the same/equivalent process context (pid, ppid, uid, gid, and
all the kernel setup may/will have to be transferred)

A lot of things are NOT mentioned (what about the active buffer cache for open
files... shared file access...)
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2002-06-21 19:41:11

by Rob Landley

Subject: Re: kernel upgrade on the fly

On Thursday 20 June 2002 04:19 pm, [email protected] wrote:
> Thanks for the responses especially Rob. I was trying to find previous
> threads about this and could not find them. Agreed, swsusp is a step
> further to that goal; the way that memory is saved though may not make it
> necessarily easier, at least in the current state of swsusp.

Several people have mentioned process migration in clusters. Jesse Pollard
says he expects to see checkpointing of arbitrary user processes working this
fall, and then Nick LeRoy replied to him about the Condor project, which
apparently does something similar in user space...

http://www.uwsg.iu.edu/hypermail/linux/kernel/0206.2/1017.html

http://www.cs.wisc.edu/condor/

You might also want to look at the crash dump code (and the multithreaded
crash dump patch floating around in the 2.5 to-do list) as another starting
point, since A) it's flushing user info for a single process into a file in a
well-known format, and B) such a file can already be loaded back in and at least
somewhat resumed by the GNU Debugger (gdb).

> As you were mentioning, the processes information needs
> to be summarised and saved in such a way that the new kernel can pick up
> and construct its own queues of processes independent on the differences
> between the kernels being swapped.

Which isn't impossible; I remember migrating WWIV message base files from
version to version a dozen years ago. Good old brute force did the job:
new->field = old->field; There's almost certainly a more elegant way to do it,
but brute force has the advantage that we know it could be made to work...
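
In that brute-force spirit, a converter for one hypothetical saved structure
might look like the sketch below (both layouts and every field name are
invented; the hard part, as noted above, is a human knowing which fields
changed meaning):

/* Sketch: brute-force conversion of a saved per-process record from
 * one version's layout to the next.  Every field is copied explicitly,
 * so the fields whose meaning changed get a human decision. */
#include <stdint.h>
#include <string.h>

/* Hypothetical old layout. */
struct task_state_v1 {
    uint32_t pid;
    int32_t  nice;
    uint32_t flags;          /* bit 0: uninterruptible sleep */
};

/* Hypothetical new layout: a flag got split out, a field got added. */
struct task_state_v2 {
    uint32_t pid;
    int32_t  nice;
    uint8_t  sleep_state;    /* was flags bit 0 */
    uint32_t new_feature;    /* didn't exist in v1 */
};

void convert_v1_to_v2(const struct task_state_v1 *old,
                      struct task_state_v2 *new)
{
    memset(new, 0, sizeof(*new));
    new->pid  = old->pid;                 /* unchanged: straight copy */
    new->nice = old->nice;
    new->sleep_state = (old->flags & 1);  /* changed meaning: human call */
    new->new_feature = 0;                 /* brand new: sane default */
}

int main(void)
{
    struct task_state_v1 old = { 123, -5, 1 };
    struct task_state_v2 new;

    convert_v1_to_v2(&old, &new);
    return new.sleep_state == 1 ? 0 : 1;
}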

As far as maintaining a "convert 2.4.36->2.4.37" executable goes (to be
released with each kernel version), the fact that there's a patch file to take the
kernel's source from version to version should help a LOT with figuring out
what structures got touched and what exactly needs to be converted. It still
needs a human maintainer, though. It's also bound to lag the kernel releases
a bit, but that's not such a bad thing...

> Well, this does touch the idea of having migrating processes from one
> machine to others in a network. In fact, I don't understand why is it so
> hard to reparent a process. If it can be reparented within a machine, then
> it can migrate to other machines as well, no?

A process can touch zillions of arbitrary resources, which may not BE there
on the other machine. Say you have an mmap into
"/usr/thingy/rutabega/arbitrary/database/filename.fred", and on the remote
machine fred is there and the contents are identical, but the directory
"arbitrary" is owned by the wrong user so you don't have permission to
descend into it (or the /etc/passwd file gives the same username a different
uid/assigns that uid to a different username...)

Or how about fifos: are they all there on the resume? Fifos are kind of
brain damaged and hard to re-use, so "create, two connects, delete"
is a pretty common strategy. The program has the initial setup and
negotiation code, but nothing to redo it on a restore. And can the processes
at each end be restored, in pairs, such that they still communicate with each
other properly? What about a process talking to a one-to-many server like X11
or apache or some such? Freezing the server to go with your client is kind of
overkill, eh? Gotta draw a line somewhere if you're going to cut out a running
process and stick it in an envelope...

The easy answer is to have the restore fail early and verbosely, and to have
attempt 0.1 only able to freeze and restore a fairly small subset of
processes (like the distributed.net client and equivalents that sit in their
corner twiddling their thumbs really fast), and then add on as you need more.
The wonderful world of shared library version skew is not something
checkpointing code should really HAVE to deal with; just fail if the
environment isn't pretty darn spotless and hand these problems over to the
"migration" utility.

If you're restoring back on top of the same set of mounted filesystems, and
you're only doing so once (freeze processes, reboot to new kernel, thaw
processes, discard checkpoint files), your problem gets much simpler. Still,
did your reboot wipe out stuff in /tmp that running processes need? (Hey, if
it's on shmfs and you didn't save it...)

Also, restoring one of these frozen processes has a certain amount of
security implications, doesn't it? All well and good to say "well the
process's file belongs to user 'barbie', and the saved uid matches, so load
it back in", except what if it was originally an suid executable so it
could bind to some resource and then drop privileges? How do you know some
user trying to attack the system didn't edit a frozen process file? You
pretty much have to cryptographically sign the files to allow non-root users
to load them back in (public key cryptography, not md5sum. Gotta be a secret
key, or a user, with your source code, could replicate the process of creating
one of these suckers with arbitrary contents in userspace...)
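
To illustrate the secret-key point: a restore tool could refuse any image
whose keyed MAC doesn't check out, along these lines (a toy using OpenSSL's
HMAC, built with -lcrypto; a real design might well use public-key
signatures instead, as suggested above):

/* Sketch: tag a checkpoint image with a keyed MAC so that only a holder
 * of the secret key (root, say) could have produced or re-signed it.
 * A plain checksum wouldn't help: anyone can recompute one. */
#include <stdio.h>
#include <string.h>
#include <openssl/hmac.h>

static int verify_checkpoint(const unsigned char *image, size_t image_len,
                             const unsigned char *stored_mac,
                             const unsigned char *key, int key_len)
{
    unsigned char mac[EVP_MAX_MD_SIZE];
    unsigned int mac_len = 0;

    HMAC(EVP_sha256(), key, key_len, image, image_len, mac, &mac_len);

    /* Match: produced by a key holder, not edited since.
     * Mismatch: refuse to restore. */
    return mac_len == 32 && memcmp(mac, stored_mac, mac_len) == 0;
}

int main(void)
{
    const unsigned char key[] = "demo-secret";        /* placeholder */
    unsigned char image[] = "frozen process bytes...";
    unsigned char mac[EVP_MAX_MD_SIZE];
    unsigned int mac_len = 0;

    HMAC(EVP_sha256(), key, sizeof(key) - 1,
         image, sizeof(image) - 1, mac, &mac_len);
    printf("verify: %d\n",
           verify_checkpoint(image, sizeof(image) - 1, mac,
                             key, sizeof(key) - 1));
    return 0;
}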

Again, less of a problem in a "trusted" environment, but this is Unix we're
talking about, and unless you're making an embedded system to put in a toaster
it will probably be attached to the internet. And another easy answer is
"don't do that then", or "only allow root to restore the suckers" (that last
one probably has to be the case anyway; make an suid executable to verify the
save files via a gpg signature if you REALLY want users to be able to do
this, i.e. shove this problem into user space... :)

> Rob, I am going to the Newark campus FYI, and have interests in some AI
> stuff.
> Thanks again,

I'm just trying to give you some idea of how much work you're in for. Then
again, Linus is on record as saying that if he knew how much work the kernel
would turn out to be, he probably never would have started it... :)

> Adi

Rob

2002-06-24 14:36:57

by Pavel Machek

Subject: Re: kernel upgrade on the fly

Hi!

> Nothing is impossible for anyone impervious to reason, and you might surprise
> us (it'd make a heck of a graduate project). Hot migration isn't IMPOSSIBLE,
> it's just a flipping pain in the ass. But the issue's a bit threadbare in
> these parts (somewhere between "are we there yet mommy?" and "can I buy a
> pony?").

Actually, getting a pony is easy compared to *this* ;-).

> The SANE answer always has been to just schedule some down time for the box.
> The insane answer involves giving an awful lot of money to Sun or IBM or some
> such for hot-pluggable backplanes. (How do you swap out THE BACKPLANE?
> That's an answer nobody seems to have...)

You have two backplanes and you use the other one during the switch?

> Clusters. Migrating tasks in the cluster, potentially similar problem. Look
> at mosix and the NUMA stuff as well, if you're actually serious
> about this.
> You have to reduce a process to its vital data, once all the resources you
> can peel away from it have been peeled away, swapped out, freed, etc. If you
> can suspend and save an individual running process to a disk image (just a
> file in the filesystem), in such a way that it can be individually re-loaded
> later (by the same kernel), you're halfway there. No, it's not as easy as it
> sounds. :)

Actually, if you can select a few "important" processes, and only care
about them, it can be done from userspace. Martin Mares did something
like that, involving ptrace() and lots of limitations.
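
The attach side of such a freezer might start out like the sketch below (a
guess at the ptrace() mechanics only; Martin Mares' actual tool isn't shown
here, so treat all of this as an assumption):

/* Sketch: stop a running process with ptrace() and capture its
 * registers, the first step of a userspace freezer.  Memory would be
 * read next (PTRACE_PEEKDATA, or /proc/PID/mem) before detaching. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    struct user_regs_struct regs;
    int status;
    pid_t pid;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    pid = atoi(argv[1]);

    /* Attach: the target is stopped and becomes our tracee. */
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) < 0) {
        perror("PTRACE_ATTACH");
        return 1;
    }
    waitpid(pid, &status, 0);        /* wait until it has stopped */

    /* Capture the register state a checkpoint would have to save. */
    if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) < 0) {
        perror("PTRACE_GETREGS");
    } else {
#if defined(__x86_64__)
        printf("pid %d stopped, rip=%#llx rsp=%#llx\n",
               (int)pid, regs.rip, regs.rsp);
#else
        printf("pid %d stopped, registers captured\n", (int)pid);
#endif
    }

    /* Detach resumes the target as if nothing had happened. */
    ptrace(PTRACE_DETACH, pid, NULL, NULL);
    return 0;
}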

> > I can see the advantage of such a thing when a server can have the kernel
> > upgraded (major or minor upgrade) without disrupting the ongoing services
> > (ok, maybe a small few-seconds delay). Another instance would be to
> > switch between different kernels in the /boot/ directory (for testing
> > purposes, etc.) without rebooting the machine.
>
> See "belling the cat". Yeah, it's a great idea. The implementation's the
> tricky bit.

My dictionary is too weak for this.
Pavel
--
(about SSSCA) "I don't say this lightly. However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa

2002-06-24 14:48:56

by Pavel Machek

Subject: Re: kernel upgrade on the fly

Hi!

> > >> has anybody worked or thought about a property to upgrade the kernel
> > >> while the system is running? ie. with all processes waiting in their
> > >> queues while the resident-older kernel gets replaced by a newer one.
> > >
> > >Thought about, yes. At length. That's why it hasn't been done. :)
> >
> > IMO the biggest reason it hasn't been done is the existence of
> > loadable modules. Most driver-type development work can be tested
> > without rebooting.
>
> That's part of it, sure. (And I'm sure the software suspend work is
> leveraging the ability to unload modules.)
>
> There's a dependency tree: processes need resources like mounted filesystems
> and open file handles to the network stack and such, and you can't unmount
> filesystems and unload devices while they're in use. Taking a running system
> apart and keeping track of the pieces needed to put it back together again is
> a bit of a challenge.

It depends on what limitations you can live with.

> The software suspend work can't freeze processes individually to separate
> files (that I know of), but I've heard blue-sky talk about potentially adding
> it. (Dunno what the actual plans are, Pavel Machek probably would).
> If

It's not software suspend's goal; something similar can be done from
userspace using ptrace; try googling for "freezer". Martin Mares did that.
> processes could be frozen in a somewhat kernel independent way (so that their
> run-time state was parsed in again in a known format and flung into any
> functioning kernel), then upgrading to a new kernel would just be a question
> of suspending all the processes you care about preserving, doing a two kernel
> monte, and restoring the processes. Migrating a process from one machine to
> another in a network cluster would be possible too.

Yeah, that would be very nice.

> Hmmm, what would be involved in serializing a process to disk? Obviously you
> start by sending it a suspend signal. There's the process stuff, of
> course.

You don't. You don't want the process being frozen to know it was
frozen. You just stop it in a special way.

> (Priority, etc.) That's not too bad. You'd need to record all the memory
> mappings (not just the contents of the physical and swapped out
> memory

Doable from userspace; it's in /proc.

> You'd need to record all the open file handles, of course. (For actual files
> this includes position in file, corresponding locks, etc. For the zillions
> of things that just LOOK like files, pipes and sockets and character and
> block devices, expect special case code).

There's not enough info in /proc to do this, I believe. Plus this is
nasty to restore -- like forcing code into the process's address space
to do the opening for you.
Pavel
--
(about SSSCA) "I don't say this lightly. However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa