LinuxLists.cc - A kexec approach to hibernation

2007-06-01 20:45:29

by Jeremy Maitin-Shepard

Subject: A kexec approach to hibernation

I figured I'd throw this idea out, since although it is not perfect, it
has the potential to elegantly solve a lot of issues with hibernate.

Just as kexec can now be used to write a crashdump after a kernel panic,
a fresh kexec-loaded kernel (loaded into unused memory) could be used to
write the hibernate image of the existing kernel to disk, and then power
off the system (or suspend to ram, or anything else). This avoids the
need for the original kernel to jump through hoops to hibernate itself
in place.

A hibernate sequence would be approximately as follows:

1. Free some memory if needed or desired, and disable the swap device
if it is going to be used to write the hibernate image.

2. Load the fresh kernel in a chunk of available (possibly
pre-allocated) memory (there must also be enough available memory
for this kernel to use).

3. Disable interrupts and stop all devices.

4. Jump to the new kernel, passing whatever state information will be
needed by it to know how to write the image.

5. The new kernel loads, and then either kernel space or user space
writes the necessary data from the old kernel to disk.

6. The new kernel either powers off or suspends to ram. If it suspends
to ram, then it would need to be able to jump back to the old kernel
when it resumes from ram.
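
The ordering of the six steps above can be modeled in userspace as a small sketch. This is an illustration only, not kernel code: the `Step` names and the `run_hibernate` helper are hypothetical, and each handler simply stands in for the real work of that step.

```python
from enum import Enum, auto


class Step(Enum):
    """The six steps of the proposed sequence, in order (names are hypothetical)."""
    FREE_MEMORY = auto()
    LOAD_SAVE_KERNEL = auto()
    STOP_DEVICES = auto()
    JUMP_TO_SAVE_KERNEL = auto()
    WRITE_IMAGE = auto()
    POWER_OFF_OR_SUSPEND = auto()


# Enum members iterate in definition order, which matches the list above.
SEQUENCE = list(Step)


def run_hibernate(handlers):
    """Run each step in order and stop at the first handler that returns False.

    `handlers` maps every Step to a zero-argument callable returning a bool.
    Returns the list of steps that completed successfully.
    """
    completed = []
    for step in SEQUENCE:
        if not handlers[step]():
            break
        completed.append(step)
    return completed
```

A failure in step 3, for example, leaves only the first two steps completed, which is the point at which the old kernel would still be able to back out.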

The advantages of this approach include:

- having a completely functional system (with a completely functional
userspace) from which the image is written, without having to worry
about messing up the state that is being saved (hell, the user could
even do it via an interactive shell on the new kernel);

- no need to worry about trying to use drivers while some processes are
frozen;

- no need for complicated process freezing; the same logic that can be
used for suspend to ram should be sufficient;

- no need for an atomic copy of memory, or any other complicated memory
copying; the memory of the old kernel, including the page cache, can
be written directly;

- instead of needing a significant amount of free memory to store the
atomic copy, only a few megabytes would be needed to load and run the
new kernel.

It may or may not be necessary to require that the new kernel used to
write the image is the same as the existing kernel; it will likely be
useful to require that it is built from the same sources and with a
similar config. It would likely be useful, however, to either compile
out or (e.g. via the kernel command-line) disable the initialization of
drivers that will not be needed to write the image, such as sound
drivers, cdrom drivers, filesystems, and network drivers (if the image
is not to be written via the network).

Of course, if special initialization was needed under the original
kernel to set up the devices that will be used to write the image, such
as device mapper setup, or network initialization, that will have to be
repeated under the new kernel as well. This is the principal
disadvantage to this approach, but since it must be done during resume
from hibernation in any case, it doesn't seem like a very significant
disadvantage. The other disadvantage is that there would be the
delay of loading the fresh kernel; this may, however, only take a second
or two, which is relatively insignificant compared to the time required
to actually write the image, and the delay could be reduced by stripping
out unnecessary drivers from the image-writing kernel.

--
Jeremy Maitin-Shepard

2007-06-01 21:33:41

by Rafael J. Wysocki

Subject: Re: A kexec approach to hibernation

On Friday, 1 June 2007 22:39, Jeremy Maitin-Shepard wrote:
> I figured I'd throw this idea out, since although it is not perfect, it
> has the potential to elegantly solve a lot of issues with hibernate.
>
> Just as kexec can now be used to write a crashdump after a kernel panic,
> a fresh kexec-loaded kernel (loaded into unused memory) could be used to
> write the hibernate image of the existing kernel to disk, and then power
> off the system (or suspend to ram, or anything else). This avoids the
> need for the original kernel to jump through hoops to hibernate itself
> in place.
>
> A hibernate sequence would be approximately as follows:
>
> 1. Free some memory if needed or desired, and disable the swap device
> if it is going to be used to write the hibernate image.

Why disable it?

> 2. Load the fresh kernel in a chunk of available (possibly
> pre-allocated) memory (there must also be enough available memory
> for this kernel to use).
>
> 3. Disable interrupts and stop all devices.

Well, this is one of the hardest parts of hibernation, so no advantage here.

> 4. Jump to the new kernel, passing whatever state information will be
> needed by it to know how to write the image.

How would we know which data to write (more precisely, which data to tell
the other kernel to write)? How do we pass this information to the new kernel?

> 5. The new kernel loads, and then either kernel space or user space
> writes the necessary data from the old kernel to disk.

You also need to reinitialize devices needed to write the image.

> 6. The new kernel either powers off or suspends to ram. If it suspends
> to ram, then it would need to be able to jump back to the old kernel
> when it resumes from ram.

What if the user wants to abort the hibernation?

> The advantages of this approach include:
>
> - having a completely functional system (with a completely functional
> userspace) from which the image is written, without having to worry
> about messing up the state that is being saved (hell, the user could
> even do it via an interactive shell on the new kernel);
>
> - no need to worry about trying to use drivers while some processes are
> frozen;

We're rather worried about running processes when the devices are frozen. ;-)

> - no need for complicated process freezing;

In fact it's not complicated, at least as far as the user land is concerned.

> the same logic that can be used for suspend to ram should be sufficient;
>
> - no need for an atomic copy of memory, or any other complicated memory
> copying; the memory of the old kernel, including the page cache, can
> be written directly;
>
> - instead of needing a significant amount of free memory to store the
> atomic copy, only a few megabytes would be needed to load and run the
> new kernel.

Yes, this sounds good in theory.

> It may or may not be necessary to require that the new kernel used to
> write the image is the same as the existing kernel; it will likely be
> useful to require that it is built from the same sources and with a
> similar config. It would likely be useful, however, to either compile
> out or (e.g. via the kernel command-line) disable the initialization of
> drivers that will not be needed to write the image, such as sound
> drivers, cdrom drivers, filesystems, and network drivers (if the image
> is not to be written via the network).

I think that, for average users, this would be difficult.

> Of course, if special initialization was needed under the original
> kernel to set up the devices that will be used to write the image, such
> as device mapper setup, or network initialization, that will have to be
> repeated under the new kernel as well. This is the principal
> disadvantage to this approach, but since it must be done during resume
> from hibernation in any case, it doesn't seem like a very significant
> disadvantage. The other disadvantage is that there would be the
> delay of loading the fresh kernel; this may, however, only take a second
> or two, which is relatively insignificant compared to the time required
> to actually write the image, and the delay could be reduced by stripping
> out unnecessary drivers from the image-writing kernel.

One more thing: How do we restore the system state?

Greetings,
Rafael

--
"Premature optimization is the root of all evil." - Donald Knuth

2007-06-01 22:25:46

by Jeremy Maitin-Shepard

Subject: Re: A kexec approach to hibernation

"Rafael J. Wysocki" writes:

> On Friday, 1 June 2007 22:39, Jeremy Maitin-Shepard wrote:
>> I figured I'd throw this idea out, since although it is not perfect, it
>> has the potential to elegantly solve a lot of issues with hibernate.
>>
>> Just as kexec can now be used to write a crashdump after a kernel panic,
>> a fresh kexec-loaded kernel (loaded into unused memory) could be used to
>> write the hibernate image of the existing kernel to disk, and then power
>> off the system (or suspend to ram, or anything else). This avoids the
>> need for the original kernel to jump through hoops to hibernate itself
>> in place.
>>
>> A hibernate sequence would be approximately as follows:
>>
>> 1. Free some memory if needed or desired, and disable the swap device
>> if it is going to be used to write the hibernate image.

> Why disable it?

To make sure that the swap data won't get clobbered by the writing of
the image, if the swap device is to be used to write the hibernate
image. Presumably something similar is already done. In any case this
is not an important point.

>> 2. Load the fresh kernel in a chunk of available (possibly
>> pre-allocated) memory (there must also be enough available memory
>> for this kernel to use).
>>
>> 3. Disable interrupts and stop all devices.

> Well, this is one of the hardest parts of hibernation, so no advantage
> here.

It seems like support for this is mostly already in place though, and it
needs to be done for suspend to ram, kexec, and shutdown anyway.

>> 4. Jump to the new kernel, passing whatever state information will be
>> needed by it to know how to write the image.

> How would we know which data to write (more precisely, which data to
> tell the other kernel to write)? How do we pass this information to
> the new kernel?

Just before jumping into the new kernel, with interrupts disabled, the
old kernel could either prepare a data structure that specifies what
pages are allocated, or alternatively simply provide a pointer to the
relevant data structure in the old kernel. I can't say exactly how this
data would be given to the new kernel, but I can't imagine it being
difficult. (For instance, multiboot headers, the kernel command line,
initrd, or some other mechanism could be used.)
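
As a rough illustration of the first option, a data structure that specifies which pages are allocated could be as simple as a list of page-frame runs. The sketch below is a userspace model only; the function name and the (first_pfn, count) layout are assumptions, not anything the kernel defines.

```python
def pages_to_extents(allocated):
    """Collapse a per-page allocation map into (first_pfn, count) runs.

    `allocated` is a sequence of booleans indexed by page frame number;
    True means the page belongs to the old kernel and must be saved.
    """
    extents = []
    start = None  # first PFN of the run currently being scanned, if any
    for pfn, in_use in enumerate(allocated):
        if in_use and start is None:
            start = pfn
        elif not in_use and start is not None:
            extents.append((start, pfn - start))
            start = None
    if start is not None:
        # The map ended in the middle of a run; close it.
        extents.append((start, len(allocated) - start))
    return extents
```

A compact list like this could plausibly fit in whatever boot-time channel is chosen, whereas passing a bare pointer would require the new kernel to understand the old kernel's internal layout.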

>> 5. The new kernel loads, and then either kernel space or user space
>> writes the necessary data from the old kernel to disk.

> You also need to reinitialize devices needed to write the image.

Yes. That would be done, as normal, when the kernel loads. Currently
devices are suspended or stopped anyway before the atomic copy, and then
reinitialized to write the image. In theory, this stopping shouldn't be
needed, and I mentioned that if additional support were added to some
drivers for passing some information about the current state of the
device, the device might only need to be partially shut down before
jumping to the new kernel. This might allow, for instance, avoiding
spinning the disks down and then up again.

>> 6. The new kernel either powers off or suspends to ram. If it suspends
>> to ram, then it would need to be able to jump back to the old kernel
>> when it resumes from ram.

> What if the user wants to abort the hibernation?

This would be handled in effectively the same way as if the user wants
to suspend to ram after writing the image: it would be necessary to jump
back to the old kernel. This would effectively be handled in the same
way as a resume, except that the copying back of memory would be
avoided. Presumably the image writing kernel would have devices in
approximately the same state as the image loading kernel, and so the old
kernel needs to be prepared to receive the devices in that state anyway.

>> The advantages of this approach include:
>>
>> - having a completely functional system (with a completely functional
>> userspace) from which the image is written, without having to worry
>> about messing up the state that is being saved (hell, the user could
>> even do it via an interactive shell on the new kernel);
>>
>> - no need to worry about trying to use drivers while some processes are
>> frozen;

> We're rather worried about running processes when the devices are
> frozen. ;-)

The point is, with this kexec approach, essentially no code at all runs
under the old kernel after the very initial steps of the hibernation
have begun, but any code, kernel or user, can run under the new kernel,
because the new kernel provides a completely functional system, while at
the same time not clobbering any of the memory of the old kernel. In
particular, it will be possible to write the image to a fuse file
system.

>> - no need for complicated process freezing;

> In fact it's not complicated, at least as far as the user land is
> concerned.

I think given all of the issues about whether to freeze kernel or user
tasks, which tasks to freeze, etc., it is hard to argue that there are
not complications, and many possible bugs and deadlocks. Linus made the
point that trying to freeze certain things and then continue using some
of the devices is fundamentally broken, because the freezer doesn't know
anything about dependencies.

>> the same logic that can be used for suspend to ram should be sufficient;
>>
>> - no need for an atomic copy of memory, or any other complicated memory
>> copying; the memory of the old kernel, including the page cache, can
>> be written directly;
>>
>> - instead of needing a significant amount of free memory to store the
>> atomic copy, only a few megabytes would be needed to load and run the
>> new kernel.

> Yes, this sounds good in theory.

>> It may or may not be necessary to require that the new kernel used to
>> write the image is the same as the existing kernel; it will likely be
>> useful to require that it is built from the same sources and with a
>> similar config. It would likely be useful, however, to either compile
>> out or (e.g. via the kernel command-line) disable the initialization of
>> drivers that will not be needed to write the image, such as sound
>> drivers, cdrom drivers, filesystems, and network drivers (if the image
>> is not to be written via the network).

> I think that, for average users, this would be difficult.

They could use the same kernel; it would simply be slower. The
distributions could perhaps help with this. If kernel command line
parameters could be used to disable certain subsystems, that should be
relatively easy for users.

>> Of course, if special initialization was needed under the original
>> kernel to set up the devices that will be used to write the image, such
>> as device mapper setup, or network initialization, that will have to be
>> repeated under the new kernel as well. This is the principal
>> disadvantage to this approach, but since it must be done during resume
>> from hibernation in any case, it doesn't seem like a very significant
>> disadvantage. The other disadvantage is that there would be the
>> delay of loading the fresh kernel; this may, however, only take a second
>> or two, which is relatively insignificant compared to the time required
>> to actually write the image, and the delay could be reduced by stripping
>> out unnecessary drivers from the image-writing kernel.

> One more thing: How do we restore the system state?

The "resume kernel" would be loaded at the same address as the "save
kernel" was loaded (it should probably be the same kernel), and then
have it copy back the image completely without any need for atomic
copies. It can then place the devices in the desired state, and jump to
it. The old kernel would then do what it needs to do with the devices,
and start running things again.
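
To illustrate why loading the resume kernel at the same address matters, here is a minimal userspace model of the copy-back. It assumes the image is a sequence of (pfn, data) records and that the resume kernel occupies a known range of page frames; both the record format and the `restore_image` name are hypothetical.

```python
def restore_image(records, memory, reserved):
    """Copy saved pages back into `memory`, a dict keyed by page frame number.

    `reserved` is the range of PFNs occupied by the resume kernel itself.
    If the resume kernel sits where the save kernel sat, no saved page should
    fall inside that range, so each page can be written straight to its
    final location with no atomic copy.
    """
    for pfn, data in records:
        if pfn in reserved:
            raise ValueError(f"page {pfn} overlaps the resume kernel")
        memory[pfn] = data
    return memory
```

If the resume kernel were loaded somewhere else, some saved pages could land on top of it, which is the case the check above rejects.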

Presumably it would be most convenient to have the normal boot loader
load the resume kernel directly at the desired address. The
disadvantage is that at the same time the image is written, something
would have to be done so that the boot loader would know to load the
resume kernel, rather than the normal kernel. (E.g. the image writing
kernel would need to modify the grub config file.) This shouldn't be a
significant problem in practice.

If resume fails, the resume kernel could just print an error message.
The user would be forced to manually tell the boot loader to not load
the resume kernel (perhaps by selecting a non-default menu option).
Alternatively, the resume kernel could just initialize all devices and
load the system as normal (this won't be possible if the resume kernel
was compiled with certain device support not needed for suspending
removed, though), or kexec another kernel.

It shouldn't matter too much how convenient it is when resume from
hibernate fails, because that shouldn't happen very often anyway.

--
Jeremy Maitin-Shepard

2007-06-01 23:09:21

by Rafael J. Wysocki

Subject: Re: A kexec approach to hibernation

On Saturday, 2 June 2007 00:25, Jeremy Maitin-Shepard wrote:
> "Rafael J. Wysocki" writes:
>
> > On Friday, 1 June 2007 22:39, Jeremy Maitin-Shepard wrote:
> >> I figured I'd throw this idea out, since although it is not perfect, it
> >> has the potential to elegantly solve a lot of issues with hibernate.
> >>
> >> Just as kexec can now be used to write a crashdump after a kernel panic,
> >> a fresh kexec-loaded kernel (loaded into unused memory) could be used to
> >> write the hibernate image of the existing kernel to disk, and then power
> >> off the system (or suspend to ram, or anything else). This avoids the
> >> need for the original kernel to jump through hoops to hibernate itself
> >> in place.
> >>
> >> A hibernate sequence would be approximately as follows:
> >>
> >> 1. Free some memory if needed or desired, and disable the swap device
> >> if it is going to be used to write the hibernate image.
>
> > Why disable it?
>
> To make sure that the swap data won't get clobbered by the writing of
> the image, if the swap device is to be used to write the hibernate
> image. Presumably something similar is already done. In any case this
> is not an important point.
>
> >> 2. Load the fresh kernel in a chunk of available (possibly
> >> pre-allocated) memory (there must also be enough available memory
> >> for this kernel to use).
> >>
> >> 3. Disable interrupts and stop all devices.
>
> > Well, this is one of the hardest parts of hibernation, so no advantage
> > here.
>
> It seems like support for this is mostly already in place though, and it
> needs to be done for suspend to ram, kexec, and shutdown anyway.
>
> >> 4. Jump to the new kernel, passing whatever state information will be
> >> needed by it to know how to write the image.
>
> > How would we know which data to write (more precisely, which data to
> > tell the other kernel to write)? How do we pass this information to
> > the new kernel?
>
> Just before jumping into the new kernel, with interrupts disabled, the
> old kernel could either prepare a data structure that specifies what
> pages are allocated, or alternatively simply provide a pointer to the
> relevant data structure in the old kernel.

But for this purpose the old kernel will actually need to do what is currently
done in swsusp while the image is being created (the only difference is that
we allocate memory in the process, but that's a detail only).

> I can't say exactly how this data would be given to the new kernel, but I
> can't imagine it being difficult. (For instance, multiboot headers, the
> kernel command line, initrd, or some other mechanism could be used.)

Besides, you need to load the new kernel somehow. If that's to work without
problems, that should be done before we switch off devices.

> >> 5. The new kernel loads, and then either kernel space or user space
> >> writes the necessary data from the old kernel to disk.
>
> > You also need to reinitialize devices needed to write the image.
>
> Yes. That would be done, as normal, when the kernel loads. Currently
> devices are suspended or stopped anyway before the atomic copy, and then
> reinitialized to write the image. In theory, this stopping shouldn't be
> needed, and I mentioned that if additional support were added to some
> drivers for passing some information about the current state of the
> device, the device might only need to be partially shut down before
> jumping to the new kernel. This might allow, for instance, avoiding
> spinning the disks down and then up again.

Well, I don't quite agree. I think that for this purpose we'll need devices to
be initialized from scratch by the new kernel, so the old kernel should put
them into states that allow this to be done.

We are going to implement something like this anyway, but that's a rather long
way to go.

> >> 6. The new kernel either powers off or suspends to ram. If it suspends
> >> to ram, then it would need to be able to jump back to the old kernel
> >> when it resumes from ram.
>
> > What if the user wants to abort the hibernation?
>
> This would be handled in effectively the same way as if the user wants
> to suspend to ram after writing the image: it would be necessary to jump
> back to the old kernel. This would effectively be handled in the same
> way as a resume, except that the copying back of memory would be
> avoided. Presumably the image writing kernel would have devices in
> approximately the same state as the image loading kernel, and so the old
> kernel needs to be prepared to receive the devices in that state anyway.

Please see above. I don't think that would be easy to arrange for.

> >> The advantages of this approach include:
> >>
> >> - having a completely functional system (with a completely functional
> >> userspace) from which the image is written, without having to worry
> >> about messing up the state that is being saved (hell, the user could
> >> even do it via an interactive shell on the new kernel);
> >>
> >> - no need to worry about trying to use drivers while some processes are
> >> frozen;
>
> > We're rather worried about running processes when the devices are
> > frozen. ;-)
>
> The point is, with this kexec approach, essentially no code at all runs
> under the old kernel after the very initial steps of the hibernation
> have begun, but any code, kernel or user, can run under the new kernel,
> because the new kernel provides a completely functional system, while at
> the same time not clobbering any of the memory of the old kernel. In
> particular, it will be possible to write the image to a fuse file
> system.

You need to be cautious here. You can't touch any filesystems mounted by
the old kernel, or they will be corrupted after the restore.

> >> - no need for complicated process freezing;
>
> > In fact it's not complicated, at least as far as the user land is
> > concerned.
>
> I think given all of the issues about whether to freeze kernel or user
> tasks, which tasks to freeze, etc., it is hard to argue that there are
> not complications, and many possible bugs and deadlocks. Linus made the
> point that trying to freeze certain things and then continue using some
> of the devices is fundamentally broken, because the freezer doesn't know
> anything about dependencies.

Yes, this applies to kernel threads and I agree with that, except that there
are kernel threads independent of any other tasks, so we don't need to worry
about them.

> >> the same logic that can be used for suspend to ram should be sufficient;
> >>
> >> - no need for an atomic copy of memory, or any other complicated memory
> >> copying; the memory of the old kernel, including the page cache, can
> >> be written directly;
> >>
> >> - instead of needing a significant amount of free memory to store the
> >> atomic copy, only a few megabytes would be needed to load and run the
> >> new kernel.
>
> > Yes, this sounds good in theory.
>
> >> It may or may not be necessary to require that the new kernel used to
> >> write the image is the same as the existing kernel; it will likely be
> >> useful to require that it is built from the same sources and with a
> >> similar config. It would likely be useful, however, to either compile
> >> out or (e.g. via the kernel command-line) disable the initialization of
> >> drivers that will not be needed to write the image, such as sound
> >> drivers, cdrom drivers, filesystems, and network drivers (if the image
> >> is not to be written via the network).
>
> > I think that, for average users, this would be difficult.
>
> They could use the same kernel; it would simply be slower. The
> distributions could perhaps help with this. If kernel command line
> parameters could be used to disable certain subsystems, that should be
> relatively easy for users.

Well, I don't agree here. My experience with the userland suspend is that such
things are generally very difficult for users to set up.

> >> Of course, if special initialization was needed under the original
> >> kernel to set up the devices that will be used to write the image, such
> >> as device mapper setup, or network initialization, that will have to be
> >> repeated under the new kernel as well. This is the principal
> >> disadvantage to this approach, but since it must be done during resume
> >> from hibernation in any case, it doesn't seem like a very significant
> >> disadvantage. The other disadvantage is that there would be the
> >> delay of loading the fresh kernel; this may, however, only take a second
> >> or two, which is relatively insignificant compared to the time required
> >> to actually write the image, and the delay could be reduced by stripping
> >> out unnecessary drivers from the image-writing kernel.
>
> > One more thing: How do we restore the system state?
>
> The "resume kernel" would be loaded at the same address as the "save
> kernel" was loaded (it should probably be the same kernel),

Well, we'd have to use a relocatable kernel for this purpose, it seems.

> and then have it copy back the image completely without any need for atomic
> copies. It can then place the devices in the desired state, and jump to
> it. The old kernel would then do what it needs to do with the devices,
> and start running things again.
>
> Presumably it would be most convenient to have the normal boot loader
> load the resume kernel directly at the desired address. The
> disadvantage is that at the same time the image is written, something
> would have to be done so that the boot loader would know to load the
> resume kernel, rather than the normal kernel. (E.g. the image writing
> kernel would need to modify the grub config file.)

No, it can't do that, unless the file is on a 'safe' filesystem.

> This shouldn't be a significant problem in practice.

I don't agree here.

> If resume fails, the resume kernel could just print an error message.
> The user would be forced to manually tell the boot loader to not load
> the resume kernel (perhaps by selecting a non-default menu option).
> Alternatively, the resume kernel could just initialize all devices and
> load the system as normal (this won't be possible if the resume kernel
> was compiled with certain device support not needed for suspending
> removed, though), or kexec another kernel.

Actually, I think we could restore in the same way as we do it right now,
if the image contained the right information (which I think it should anyway),
so that's not a big deal technically, but the current code would be needed
anyway.

> It shouldn't matter too much how convenient it is when resume from
> hibernate fails, because that shouldn't happen very often anyway.

Except when there's a problem to debug. ;-)

All in all, I think that the idea is worth considering, although the details
need to be clarified before we can try to implement it.

However, it really doesn't solve the basic problem that we have, which is
that we can't checkpoint filesystems easily. That's why we freeze processes
in the first place and you're suggesting to replace this with another, equally
if not more complicated, mechanism.

I see two basic advantages of your approach:
1) We don't need to freeze tasks.
2) We can create images larger than 50% of RAM.
Still, I don't think we could implement it quickly and easily.
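
Advantage 2 can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only a model: the function names are made up, and the 16 MiB reserved for the image-writing kernel is an assumed figure standing in for "a few megabytes".

```python
def max_image_atomic_copy(ram_mib):
    """Largest image when the atomic copy must sit in RAM beside the original.

    Every saved page needs a second page to hold its copy, so at most
    half of RAM can end up in the image.
    """
    return ram_mib // 2


def max_image_kexec(ram_mib, save_kernel_mib):
    """Largest image when only the image-writing kernel needs its own memory."""
    return ram_mib - save_kernel_mib
```

With 1024 MiB of RAM, the first model caps the image at 512 MiB, while the second allows 1008 MiB under the 16 MiB assumption.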

[And really, the freezing of tasks is not that horrible, although it may seem
so. ;-)]

Greetings,
Rafael

--
"Premature optimization is the root of all evil." - Donald Knuth

2007-06-01 23:54:58

by Jeremy Maitin-Shepard

Subject: Re: A kexec approach to hibernation

"Rafael J. Wysocki" writes:

> On Saturday, 2 June 2007 00:25, Jeremy Maitin-Shepard wrote:

[snip]

>> Just before jumping into the new kernel, with interrupts disabled, the
>> old kernel could either prepare a data structure that specifies what
>> pages are allocated, or alternatively simply provide a pointer to the
>> relevant data structure in the old kernel.

> But for this purpose the old kernel will actually need to do what is currently
> done in swsusp while the image is being created (the only difference is that
> we allocate memory in the process, but that's a detail only).

Okay, but creating a list of pages should be extremely easy.
Alternatively, the "save kernel" might be able to read the existing
data structures directly.

>> I can't say exactly how this data would be given to the new kernel, but I
>> can't imagine it being difficult. (For instance, multiboot headers, the
>> kernel command line, initrd, or some other mechanism could be used.)

> Besides, you need to load the new kernel somehow. If that's to work without
> problems, that should be done before we switch off devices.

Well, the new kernel can be loaded at any time, and would be done in
exactly the way kexec loads a kernel. It would probably make sense to
load the kernel into memory (but not jump to it) as the very first step
of hibernation.

>> >> 5. The new kernel loads, and then either kernel space or user space
>> >> writes the necessary data from the old kernel to disk.
>>
>> > You also need to reinitialize devices needed to write the image.
>>
>> Yes. That would be done, as normal, when the kernel loads. Currently
>> devices are suspended or stopped anyway before the atomic copy, and then
>> reinitialized to write the image. In theory, this stopping shouldn't be
>> needed, and I mentioned that if additional support were added to some
>> drivers for passing some information about the current state of the
>> device, the device might only need to be partially shut down before
>> jumping to the new kernel. This might allow, for instance, avoiding
>> spinning the disks down and then up again.

> Well, I don't quite agree. I think that for this purpose we'll need devices to
> be initialized from scratch by the new kernel, so the old kernel should put
> them into states that allow this to be done.

I agree that the default behavior should be to completely shut down the
devices. Later, special support could be added to select devices to
allow them to not be fully shut down.

> We are going to implement something like this anyway, but that's a rather long
> way to go.

>> >> 6. The new kernel either powers off or suspends to ram. If it suspends
>> >> to ram, then it would need to be able to jump back to the old kernel
>> >> when it resumes from ram.
>>
>> > What if the user wants to abort the hibernation?
>>
>> This would be handled in effectively the same way as if the user wants
>> to suspend to ram after writing the image: it would be necessary to jump
>> back to the old kernel. This would effectively be handled in the same
>> way as a resume, except that the copying back of memory would be
>> avoided. Presumably the image writing kernel would have devices in
>> approximately the same state as the image loading kernel, and so the old
>> kernel needs to be prepared to receive the devices in that state anyway.

> Please see above. I don't think that would be easy to arrange for.

In that case, the devices can indeed be fully shut down, at least
initially.

>> >> The advantages of this approach include:
>> >>
>> >> - having a completely functional system (with a completely functional
>> >> userspace) from which the image is written, without having to worry
>> >> about messing up the state that is being saved (hell, the user could
>> >> even do it via an interactive shell on the new kernel);
>> >>
>> >> - no need to worry about trying to use drivers while some processes are
>> >> frozen;
>>
>> > We're rather worried about running processes when the devices are
>> > frozen. ;-)
>>
>> The point is, with this kexec approach, essentially no code at all runs
>> under the old kernel after the very initial steps of the hibernation
>> have begun, but any code, kernel or user, can run under the new kernel,
>> because the new kernel provides a completely functional system, while at
>> the same time not clobbering any of the memory of the old kernel. In
>> particular, it will be possible to write the image to a fuse file
>> system.

> You need to be cautious here. You can't touch any filesystems mounted by
> the old kernel, or they will be corrupted after the restore.

Certainly. Note that any filesystems that are available to the "save
state" kernel would have been specifically mounted under that kernel.
There isn't any real possibility of confusion over which filesystems are
safe to access.

>> >> - no need for complicated process freezing;
>>
>> > In fact it's not complicated, at least as far as the user land is
>> > concerned.
>>
>> I think given all of the issues about whether to freeze kernel or user
>> tasks, which tasks to freeze, etc., it is hard to argue that there are
>> not complications, and many possible bugs and deadlocks. Linus made the
>> point that trying to freeze certain things and then continue using some
>> of the devices is fundamentally broken, because the freezer doesn't know
>> anything about dependencies.

> Yes, this applies to kernel threads and I agree with that, except that there
> are kernel threads independent of any other tasks, so we don't need to worry
> about them.

But kernel threads also rely on userspace, due to e.g. fuse and usermode
helpers.

[snip]

>> > One more thing: How do we restore the system state?
>>
>> The "resume kernel" would be loaded at the same address as the "save
>> kernel" was loaded (it should probably be the same kernel),

> Well, we'd have to use a relocatable kernel for this purpose, it
> seems.

Not necessarily relocatable (although that would be the usual
solution). It just needs to be loaded at a different address than the
normal kernel. If it isn't relocatable, the memory that would be needed
by the "save kernel" would have to be reserved at boot.

>> and then have it copy back the image completely without any need for atomic
>> copies. It can then place the devices in the desired state, and jump to
>> it. The old kernel would then do what it needs to do with the devices,
>> and start running things again.
>>
>> Presumably it would be most convenient to have the normal boot loader
>> load the resume kernel directly at the desired address. The
>> disadvantage is that at the same time the image is written, something
>> would have to be done so that the boot loader would know to load the
>> resume kernel, rather than the normal kernel. (E.g. the image writing
>> kernel would need to modify the grub config file.)

> No, it can't do that, unless the file is on a 'safe' filesystem

Grub, its configuration, and the kernel used to resume the system had
better be on a "safe" filesystem already (i.e. a separate /boot that is
unmounted before hibernation).

>> This shouldn't be a significant problem in practice.

> I don't agree here.

I think hibernate-script already includes support for modifying grub's
configuration.

>> If resume fails, the resume kernel could just print an error message.
>> The user would be forced to manually tell the boot loader to not load
>> the resume kernel (perhaps by selecting a non-default menu option).
>> Alternatively, the resume kernel could just initialize all devices and
>> load the system as normal (this won't be possible if the resume kernel
>> was compiled with certain device support not needed for suspending
>> removed, though), or kexec another kernel.

> Actually, I think we could restore in the same way as we do it right now,
> if the image contained the right information (which I think it should anyway),
> so that's not a big deal technically, but the current code would be needed
> anyway.

As far as I understand it, the swsusp resume path involves the boot
kernel loading the entire image from disk to available memory, then
shutting down all the devices, and copying the memory into place, and
then jumping to the original kernel, which reinitializes devices and
starts tasks running. This isn't very different from what I was
proposing as the alternative anyway, except that: memory is copied once,
which is pretty fast, but means that only up to half of the total memory
can be saved.

The suspend2 resume path is a bit more complicated, it seems: it copies
some of the kernel memory off disk to available memory (everything
except page caches), shuts down devices, copies that memory in place,
jumps to it, and then the original kernel reinitializes devices, and
copies the remaining pages off disk into place directly. Having the
original kernel take care of copying some memory back would be totally
incompatible with my proposal, since the original kernel would not be
prepared to access the device containing the hibernate image (since it
never accessed the device during hibernation).

>> It shouldn't matter too much how convenient it is when resume from
>> hibernate fails, because that shouldn't happen very often anyway.

> Except when there's a problem to debug. ;-)

The gain would be a faster resume from hibernation, because there would
be no need to load unnecessary drivers. The disadvantage is that
instead of being able to just go on and boot a normal system if resuming
fails, it would be necessary to reboot. I don't think that presents a
significant inconvenience.

> All in all, I think that the idea is worth considering, although the details
> need to be clarified before we can try to implement it.

> However, it really doesn't solve the basic problem that we have, which is
> that we can't checkpoint filesystems easily. That's why we freeze processes
> in the first place and you're suggesting to replace this with another, equally
> if not more complicated, mechanism.

The whole reason to want to checkpoint filesystems was so that the
original kernel would remain a fully-functional system with a
fully-functional userspace that can continue to access the filesystems
while the hibernate image is being written. In addition to the lack of
checkpoint support, however, there are a number of other issues that
this would create: Even if you can checkpoint filesystems, you can't
checkpoint the entire world. The kernel will keep acking network
packets, and userspace as well will send any normal replies. If a
document was sent off to be printed right before the checkpoint, it
might end up printing while the image is being saved, and then printed
again when the system resumes.

Fundamentally, I don't think checkpointing is the right answer. What is
desired is a fully functional system with a fully functional userspace
during the image writing. But we don't want this to be the _same_
system that is actually being imaged.

That is why I think the kexec solution is the elegant solution.

> I see two basic advantages of your approach:
> 1) We don't need to freeze tasks.
> 2) We can create images larger than 50% of RAM.

There is also the key benefit of allowing an arbitrary userspace in a
fully functional system to be used to both save and load the image. As
far as I understand, uswsusp allows a single userspace process to run
to handle the loading and saving, but the process runs in a rather
fragile userspace with most things disabled; in particular, this
userspace process can't access a fuse filesystem and probably can't do
other things like fork.

> Still, I don't think we could implement it quickly and easily.

It is hard to say how hard it would be. I think a lot of the existing
kexec and hibernate code could be leveraged.

> [And really, the freezing of tasks is not that horrible, although it may seem
> so. ;-)]

--
Jeremy Maitin-Shepard

2007-06-02 00:28:20

by Rafael J. Wysocki

Subject: Re: A kexec approach to hibernation

On Saturday, 2 June 2007 01:54, Jeremy Maitin-Shepard wrote:
> "Rafael J. Wysocki" <[email protected]> writes:
>
> > On Saturday, 2 June 2007 00:25, Jeremy Maitin-Shepard wrote:
>
> [snip]
>
> >> Just before jumping into the new kernel, with interrupts disabled, the
> >> old kernel could either prepare a data structure that specifies what
> >> pages are allocated, or alternatively simply provide a pointer to the
> >> relevant data structure in the old kernel.
>
> > But for this purpose the old kernel will actually need to do what is currently
> > done in swsusp while the image is being created (the only difference is that
> > we allocate memory in the process, but that's a detail only).
>
> Okay, but creating a list of pages should be extremely easy.
> Alternatively, the "save kernel" might be able to read the existing
> data structures directly.
>
> >> I can't say exactly how this data would be given to the new kernel, but I
> >> can't imagine it being difficult. (For instance, multiboot headers, the
> >> kernel command line, initrd, or some other mechanism could be used.)
>
> > Besides, you need to load the new kernel somehow. If that's to work without
> > problems, that should be done before we switch off devices.
>
> Well, the new kernel can be loaded at any time,

No. By reading from a filesystem, you're modifying its metadata (in
general, of course).

> and would be done in exactly the way kexec loads a kernel. It would probably
> make sense to load the kernel into memory (but not jump to it) as the very
> first step of hibernation.

I think you'd have to do that.

> >> >> 5. The new kernel loads, and then either kernel space or user space
> >> >> writes the necessary data from the old kernel to disk.
> >>
> >> > You also need to reinitialize devices needed to write the image.
> >>
> >> Yes. That would be done, as normal, when the kernel loads. Currently
> >> devices are suspended or stopped anyway before the atomic copy, and then
> >> reinitialized to write the image. In theory, this stopping shouldn't be
> >> needed, and I mentioned that if additional support were added to some
> >> drivers for passing some information about the current state of the
> >> device, the device might only need to be partially shut down before
> >> jumping to the new kernel. This might allow, for instance, avoiding
> >> spinning down and then up again the disks.
>
> > Well, I don't quite agree. I think that for this purpose we'll need devices to
> > be initialized from scratch by the new kernel, so the old kernel should put
> > them into states that allow this to be done.
>
> I agree that the default behavior should be to completely shut down the
> devices. Later, special support could be added to select devices to
> allow them to not be fully shut down.
>
> > We are going to implement something like this anyway, but that's a rather long
> > way to go.
>
> >> >> 6. The new kernel either powers off or suspends to ram. If it suspends
> >> >> to ram, then it would need to be able to jump back to the old kernel
> >> >> when it resumes from ram.
> >>
> >> > What if the user wants to abort the hibernation?
> >>
> >> This would be handled in effectively the same way as if the user wants
> >> to suspend to ram after writing the image: it would be necessary to jump
> >> back to the old kernel. This would effectively be handled in the same
> >> way as a resume, except that the copying back of memory would be
> >> avoided. Presumably the image writing kernel would have devices in
> >> approximately the same state as the image loading kernel, and so the old
> >> kernel needs to be prepared to receive the devices in that state anyway.
>
> > Please see above. I don't think that would be easy to arrange for.
>
> In that case, the devices can indeed be fully shut down, at least
> initially.
>
> >> >> The advantages of this approach include:
> >> >>
> >> >> - having a completely functional system (with a completely functional
> >> >> userspace) from which the image is written, without having to worry
> >> >> about messing up the state that is being saved (hell, the user could
> >> >> even do it via an interactive shell on the new kernel);
> >> >>
> >> >> - no need to worry about trying to use drivers while some processes are
> >> >> frozen;
> >>
> >> > We're rather worried about running processes when the devices are
> >> > frozen. ;-)
> >>
> >> The point is, with this kexec approach, essentially no code at all runs
> >> under the old kernel after the very initial steps of the hibernation
> >> have begun, but any code, kernel or user, can run under the new kernel,
> >> because the new kernel provides a completely functional system, while at
> >> the same time not clobbering any of the memory of the old kernel. In
> >> particular, it will be possible to write the image to a fuse file
> >> system.
>
> > You need to be cautious here. You can't touch any filesystems mounted by
> > the old kernel, or they will be corrupted after the restore.
>
> Certainly. Note that any filesystems that are available to the "save
> state" kernel would have been specifically mounted under that kernel.
> There isn't any real possibility of confusion over which filesystems are
> safe to access.
>
> >> >> - no need for complicated process freezing;
> >>
> >> > In fact it's not complicated, at least as far as the user land is
> >> > concerned.
> >>
> >> I think given all of the issues about whether to freeze kernel or user
> >> tasks, which tasks to freeze, etc., it is hard to argue that there are
> >> not complications, and many possible bugs and deadlocks. Linus made the
> >> point that trying to freeze certain things and then continue using some
> >> of the devices is fundamentally broken, because the freezer doesn't know
> >> anything about dependencies.
>
> > Yes, this applies to kernel threads and I agree with that, except that there
> > are kernel threads independent of any other tasks, so we don't need to worry
> > about them.
>
> But kernel threads also rely on userspace, due to e.g. fuse and usermode
> helpers.

Yes, I know that and I think these issues are solvable within the current
approach.

> [snip]
>
> >> > One more thing: How do we restore the system state?
> >>
> >> The "resume kernel" would be loaded at the same address as the "save
> >> kernel" was loaded (it should probably be the same kernel),
>
> > Well, we'd have to use a relocatable kernel for this purpose, it
> > seems.
>
> Not necessarily relocatable (although that would be the usual
> solution). It just needs to be loaded at a different address than the
> normal kernel.

AFAICS, you can't do that with a kernel which is not relocatable (you can load
it, of course, but will it work then?).

> If it isn't relocatable, the memory that would be needed by the "save kernel"
> would have to be reserved at boot.

That doesn't seem to be realistic to me.

> >> and then have it copy back the image completely without any need for atomic
> >> copies. It can then place the devices in the desired state, and jump to
> >> it. The old kernel would then do what it needs to do with the devices,
> >> and start running things again.
> >>
> >> Presumably it would be most convenient to have the normal boot loader
> >> load the resume kernel directly at the desired address. The
> >> disadvantage is that at the same time the image is written, something
> >> would have to be done so that the boot loader would know to load the
> >> resume kernel, rather than the normal kernel. (E.g. the image writing
> >> kernel would need to modify the grub config file.)
>
> > No, it can't do that, unless the file is on a 'safe' filesystem
>
> Grub, its configuration, and the kernel used to resume the system had
> better be on a "safe" filesystem already (i.e. a separate /boot that is
> unmounted before hibernation).

Currently, you don't need to do that.

> >> This shouldn't be a significant problem in practice.
>
> > I don't agree here.
>
> I think hibernate-script already includes support for modifying grub's
> configuration.

Yes. It does that _before_ the hibernation begins. ;-)

> >> If resume fails, the resume kernel could just print an error message.
> >> The user would be forced to manually tell the boot loader to not load
> >> the resume kernel (perhaps by selecting a non-default menu option).
> >> Alternatively, the resume kernel could just initialize all devices and
> >> load the system as normal (this won't be possible if the resume kernel
> >> was compiled with certain device support not needed for suspending
> >> removed, though), or kexec another kernel.
>
> > Actually, I think we could restore in the same way as we do it right now,
> > if the image contained the right information (which I think it should anyway),
> > so that's not a big deal technically, but the current code would be needed
> > anyway.
>
> As far as I understand it, the swsusp resume path involves the boot
> kernel loading the entire image from disk to available memory, then
> shutting down all the devices, and copying the memory into place, and
> then jumping to the original kernel, which reinitializes devices and
> starts tasks running. This isn't very different from what I was
> proposing as the alternative anyway, except that: memory is copied once,
> which is pretty fast, but means that only up to half of the total memory
> can be saved.

No, that's not correct. Actually, during the restore we _can_ load much more
than 50% of RAM, everything needed for that is already in place. :-)

The problem is that we can't _save_ more than 50% of RAM without either doing
suspend2's trick, or saving some memory contents directly from where they are.

Currently, we need to copy each page before saving it, which I admit is a bit
wasteful, but it generally is difficult to say which pages can be saved without
copying (your approach solves this problem quite radically). If we knew
that, we would be able to create even very large images just fine.
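
As a rough illustration of the 50% limit discussed above, here is a minimal
sketch of the arithmetic. It is not swsusp code; the function names and the
example sizes are hypothetical, and it ignores compression and the kernel's
own bookkeeping overhead.

```python
def max_image_pages_copy_then_save(total_pages: int) -> int:
    """Upper bound on the image size when every page is copied before saving.

    Each saved page needs one free page to hold its copy, so
    saved + copies <= total with copies == saved gives saved <= total // 2.
    """
    return total_pages // 2


def max_image_pages_second_kernel(total_pages: int, reserved_pages: int) -> int:
    """Upper bound when a separate kernel saves pages from where they are.

    Assumption: only the memory reserved for the second ("save") kernel is
    unavailable to the system being imaged.
    """
    if not 0 <= reserved_pages <= total_pages:
        raise ValueError("reserved_pages must be between 0 and total_pages")
    return total_pages - reserved_pages


if __name__ == "__main__":
    total = 262144      # hypothetical: 1 GiB of RAM in 4 KiB pages
    reserved = 16384    # hypothetical: 64 MiB set aside for the save kernel
    print(max_image_pages_copy_then_save(total))
    print(max_image_pages_second_kernel(total, reserved))
```

With these made-up numbers, the copy-then-save bound is half of RAM, while the
second-kernel bound is limited only by the reserved region.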

> The suspend2 resume path is a bit more complicated, it seems: it copies
> some of the kernel memory off disk to available memory (everything
> except page caches), shuts down devices, copies that memory in place,
> jumps to it, and then the original kernel reinitializes devices, and
> copies the remaining pages off disk into place directly. Having the
> original kernel take care of copying some memory back would be totally
> incompatible with my proposal, since the original kernel would not be
> prepared to access the device containing the hibernate image (since it
> never accessed the device during hibernation).
>
> >> It shouldn't matter too much how convenient it is when resume from
> >> hibernate fails, because that shouldn't happen very often anyway.
>
> > Except when there's a problem to debug. ;-)
>
> The gain would be a faster resume from hibernation, because there would
> be no need to load unnecessary drivers. The disadvantage is that
> instead of being able to just go on and boot a normal system if resuming
> fails, it would be necessary to reboot. I don't think that presents a
> significant inconvenience.
>
> > All in all, I think that the idea is worth considering, although the details
> > need to be clarified before we can try to implement it.
>
> > However, it really doesn't solve the basic problem that we have, which is
> > that we can't checkpoint filesystems easily. That's why we freeze processes
> > in the first place and you're suggesting to replace this with another, equally
> > if not more complicated, mechanism.
>
> The whole reason to want to checkpoint filesystems was so that the
> original kernel would remain a fully-functional system with a
> fully-functional userspace that can continue to access the filesystems
> while the hibernate image is being written. In addition to the lack of
> checkpoint support, however, there are a number of other issues that
> this would create: Even if you can checkpoint filesystems, you can't
> checkpoint the entire world. The kernel will keep acking network
> packets, and userspace as well will send any normal replies. If a
> document was sent off to be printed right before the checkpoint, it
> might end up printing while the image is being saved, and then printed
> again when the system resumes.

That's correct.

> Fundamentally, I don't think checkpointing is the right answer. What is
> desired is a fully functional system with a fully functional userspace
> during the image writing. But we don't want this to be the _same_
> system that is actually being imaged.
>
> That is why I think the kexec solution is the elegant solution.

Frankly, I think it's tricky. ;-) Moreover, I think it would require solving
some problems that we don't even anticipate.

> > I see two basic advantages of your approach:
> > 1) We don't need to freeze tasks.
> > 2) We can create images larger than 50% of RAM.
>
> There is also the key benefit of allowing an arbitrary userspace in a
> fully functional system to be used to both save and load the image. As
> far as I understand, uswsusp allows a single userspace process to run
> to handle the loading and saving, but the process runs in a rather
> fragile userspace with most things disabled; in particular, this
> userspace process can't access a fuse filesystem and probably can't do
> other things like fork.

The user space running on top of the new kernel would be limited by the
fact that the old kernel's filesystems would be inaccessible to it. That
would, effectively, require the user to have special filesystems for the
image-saving kernel and its user space, which isn't very realistic IMO.

> > Still, I don't think we could implement it quickly and easily.
>
> It is hard to say how hard it would be. I think a lot of the existing
> kexec and hibernate code could be leveraged.

Yes, I think so, but at least we need to fix the quiescing of devices before
we think of implementing that.

Greetings,
Rafael

--
"Premature optimization is the root of all evil." - Donald Knuth

2007-06-02 01:59:51

by Jeremy Maitin-Shepard

Subject: Re: A kexec approach to hibernation

"Rafael J. Wysocki" <[email protected]> writes:

>> But kernel threads also rely on userspace, due to e.g. fuse and usermode
>> helpers.

> Yes, I know that and I think these issues are solvable within the current
> approach.

It seems like it would be very hard to get writing of an image to a
fuse filesystem working under the current scheme.

Trying to image a system while it is running seems fundamentally broken.
As another example, I believe that currently, although devices are
"quiesced" or stopped while the atomic snapshot is made, they are all
started again afterward while the image is written to disk. As a result, the
network drivers will continue acking TCP packets that are received after
the snapshot, but these packets will be lost.

You might claim then that the solution is to simply keep the network
driver quiesced or stopped. But then it is impossible to write the
image over the network. The way to get around this problem is to write
the image over the network using a fresh network stack.

>> [snip]
>>
>> >> > One more thing: How do we restore the system state?
>> >>
>> >> The "resume kernel" would be loaded at the same address as the "save
>> >> kernel" was loaded (it should probably be the same kernel),
>>
>> > Well, we'd have to use a relocatable kernel for this purpose, it
>> > seems.
>>
>> Not necessarily relocatable (although that would be the usual
>> solution). It just needs to be loaded at a different address than the
>> normal kernel.

> AFAICS, you can't do that with a kernel which is not relocatable (you can load
> it, of course, but will it work then?).

I seem to recall that recent kernel versions support both a relocatable
kernel and non-relocatable kernels that load at a non-standard address.

>> If it isn't relocatable, the memory that would be needed by the "save kernel"
>> would have to be reserved at boot.

> That doesn't seem to be realistic to me.

Okay. I don't see why there would be a problem with using a relocatable
kernel though.

[snip]

>> >> Presumably it would be most convenient to have the normal boot loader
>> >> load the resume kernel directly at the desired address. The
>> >> disadvantage is that at the same time the image is written, something
>> >> would have to be done so that the boot loader would know to load the
>> >> resume kernel, rather than the normal kernel. (E.g. the image writing
>> >> kernel would need to modify the grub config file.)
>>
>> > No, it can't do that, unless the file is on a 'safe' filesystem
>>
>> Grub, its configuration, and the kernel used to resume the system had
>> better be on a "safe" filesystem already (i.e. a separate /boot that is
>> unmounted before hibernation).

> Currently, you don't need to do that.

Some people get away with it, but fundamentally it is broken to do so.
(The fact that the current software suspend implementations tell the
filesystems to sync to disk increases its chances of working.) You are
accessing a filesystem that is in an unknown state. Consider that the
user might make a change to grub.conf, but the kernel caches the write.
If the filesystem containing grub.conf is left mounted, the write might
never reach disk before the system is hibernated. As a result, when
grub attempts to read it, it doesn't get the expected data.

>> >> This shouldn't be a significant problem in practice.
>>
>> > I don't agree here.
>>
>> I think hibernate-script already includes support for modifying grub's
>> configuration.

> Yes. It does that _before_ the hibernation begins. ;-)

Either way, it doesn't make much difference. Inside of
hibernate-script, you need logic like:

    if /boot is not mounted: mount /boot
    make change
    umount /boot

If you do it from the "save kernel", you need logic like:

    mount /dev/boot-device /boot   (no fstab on "save kernel", most likely)
    make change
    umount /boot
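
To make the comparison concrete, the two cases above can be sketched as a
small dry-run planner that only builds the list of steps and runs nothing.
This is an illustration, not code from hibernate-script: the function name,
the "edit-boot-config" placeholder step, and the device path are all
hypothetical.

```python
from typing import List, Optional


def grub_update_steps(from_save_kernel: bool,
                      boot_device: Optional[str] = None,
                      boot_mounted: bool = False) -> List[List[str]]:
    """Return the steps needed to change the boot loader config on /boot.

    Nothing is executed; the result is just a plan. "edit-boot-config" is a
    placeholder for whatever change is made to the boot loader configuration.
    """
    steps: List[List[str]] = []
    if from_save_kernel:
        # The save kernel most likely has no fstab, so the device is explicit.
        if boot_device is None:
            raise ValueError("boot_device is required in the save kernel")
        steps.append(["mount", boot_device, "/boot"])
    elif not boot_mounted:
        # The running system can rely on its fstab entry for /boot.
        steps.append(["mount", "/boot"])
    steps.append(["edit-boot-config"])
    # In both cases /boot is left unmounted, so it is "safe" at resume time.
    steps.append(["umount", "/boot"])
    return steps


if __name__ == "__main__":
    print(grub_update_steps(from_save_kernel=False, boot_mounted=False))
    print(grub_update_steps(from_save_kernel=True,
                            boot_device="/dev/boot-device"))
```

The two plans differ only in the mount step, which is the point being made:
where the change is performed matters little as long as /boot ends up
unmounted.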

[snip]

>> As far as I understand it, the swsusp resume path involves the boot
>> kernel loading the entire image from disk to available memory, then
>> shutting down all the devices, and copying the memory into place, and
>> then jumping to the original kernel, which reinitializes devices and
>> starts tasks running. This isn't very different from what I was
>> proposing as the alternative anyway, except that: memory is copied once,
>> which is pretty fast, but means that only up to half of the total memory
>> can be saved.

> No, that's not correct. Actually, during the restore we _can_ load much more
> than 50% of RAM, everything needed for that is already in place. :-)

I suppose you do that by using more sophisticated logic to atomically
copy the pages to their final location after loading them from disk. In
particular, I suppose you must order the page copies carefully to avoid
clobbering pages that have not yet been copied. Seems reasonable. In
that case, there is indeed probably no reason to not use that approach
for resuming.
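
The ordering idea supposed above can be sketched as follows. This is only an
illustration of "order the copies so nothing is clobbered before it is
copied"; it makes no claim about how swsusp actually does it. The function
names, the page-number representation, and the single spare "scratch" page
are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def order_page_copies(moves: Dict[int, int], scratch: int) -> List[Tuple[int, int]]:
    """Order page copies so that no page is overwritten before it is read.

    moves maps destination page -> source page. Assumptions: each source
    feeds exactly one destination, and `scratch` is a spare page that is
    neither a source nor a destination. Returns (src, dst) copies in a
    safe order.
    """
    pending = {dst: src for dst, src in moves.items() if dst != src}
    ordered: List[Tuple[int, int]] = []
    while pending:
        sources = set(pending.values())
        # A destination is safe to overwrite once nothing still reads from it.
        ready = [dst for dst in pending if dst not in sources]
        if ready:
            for dst in ready:
                ordered.append((pending.pop(dst), dst))
        else:
            # Only cycles remain: park one page in scratch to break a cycle.
            dst, src = next(iter(pending.items()))
            ordered.append((src, scratch))
            pending[dst] = scratch
    return ordered


def apply_copies(memory: Dict[int, str], copies: List[Tuple[int, int]]) -> None:
    """Simulate the copies on a page-number -> contents mapping."""
    for src, dst in copies:
        memory[dst] = memory[src]


if __name__ == "__main__":
    memory = {0: "A", 1: "B", 2: "C", 9: ""}   # page 9 is the scratch page
    wanted = {1: 0, 2: 1, 0: 2}                 # a three-page cycle
    apply_copies(memory, order_page_copies(wanted, scratch=9))
    print(memory[0], memory[1], memory[2])
```

A plain chain of moves needs no scratch page at all; the spare page is used
only when the remaining moves form a cycle.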

[snip]

>> The whole reason to want to checkpoint filesystems was so that the
>> original kernel would remain a fully-functional system with a
>> fully-functional userspace that can continue to access the filesystems
>> while the hibernate image is being written. In addition to the lack of
>> checkpoint support, however, there are a number of other issues that
>> this would create: Even if you can checkpoint filesystems, you can't
>> checkpoint the entire world. The kernel will keep acking network
>> packets, and userspace as well will send any normal replies. If a
>> document was sent off to be printed right before the checkpoint, it
>> might end up printing while the image is being saved, and then printed
>> again when the system resumes.

> That's correct.

>> Fundamentally, I don't think checkpointing is the right answer. What is
>> desired is a fully functional system with a fully functional userspace
>> during the image writing. But we don't want this to be the _same_
>> system that is actually being imaged.
>>
>> That is why I think the kexec solution is the elegant solution.

> Frankly, I think it's tricky. ;-)

To me, it seems a lot easier to get right than the current approaches.

> Moreover, I think it would require solving some problems that we don't
> even anticipate.

Possibly. The alternative, though, seems to be to add hack after hack
to get certain functionality to work.

>> > I see two basic advantages of your approach:
>> > 1) We don't need to freeze tasks.
>> > 2) We can create images larger than 50% of RAM.
>>
>> There is also the key benefit of allowing an arbitrary userspace in a
>> fully functional system to be used to both save and load the image. As
>> far as I understand, uswsusp allows a single userspace process to run
>> to handle the loading and saving, but the process runs in a rather
>> fragile userspace with most things disabled; in particular, this
>> userspace process can't access a fuse filesystem and probably can't do
>> other things like fork.

> The user space running on top of the new kernel would be limited by the
> fact that the old kernel's filesystems would be inaccessible to it. That
> would, effectively, require the user to have special filesystems for the
> image-saving kernel and its user space, which isn't very realistic
> IMO.

Fundamentally, saving of the image can't access any of the normal
filesystems anyway. The userspace would likely be provided as an
initramfs or initrd, exactly as is done for userspace resume from
hibernate currently. The same initramfs could probably be used for both
saving the image and restoring the image, since exactly the same
procedure would be used to set up the necessary devices for both the
save and restore case, and the GUI that is used might also be the same.

>> > Still, I don't think we could implement it quickly and easily.
>>
>> It is hard to say how hard it would be. I think a lot of the existing
>> kexec and hibernate code could be leveraged.

> Yes, I think so, but at least we need to fix the quiescing of devices before
> we think of implementing that.

It seems like fixing of device stopping/suspend/quiescing is an
orthogonal issue to the actual hibernate implementation. It would
probably be most reliable and simplest if on every jump between kernels,
all devices are fully stopped by the jumping kernel, and then fully
reinitialized by the jumped-to kernel. Presumably the time spent doing
this initialization will not be very significant compared to the time
required to write the image.

--
Jeremy Maitin-Shepard

2007-06-02 09:16:38

by Rafael J. Wysocki

Subject: Re: A kexec approach to hibernation

On Saturday, 2 June 2007 03:54, Jeremy Maitin-Shepard wrote:
> "Rafael J. Wysocki" <[email protected]> writes:
>
> >> But kernel threads also rely on userspace, due to e.g. fuse and usermode
> >> helpers.
>
> > Yes, I know that and I think these issues are solvable within the current
> > approach.
>
> It seems like it would be very hard to get writing of an image to a
> fuse filesystem working under the current scheme.

If the filesystem is located on a separate partition, that should be doable
(not that I'm going to try it in the foreseeable future).

> Trying to image a system while it is running seems fundamentally broken.

I don't agree.

> As another example, I believe currently although devices are "quiesced"
> or stopped while the atomic snapshot is made, they are all then started
> again afterward while the image is written to disk. As a result, the
> network drivers will continue acking TCP packets that are received after
> the snapshot, but these packets will be lost.
>
> You might claim then that the solution is to simply keep the network
> driver quiesced or stopped. But then it is impossible to write the
> image over the network. The way to get around this problem is to write
> the image over the network using a fresh network stack.

Can we just take the interface down and bring it up just for writing the image,
possibly with another IP address? We don't need another kernel to do
that.

> >> [snip]
> >>
> >> >> > One more thing: How do we restore the system state?
> >> >>
> >> >> The "resume kernel" would be loaded at the same address as the "save
> >> >> kernel" was loaded (it should probably be the same kernel),
> >>
> >> > Well, we'd have to use a relocatable kernel for this purpose, it
> >> > seems.
> >>
> >> Not necessarily relocatable (although that would be the usual
> >> solution). It just needs to be loaded at a different address than the
> >> normal kernel.
>
> > AFAICS, you can't do that with a kernel which is not relocatable (you can load
> > it, of course, but will it work then?).
>
> I seem to recall in recent kernel versions support for both a
> relocatable kernel and also support for non-relocatable kernels which
> load at a non-standard address.
>
> >> If it isn't relocatable, the memory that would be needed by the "save kernel"
> >> would have to be reserved at boot.
>
> > That doesn't seem to be realistic to me.
>
> Okay. I don't see why there would be a problem with using a relocatable
> kernel though.
>
> [snip]
>
> >> >> Presumably it would be most convenient to have the normal boot loader
> >> >> load the resume kernel directly at the desired address. The
> >> >> disadvantage is that at the same time the image is written, something
> >> >> would have to be done so that the boot loader would know to load the
> >> >> resume kernel, rather than the normal kernel. (E.g. the image writing
> >> >> kernel would need to modify the grub config file.)
> >>
> >> > No, it can't do that, unless the file is on a 'safe' filesystem
> >>
> >> Grub, its configuration, and the kernel used to resume the system had
> >> better be on a "safe" filesystem already (i.e. a separate, unmounted
> >> before hibernation /boot).
>
> > Currently, you don't need to do that.
>
> Some people get away with it, but fundamentally it is broken to do so.
> (The fact that the current software suspend implementations tell the
> filesystems to sync to disk increases its chances of working.) You are
> accessing a filesystem that is in an unknown state. Consider that the
> user might make a change to grub.conf, but the kernel caches the write.
> If the filesystem containing grub.conf is left mounted, the write might
> never reach disk before the system is hibernated. As a result, when
> grub attempts to read it, it doesn't get the expected data.

Yes, that can happen in theory (and I believe it's happened for some XFS
users), but _most_ often it just works and people are used to doing it.

> >> >> This shouldn't be a significant problem in practice.
> >>
> >> > I don't agree here.
> >>
> >> I think hibernate-script already includes support for modifying grub's
> >> configuration.
>
> > Yes. It does that _before_ the hibernation begins. ;-)
>
> Either way, it doesn't make much difference. Inside of
> hibernate-script, you need logic like:
>
> if /boot is not mounted: mount /boot
> make change
> umount /boot
>
> If you do it from the "save kernel", you need logic like:
> mount /dev/boot-device /boot (no fstab on "save kernel", most likely)
> make change
> umount /boot.
>
> [snip]
>
> >> As far as I understand it, the swsusp resume path involves the boot
> >> kernel loading the entire image from disk to available memory, then
> >> shutting down all the devices, and copying the memory into place, and
> >> then jumping to the original kernel, which reinitializes devices and
> >> starts tasks running. This isn't very different from what I was
> >> proposing as the alternative anyway, except that: memory is copied once,
> >> which is pretty fast, but means that only up to half of the total memory
> >> can be saved.
>
> > No that's not correct. Actually, during the restore we _can_ load much more
> > than 50% of RAM, everything needed for that is already in place. :-)
>
> I suppose you do that by using more sophisticated logic to atomically
> copy the pages to their final location after loading them from disk. In
> particular, I suppose you must order the page copies carefully to avoid
> clobbering pages that have not yet been copied.

Yes.
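
To illustrate the ordering problem described in the quoted paragraph, here is
a minimal sketch in Python. It is not the swsusp implementation; it only
assumes that each loaded page sits in a temporary frame and must end up in a
distinct target frame, and that one spare frame is available. The names
order_copies, apply_plan and scratch are made up for this example.

```python
def order_copies(moves, scratch):
    """Order page copies so that no pending page is overwritten.

    moves maps source frame -> destination frame (hypothetical model:
    frames are plain integers, destinations are distinct).
    scratch is a spare frame that is neither a source nor a destination.
    Returns a list of (src, dst) copies that is safe to run in order.
    """
    if len(set(moves.values())) != len(moves):
        raise ValueError("destinations must be distinct")
    # Pages already in place need no copy.
    pending = {s: d for s, d in moves.items() if s != d}
    if scratch in pending or scratch in pending.values():
        raise ValueError("scratch frame is in use")

    plan = []
    while pending:
        # A copy is safe if its destination does not hold a page that
        # still has to be copied somewhere else.
        safe = [s for s, d in pending.items() if d not in pending]
        if safe:
            for s in safe:
                plan.append((s, pending.pop(s)))
        else:
            # Only cycles remain: park one page in the scratch frame.
            s = next(iter(pending))
            d = pending.pop(s)
            plan.append((s, scratch))
            pending[scratch] = d
    return plan


def apply_plan(memory, plan):
    """Simulate the copies on a dict of frame -> contents."""
    mem = dict(memory)
    for src, dst in plan:
        mem[dst] = mem[src]
    return mem


if __name__ == "__main__":
    # Frames 10 and 11 must be swapped; 12 -> 13 -> 14 form a chain.
    moves = {10: 11, 11: 10, 12: 13, 13: 14}
    plan = order_copies(moves, scratch=99)
    print(plan)
    print(apply_plan({10: "A", 11: "B", 12: "C", 13: "D"}, plan))
```

Chains are copied from their far end backwards, and a cycle is broken by
parking one page in the spare frame, so a single spare frame is enough in
this simplified model.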

> Seems reasonable. In that case, there is indeed probably no reason to not
> use that approach for resuming.
>
> [snip]
>
> >> The whole reason to want to checkpoint filesystems was so that the
> >> original kernel would remain a fully-functional system with a
> >> fully-functional userspace that can continue to access the filesystems
> >> while the hibernate image is being written. In addition to the lack of
> >> checkpoint support, however, there are a number of other issues that
> >> this would create: Even if you can checkpoint filesystems, you can't
> >> checkpoint the entire world. The kernel will keep acking network
> >> packets, and userspace as well will send any normal replies. If a
> >> document was sent off to be printed right before the checkpoint, it
> >> might end up printing while the image is being saved, and then printed
> >> again when the system resumes.
>
> > That's correct.
>
> >> Fundamentally, I don't think checkpointing is the right answer. What is
> >> desired is a fully functional system with a fully functional userspace
> >> during the image writing. But we don't want this to be the _same_
> >> system that is actually being imaged.
> >>
> >> That is why I think the kexec solution is the elegant solution.
>
> > Frankly, I think it's tricky. ;-)
>
> To me, it seems a lot easier to get right than the current approaches.

Well, maybe. :-)

> > Moreover, I think it would require some problems that we don't even
> > anticipate to be solved.
>
> Possibly. The alternative, though, seems to be to add hack after hack
> to get certain functionality to work.

The problem here is that you can't just anticipate everything and sometimes
you need to add hacks to handle problems that you have not anticipated.

I'm not sure if you're able to avoid adding hacks within the kexec-based
approach.

> >> > I see two basic advantages of your approach:
> >> > 1) We don't need to freeze tasks.
> >> > 2) We can create images larger than 50% of RAM.
> >>
> >> There is also the key benefit of allowing an arbitrary userspace in a
> >> fully functional system to be used to both save and load the image. As
> >> far as I understand, uswsusp allows a single userspace process to run
> >> to handle the loading and saving, but the process runs in a rather
> >> fragile userspace with most things disabled; in particular, this
> >> userspace process can't access a fuse filesystem and probably can't do
> >> other things like fork.
>
> > The user space running on top of the new kernel would be limited by the
> > fact that the old kernel's filesystems would be inaccessible to it. That
> > would, effectively, require the user to have special filesystems for the
> > image-saving kernel and its user space, which isn't very realistic
> > IMO.
>
> Fundamentally, saving of the image can't access any of the normal
> filesystems anyway. The userspace would likely be provided as an
> initramfs or initrd, exactly as is done for userspace resume from
> hibernate currently. The same initramfs could probably be used for both
> saving the image and restoring the image, since exactly the same
> procedure would be used to set up the necessary devices for both the
> save and restore case, and the GUI that is used might also be the same.

Do you realize how much time it would take to implement that?

> >> > Still, I don't think we could implement it quickly and easily.
> >>
> >> It is hard to say how hard it would be. I think a lot of the existing
> >> kexec and hibernate code could be leveraged.
>
> > Yes, I think so, but at least we need to fix the quiescing of devices before
> > we think of implementing that.
>
> It seems like fixing of device stopping/suspend/quiescing is an
> orthogonal issue to the actual hibernate implementation.

Not that much, IMO, but that's not the issue. The issue is that for the
kexec-based approach to work you'll need the devices to be quiesced properly.

> It would probably be most reliable and simplest if on every jump between
> kernels, all devices are fully stopped by the jumping kernel, and then fully
> reinitialized by the jumped-to kernel.

Yes, but that's not what's going on today. So first, let's fix that.

> Presumably the time spent doing this initialization will not be very
> significant compared to the time required to write the image.

You are probably right.

Now, I think that your idea boils down to freezing the kernel along with all
tasks, which is reasonable, albeit extreme. :-)

I think we can consider doing something like this in the long run, but at
present we are not ready to do that.

Besides, I wouldn't like to drop the existing infrastructure that people use
and start over at some other place, because that would be wasteful. Instead,
I'd prefer to improve the existing approach gradually, in such a way that
new code is added in a more-or-less transparent way (of course it's difficult
to avoid breaking things from time to time, but by taking small steps it's easier
to fix bugs along the way).

Greetings,
Rafael

--
"Premature optimization is the root of all evil." - Donald Knuth

2007-06-04 04:41:15

Subject: Re: A kexec approach to hibernation

On Fri, Jun 01, 2007 at 07:54:30PM -0400, Jeremy Maitin-Shepard wrote:
> "Rafael J. Wysocki" <[email protected]> writes:
>
> > On Saturday, 2 June 2007 00:25, Jeremy Maitin-Shepard wrote:
>
> [snip]
>
> >> Just before jumping into the new kernel, with interrupts disabled, the
> >> old kernel could either prepare a data structure that specifies what
> >> pages are allocated, or alternatively simply provide a pointer to the
> >> relevant data structure in the old kernel.
>
> > But for this purpose the old kernel will actually need to do what is currently
> > done in swsusp while the image is being created (the only difference is that
> > we allocate memory in the process, but that's a detail only).
>
> Okay, but creating a list of pages should be extremely easy.
> Alternatively, the "save kernel" might be able to read the existing
> data structures directly.
>

Can't we do it the Kdump way? Kdump creates ELF headers and stores them in
memory. The address of these ELF headers is passed to the second kernel through
a command-line parameter. The ELF headers describe which memory areas need to
be captured by the second kernel.

Can't we adopt a similar raw approach for hibernation? Reserve a memory
area for the second kernel (one that is not used by the first kernel). During
hibernation, load the second kernel into the reserved memory area; the loading
step also determines which physical memory needs to be saved (possibly by
reading /proc/iomem) and creates the ELF headers. The second kernel can then
parse these headers and save the memory. The output file could itself be an
ELF image, so that restoring becomes easier.

I am just wondering: do we have to create a list of pages at all? Can't we
just copy the raw memory to disk and restore it back? Information about
where each chunk of memory should go back would be provided by the ELF headers.
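
As a rough illustration of this idea, the following Python sketch collects the
"System RAM" ranges from /proc/iomem-style text, cuts out the area reserved
for the second kernel, and packs one ELF64 PT_LOAD program header per range.
It is only a sketch under assumptions: the sample text, the function names,
and the choice of leaving p_vaddr at zero are made up here, and this is not
what kexec-tools actually does.

```python
import re
import struct

# "start-end : name"; both addresses are hexadecimal and inclusive.
IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.+?)\s*$")

PT_LOAD = 1          # ELF program header type for a loadable segment
PF_R, PF_W = 4, 2    # ELF segment permission flags

# Made-up sample in the format of /proc/iomem.
SAMPLE_IOMEM = """\
00000000-0009fbff : System RAM
000a0000-000bffff : Video RAM area
00100000-3ffeffff : System RAM
  00100000-002fffff : Kernel code
  01000000-04ffffff : Crash kernel
3fff0000-3fffffff : ACPI Tables
"""


def system_ram_ranges(iomem_text):
    """Return top-level System RAM ranges as inclusive (start, end) pairs."""
    ranges = []
    for line in iomem_text.splitlines():
        if line[:1].isspace():
            continue  # indented lines are nested resources
        m = IOMEM_LINE.match(line)
        if m and m.group(3) == "System RAM":
            ranges.append((int(m.group(1), 16), int(m.group(2), 16)))
    return ranges


def exclude(ranges, hole_start, hole_end):
    """Remove the inclusive range [hole_start, hole_end] from ranges."""
    out = []
    for start, end in ranges:
        if hole_end < start or hole_start > end:
            out.append((start, end))
            continue
        if start < hole_start:
            out.append((start, hole_start - 1))
        if hole_end < end:
            out.append((hole_end + 1, end))
    return out


def pack_pt_load(paddr, size, file_offset):
    """Pack one little-endian Elf64_Phdr describing a physical range."""
    # Field order: p_type, p_flags, p_offset, p_vaddr, p_paddr,
    #              p_filesz, p_memsz, p_align
    return struct.pack("<IIQQQQQQ", PT_LOAD, PF_R | PF_W,
                       file_offset, 0, paddr, size, size, 0)


if __name__ == "__main__":
    ram = system_ram_ranges(SAMPLE_IOMEM)
    to_save = exclude(ram, 0x01000000, 0x04FFFFFF)  # skip the reserved area
    for start, end in to_save:
        print(hex(start), hex(end), len(pack_pt_load(start, end - start + 1, 0)))
```

A saver in the second kernel would only have to walk such headers and copy
each described range, which is the appeal of the raw approach.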

One downside would be the problem of keeping a memory area permanently
reserved; currently this area is reserved when the first kernel boots. Can we
somehow make this reservation dynamic? Something like using hugepage support,
so that we can allocate big chunks of contiguous physical memory (32MB or
64MB) from user space at run time.

Thanks
Vivek

2007-06-04 05:22:41

by Nigel Cunningham

Subject: Re: A kexec approach to hibernation

Hi.

I can see that the idea of writing a kernel image using another
kernel sounds nice and clean initially, but the more we get into the
details (yes, I am listening, even though I said nothing before now),
the more it's sounding like the cure is worse than the disease.

To get rid of process freezing, we're talking about:
* making hibernation depend on depriving the user of 32 or 64M of
otherwise perfectly usable memory (thereby making hibernation on
machines with less memory impossible)
* requiring them to set up kexec or kdump (I don't understand the
difference, sorry) or some new variation
* adding interfaces to tell kexec/dump/whatever what pages need to be
saved and reloaded
* adding convolutions in which at resume time we boot one kernel, switch
to another kernel to do the loading and then switch back again to the
resumed kernel (assuming I understand what you're suggesting).

It all sounds terribly complicated and confusing to me, and that's
before I even begin to think about how this second kernel could possibly
write the image to an encrypted device or LVM or such like that the
first kernel knows about and might use now.

Can't we just get the freezer right and be done with it?

Regards,

Nigel

2007-06-04 07:59:51

by Rafael J. Wysocki

Subject: Re: A kexec approach to hibernation

Hi,

On Monday, 4 June 2007 07:22, Nigel Cunningham wrote:
> Hi.
>
> I can see that the idea of writing a kernel image using another
> kernel sounds nice and clean initially, but the more we get into the
> details (yes, I am listening, even though I said nothing before now),
> the more it's sounding like the cure is worse than the disease.
>
> To get rid of process freezing, we're talking about:
> * making hibernation depend on depriving the user of 32 or 64M of
> otherwise perfectly usable memory (thereby making hibernation on
> machines with less memory impossible)
> * requiring them to set up kexec or kdump (I don't understand the
> difference, sorry) or some new variation
> * adding interfaces to tell kexec/dump/whatever what pages need to be
> saved and reloaded
> * adding convolutions in which at resume time we boot one kernel, switch
> to another kernel to do the loading and then switch back again to the
> resumed kernel (assuming I understand what you're suggesting).
>
> It all sounds terribly complicated and confusing to me, and that's
> before I even begin to think about how this second kernel could possibly
> write the image to an encrypted device or LVM or such like that the
> first kernel knows about and might use now.
>
> Can't we just get the freezer right and be done with it?

My feelings about this are pretty much the same. :-)

At least, there still is room for improvements within the current approach,
so first I'd like to improve it as much as reasonably possible and then to
think of alternatives, if need be.

Greetings,
Rafael

--
"Premature optimization is the root of all evil." - Donald Knuth

2007-06-04 08:14:42

by Nigel Cunningham

Subject: Re: A kexec approach to hibernation

Hi again.

On Mon, 2007-06-04 at 10:05 +0200, Rafael J. Wysocki wrote:
> On Monday, 4 June 2007 07:22, Nigel Cunningham wrote:
> > Hi.
> >
> > I can see that the idea of writing a kernel image using another
> > kernel sounds nice and clean initially, but the more we get into the
> > details (yes, I am listening, even though I said nothing before now),
> > the more it's sounding like the cure is worse than the disease.
> >
> > To get rid of process freezing, we're talking about:
> > * making hibernation depend on depriving the user of 32 or 64M of
> > otherwise perfectly usable memory (thereby making hibernation on
> > machines with less memory impossible)
> > * requiring them to set up kexec or kdump (I don't understand the
> > difference, sorry) or some new variation
> > * adding interfaces to tell kexec/dump/whatever what pages need to be
> > saved and reloaded
> > * adding convolutions in which at resume time we boot one kernel, switch
> > to another kernel to do the loading and then switch back again to the
> > resumed kernel (assuming I understand what you're suggesting).
> >
> > It all sounds terribly complicated and confusing to me, and that's
> > before I even begin to think about how this second kernel could possibly
> > write the image to an encrypted device or LVM or such like that the
> > first kernel knows about and might use now.
> >
> > Can't we just get the freezer right and be done with it?
>
> My feelings about this are pretty much the same. :-)
>
> At least, there still is room for improvements within the current approach,
> so first I'd like to improve it as much as reasonably possible and then to
> think of alternatives, if need be.

Agreed. I'm not for a moment denying that the current freezer could be
better, but biffing it out the window just doesn't seem to be the
appropriate solution at the moment.

Regards,

Nigel

2007-06-04 10:46:36

by Pavel Machek

Subject: Re: A kexec approach to hibernation

Hi!

> >> But kernel threads also rely on userspace, due to e.g. fuse and usermode
> >> helpers.
>
> > Yes, I know that and I think these issues are solvable within the current
> > approach.
>
> It seems like it would be very hard to get writing of an image to a
> fuse filesystem working under the current scheme.
>
> Trying to image a system while it is running seems fundamentally broken.
> As another example, I believe currently although devices are "quiesced"
> or stopped while the atomic snapshot is made, they are all then started
> again afterward while the image is written to disk. As a result, the
> network drivers will continue acking TCP packets that are received after
> the snapshot, but these packets will be lost.
>
> You might claim then that the solution is to simply keep the network
> driver quiesced or stopped. But then it is impossible to write the
> image over the network. The way to get around this problem is to write
> the image over the network using a fresh network stack.

The "fresh network stack" will RST any connections that were going,
which is ugly, too.

> >> Grub, its configuration, and the kernel used to resume the system had
> >> better be on a "safe" filesystem already (i.e. a separate, unmounted
> >> before hibernation /boot).
>
> > Currently, you don't need to do that.
>
> Some people get away with it, but fundamentally it is broken to do so.
> (The fact that the current software suspend implementations tell the
> filesystems to sync to disk increases its chances of working.) You are
> accessing a filesystem that is in an unknown state. Consider that the
> user might make a change to grub.conf, but the kernel caches the write.
> If the filesystem containing grub.conf is left mounted, the write might
> never reach disk before the system is hibernated. As a result, when
> grub attempts to read it, it doesn't get the expected data.

sync is a perfectly safe way of telling the fs to store data on disk.

> >> That is why I think the kexec solution is the elegant solution.
>
> > Frankly, I think it's tricky. ;-)
>
> To me, it seems a lot easier to get right than the current approaches.

Well, you are certainly welcome to create the patch. "suspend3" name
is still free, AFAICT.

If _I_ were willing to add some runtime overhead to make hibernation
simpler, I'd just use some virtualization to do that... with added
advantage of "hibernate here, resume on different hw".
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-06-04 12:42:30

by Matthew Garrett

Subject: Re: A kexec approach to hibernation

On Mon, Jun 04, 2007 at 12:46:21PM +0200, Pavel Machek wrote:

> sync is a perfectly safe way of telling the fs to store data on disk.

On disk, yes. On the filesystem, no. It's valid for the data to be left
in the journal, for instance.

--
Matthew Garrett | [email protected]

2007-06-04 13:10:20

by Pavel Machek

Subject: Re: A kexec approach to hibernation

On Mon 2007-06-04 13:20:54, Matthew Garrett wrote:
> On Mon, Jun 04, 2007 at 12:46:21PM +0200, Pavel Machek wrote:
>
> > sync is a perfectly safe way of telling the fs to store data on disk.
>
> On disk, yes. On the filesystem, no. It's valid for the data to be left
> in the journal, for instance.

Yep... then grub needs to grok the journal. It does for ext3, IIRC.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-06-04 13:16:42

by Matthew Garrett

Subject: Re: A kexec approach to hibernation

On Mon, Jun 04, 2007 at 03:10:00PM +0200, Pavel Machek wrote:
> On Mon 2007-06-04 13:20:54, Matthew Garrett wrote:
> > On Mon, Jun 04, 2007 at 12:46:21PM +0200, Pavel Machek wrote:
> >
> > > sync is a perfectly safe way of telling the fs to store data on disk.
> >
> > On disk, yes. On the filesystem, no. It's valid for the data to be left
> > in the journal, for instance.
>
> Yep... then grub needs to grok the journal. It does for ext3, IIRC.

No, it only supports ext2 (and reading ext3 as if it's ext2). Right now,
the assumption that syncing during suspend will cause data to hit
something grub can read isn't a safe one.
--
Matthew Garrett | [email protected]

2007-06-04 22:04:19

by Jeremy Maitin-Shepard

Subject: Re: A kexec approach to hibernation

On Mon, Jun 04, 2007 at 03:22:20PM +1000, Nigel Cunningham wrote:
> Hi.
>
> I can see that the idea of writing a kernel image using another
> kernel sounds nice and clean initially, but the more we get into the
> details (yes, I am listening, even though I said nothing before now),
> the more it's sounding like the cure is worse than the disease.

I think if we look into the details a bit more, we may find that it is in
fact not worse after all. It would be nice if this approach could also be
implemented in only a few hours of work. Unfortunately I doubt that is the
case, even though I imagine it may be somewhat simpler to implement than the
current swsusp and suspend2 implementations.

Just to give some perspective on the implementation, I believe the following
functions/procedures provided by the kernel to userspace (implemented as system
calls, sysfs files, ioctls, etc.) would be sufficient for this hibernation
approach:

(Note that I wrote this description after writing my responses to the other
points you make, and so it may make more sense for those to be read first.)

1. "start hibernation"
Parameters:
- "save image" kernel to use (either as the binary data or as a path to the
file perhaps);

- extra kernel command-line parameters to the "save image" kernel;

- an initrd for the "save image" kernel (if needed).

This function would result in the original kernel loading the "save image"
kernel into memory, stopping all devices, and jumping to the new kernel.

2. "resume from hibernation"
Parameters:
Somehow the block of memory containing the hibernate image would need to be
provided; it could be specified as a pointer to memory in the process
invoking this function, or alternatively something like /dev/snapshot could
be used.

This function would stop devices, shuffle the pages around in memory, and
jump back to the original kernel.

3. "abort hibernation"
Parameters:
The address to jump back to the original kernel would need to be specified;
the new kernel would know this address because it would be provided as a
kernel command-line parameter.

This function would act similarly to "resume from hibernation", except that
the pages are already in memory exactly where they need to be, so all that
needs to be done is to stop all devices, and jump back to the original
kernel.

If it is desired to do slightly more in the kernel, the "save image" kernel
could process the kernel command-line arguments to determine the pages that
need to be written, and provide a view of them e.g. as /dev/snapshot, rather
than having the userspace under the "save image" kernel do that work and then
perhaps access the pages using /dev/mem.

> To get rid of process freezing, we're talking about:

Note that the advantage of this approach is not just getting rid of process
freezing and its associated problems. There is also the advantage of allowing
much greater flexibility in how the image is written, and avoiding disturbing
things like the network stack.

> * making hibernation depend on depriving the user of 32 or 64M of
> otherwise perfectly usable memory (thereby making hibernation on
> machines with less memory impossible)

It is not clear that this much memory would really need to be reserved. I'll
admit I don't fully understand the requirements for using kexec to load a
kernel. In particular, I don't know how much memory would really be required
to load a kernel to write an image, and to what extent that memory needs to be
contiguous. Even if a significant amount of contiguous physical memory needs
to be reserved at boot, this memory could still perhaps be used for the page
cache by the original kernel, since it could be freed up for hibernation (and
possibly those cached pages could be moved to different memory.)

In the best case, though, a significant amount of contiguous memory would not
be required, in which case a certain amount of memory would need to be freed
only for hibernation, and could be used normally while not hibernating.

(As a side note, with machines typically having 1GB+ of memory these days, even
wasting 64MB of memory is becoming increasingly unimportant, although I agree
it is not a good idea. I actually run an x86 system with 1GB of memory and no
HIGHMEM support, and as a result waste over 100MB of physical memory, which
would handily be free for the new kernel. Changing the VM split broke certain
programs that I didn't feel like fixing.)

> * requiring them to set up kexec or kdump (I don't understand the
> difference, sorry) or some new variation

This new hibernation approach would indeed internally use some or all of the
kexec code, but I don't think this detail would significantly impact the setup
procedure. The only real impact would be that the user would need to somehow
specify how to access the "save image kernel" and the additional kernel
command-line arguments to include. If an initrd is to be used instead of an
initramfs, then that would have to be specified as well. I don't think this
setup requirement is significantly more taxing than having to specify the
path to the user interface program, for instance.

> * adding interfaces to tell kexec/dump/whatever what pages need to be
> saved and reloaded

Any hibernation mechanism needs to know which pages to save. This approach is
no different. The "interface" could likely be one of the following:

1. Just before jumping to the new kernel, with interrupts disabled and devices
already stopped, the original kernel prepares a list of pages to write
somewhere in memory. The old kernel passes the address of this list as a
kernel command-line argument to the new kernel. The initramfs or initrd
userspace (or the kernel itself, although there would be no advantage in doing
this in the kernel) gets this address from the kernel command-line and then
reads that list to determine which pages to write. Presumably preparing the
list would be a small amount of code, and presumably both suspend2 and the
in-kernel swsusp already need to do something like this.

2. The old kernel prepares no new data structures, and simply provides a few
pointers as kernel command-line arguments to the new kernel to the existing
data structures that describe the pages that are used. The code running under
the new kernel responsible for writing the hibernation image simply accesses
these data structures using the pointers from the kernel command-line to
determine which pages to write.
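
To make option 1 more concrete, here is a small Python sketch of what the
userspace side under the new kernel might do. Everything specific in it is an
assumption made for illustration: the parameter name hib_pagelist, the list
layout (a 64-bit count followed by that many 64-bit page frame numbers,
little-endian), and the helper names. No such interface exists today.

```python
import struct

PARAM = "hib_pagelist"  # hypothetical command-line parameter name


def find_pagelist_address(cmdline):
    """Return the physical address given as hib_pagelist=0x..., or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == PARAM and sep:
            return int(value, 16)
    return None


def decode_pagelist(buf):
    """Decode the assumed layout: u64 count, then count u64 PFNs."""
    (count,) = struct.unpack_from("<Q", buf, 0)
    return list(struct.unpack_from(f"<{count}Q", buf, 8))


def coalesce(pfns):
    """Merge page frame numbers into (first_pfn, n_pages) runs."""
    runs = []
    for pfn in sorted(set(pfns)):
        if runs and runs[-1][0] + runs[-1][1] == pfn:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((pfn, 1))
    return runs


if __name__ == "__main__":
    # In a real setup the command line would come from /proc/cmdline and
    # the buffer from reading physical memory at the given address.
    cmdline = "root=/dev/ram0 hib_pagelist=0x1f000000 quiet"
    buf = struct.pack("<Q4Q", 4, 7, 5, 6, 20)
    print(hex(find_pagelist_address(cmdline)))
    print(coalesce(decode_pagelist(buf)))
```

Merging adjacent page frames into runs lets the writer issue a few large
writes instead of one write per page.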

> * adding convolutions in which at resume time we boot one kernel, switch
> to another kernel to do the loading and then switch back again to the
> resumed kernel (assuming I understand what you're suggesting).

This shouldn't actually be necessary. It should be possible to do the resume
in exactly the same way the in-kernel swsusp resumes currently (except that
userspace could be used to actually load the image into memory, and then tell
the kernel to do the necessary manipulations to stop devices, shuffle the
pages around so they are in the right positions, and then jump to the resumed
kernel).

>
> It all sounds terribly complicated and confusing to me, and that's
> before I even begin to think about how this second kernel could possibly
> write the image to an encrypted device or LVM or such like that the
> first kernel knows about and might use now.

I find in some ways it is much simpler than the current approaches. The "save
kernel" has to re-initialize device mapper devices that are needed to write the
image in exactly the same way that the resume kernel needs to reinitialize those
devices. In fact, it could probably use the very same initramfs/initrd code to
do it. The fact that it imposes this symmetry is arguably an advantage.

> Can't we just get the freezer right and be done with it?

The question is: can the freezer ever be right? As far as I can see, no level
of correctness of the freezer is going to allow you to save the hibernation
image to something on a fuse filesystem, because essentially any code that is
run while writing the image needs to live in a special box that is totally
isolated from the rest of the system in order to avoid problems; thus, it seems
like it makes sense to implement this box by simply using a separate kernel,
rather than adding hacks.

--
Jeremy Maitin-Shepard

2007-06-04 22:09:46

by Jeremy Maitin-Shepard

Subject: Re: A kexec approach to hibernation

On Mon, Jun 04, 2007 at 12:46:21PM +0200, Pavel Machek wrote:

> [snip]

> > You might claim then that the solution is to simply keep the network
> > driver quiesced or stopped. But then it is impossible to write the
> > image over the network. The way to get around this problem is to write
> > the image over the network using a fresh network stack.
>
> The "fresh network stack" will RST any connections that were going,
> which is ugly, too.

It will only do this if you bring up the network device with the same IP
address in the new kernel (which you would have no reason to do if you don't
need to write the image over the network.) Maybe the ideal behavior would be
to tell the network stack to just ignore unexpected TCP packets, rather than
send RST, while saving or reading the image, but that is probably not necessary
for most uses and would be a hack.

I also think that sending RST is far better than sending ACK and then silently
tossing out the data, which is what is currently done (since I believe the
network devices are currently brought back up along with all other devices
after the atomic copy is made). Silently losing data is
something that should only occur on a crash. This is likely to actually be a
somewhat serious problem for servers on which hibernate is used to move the
server between rooms without losing connections.

Of course, you can get around this by adding a hack to not bring up network
devices based on some option or other, but that just solves one specific
case with an ugly solution. In contrast, using the kexec approach, the network
device or any other device would quite naturally not be brought back up unless
it was needed for hibernate, and even if it is brought back up, no data are
silently lost.
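
As a rough illustration of the three cases above, here is a toy model in
Python. It is not kernel code, and the policy names are made up for this
sketch; it only records what the remote peer would see for a data segment that
arrives after the snapshot, assuming my understanding of the current behavior
is right.

```python
from enum import Enum


class Policy(Enum):
    # Hypothetical labels for the three cases discussed above.
    OLD_STACK_RESTARTED = 1   # current approach: devices restarted after the atomic copy
    FRESH_STACK_SAME_IP = 2   # kexec'd kernel brings the interface up with the old address
    INTERFACE_DOWN = 3        # kexec'd kernel never brings the interface up


def peer_view(policy):
    """What the remote peer sees for a data segment sent after the snapshot."""
    if policy is Policy.OLD_STACK_RESTARTED:
        # Acknowledged, but the resumed system never saw the data.
        return {"reply": "ACK", "data_lost_silently": True}
    if policy is Policy.FRESH_STACK_SAME_IP:
        # No matching connection in the new stack, so it answers with a reset.
        return {"reply": "RST", "data_lost_silently": False}
    # Nothing answers; the peer keeps retransmitting until resume or timeout.
    return {"reply": None, "data_lost_silently": False}


for p in Policy:
    print(p.name, peer_view(p))
```

Only the first case loses data without the peer being able to notice.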

> [snip]

> > To me, it seems a lot easier to get right than the current approaches.
>
> Well, you are certainly welcome to create the patch. "suspend3" name
> is still free, AFAICT.

I could be sneaky and call it "hibernate". Probably nicer though to use the
name "kexec hibernate" to be later simplified to just "hibernate".

I was hoping that everyone would like the idea so much that they would rush to
implement it, so that I wouldn't have to try. (I haven't written much kernel
code before, and I have a number of other time-requiring projects to work on.)
It looks like that is not too likely to happen though ;).

Maybe I'll try implementing it though, and find that it isn't very much work.

It would be very convenient if the current work being done to improve the
driver interfaces for hibernate also results in the proper interfaces needed
for this approach. It looks like the resume path should be exactly the same
with this approach as with the existing approaches, but the hibernate path is
not exactly the same. In particular, it seems that all devices should be shut
down to a greater extent than merely the quiescing necessary for the current
approaches while making an atomic copy, but also they should not be completely
shut down to the extent that they cannot be restored to the desired state when
resuming or aborting.

>
> If _I_ were willing to add some runtime overhead to make hibernation
> simpler, I'd just use some virtualization to do that... with added
> advantage of "hibernate here, resume on different hw".

I don't believe there is going to be any runtime overhead.

To some extent, (see some of the explanations I gave in the other e-mail I
sent a few minutes ago in reply to Nigel) I think the kexec approach can be
viewed as a cleaner variant of userspace hibernate.

--
Jeremy Maitin-Shepard

2007-06-04 22:36:35

by Nigel Cunningham

Subject: Re: A kexec approach to hibernation

Hi.

On Mon, 2007-06-04 at 18:09 -0400, Jeremy Maitin-Shepard wrote:
> I was hoping that everyone would like the idea so much that they would rush to
> implement it, so that I wouldn't have to try. (I haven't written much kernel
> code before, and I have a number of other time-requiring projects to work on.)
> It looks like that is not too likely to happen though ;).

I spent some time, last I think, seriously considering this approach.
The more I thought about the details, the more I realised that it wasn't
a viable approach. As I said before, it does indeed sound like a dream
at first, but once you get into the details, it becomes more and more of
a nightmare.

> Maybe I'll try implementing it though, and find that it isn't very much work.

Perhaps that would be a good idea. Then you'll get to see those issues
too.

[...]

> To some extent, (see some of the explanations I gave in the other e-mail I
> sent a few minutes ago in reply to Nigel) I think the kexec approach can be
> viewed as a cleaner variant of userspace hibernate.

I'm not going to bother saying more in response to that at the moment.
It seems clear to me that the three of us who've actually worked on
hibernation and thought about the issues actually know nothing, and
everyone who hasn't worked on it is far more expert than us.

I'm not saying that I think it's utterly impossible to use kexec for
hibernation. I am saying that I think such an implementation would be
even more of a headache than the existing issues.

Regards,

Nigel

2007-06-04 22:51:57

by Pavel Machek

Subject: Re: A kexec approach to hibernation

Hi!

> > > To me, it seems a lot easier to get right than the current approaches.
> >
> > Well, you are certainly welcome to create the patch. "suspend3" name
> > is still free, AFAICT.
>
> I could be sneaky and call it "hibernate". Probably nicer though to use the
> name "kexec hibernate" to be later simplified to just "hibernate".
>
> I was hoping that everyone would like the idea so much that they would rush to
> implement it, so that I wouldn't have to try. (I haven't written

That apparently did not happen, that much should be clear by now.

> > If _I_ were willing to add some runtime overhead to make hibernation
> > simpler, I'd just use some virtualization to do that... with added
> > advantage of "hibernate here, resume on different hw".
>
> I don't believe there is going to be any runtime overhead.

64MB less memory seems like runtime overhead to me. If you know how
to do kexec without pre-reserving memory, I believe kexec/kdump team
will be interested.

> To some extent, (see some of the explanations I gave in the other e-mail I
> sent a few minutes ago in reply to Nigel) I think the kexec approach can be
> viewed as a cleaner variant of userspace hibernate.

It also can be viewed as vaporware.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-06-05 08:23:39

by Xavier Bestel

Subject: Re: A kexec approach to hibernation

On Tue, 2007-06-05 at 08:36 +1000, Nigel Cunningham wrote:
> I spent some time, last I think, seriously considering this approach.
> The more I thought about the details, the more I realised that it wasn't
> a viable approach. As I said before, it does indeed sound like a dream
> at first, but once you get into the details, it becomes more and more of
> a nightmare.

From very far, it looks like apm suspend (i.e. an "external" system
taking control of the computer for hibernation and resuming).
FWIW, on my old laptop apm beats any kernel solution hands down in terms
of speed and robustness. Not that this means anything for kexec-suspend.

Xav

2007-06-05 09:35:37

by Stefan Seyfried

Subject: Re: A kexec approach to hibernation

On Tue, Jun 05, 2007 at 10:15:41AM +0200, Xavier Bestel wrote:
> FWIW, on my old laptop apm beats any kernel solution hands down in terms
> of speed

This might be true on 64MB systems. It is surely not true on multi-Gigabyte-
RAM setups. At least not if you actually use that memory for anything
including filesystem cache.
And you simply cannot buy a new machine today that still supports APM suspend
to disk.
--
Stefan Seyfried
QA / R&D Team Mobile Devices | "Any ideas, John?"
SUSE LINUX Products GmbH, Nürnberg | "Well, surrounding them's out."

This footer brought to you by insane German lawmakers:
SUSE Linux Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)

2007-06-05 09:41:17

by Xavier Bestel

Subject: Re: A kexec approach to hibernation

On Tue, 2007-06-05 at 11:34 +0200, Stefan Seyfried wrote:
> On Tue, Jun 05, 2007 at 10:15:41AM +0200, Xavier Bestel wrote:
> > FWIW, on my old laptop apm beats any kernel solution hands down in terms
> > of speed
>
> This might be true on 64MB systems. It is surely not true on multi-Gigabyte-
> RAM setups. At least not if you actually use that memory for anything
> including filesystem cache.
> And you simply cannot buy a new machine today that still supports APM suspend
> to disk.

I don't contest that. I just say that technically, an "external kernel"
can suspend/hibernate a laptop very well.

Xav

2007-06-11 02:03:28

by H. Peter Anvin

Subject: Re: A kexec approach to hibernation

Matthew Garrett wrote:
> No, it only supports ext2 (and reading ext3 as if it's ext2). Right now,
> the assumption that syncing during suspend will cause data to hit
> something grub can read isn't a safe one.

I brought this issue up quite a few years ago at an OLS BOF. We pretty
much need a "supersync" system call; you can do this by bmapping any
file on ext3, but having something supported across filesystems would be
good.

-hpa

2007-06-11 03:41:01

by Nigel Cunningham

Subject: Re: A kexec approach to hibernation

Hi.

On Fri, 2007-06-01 at 21:54 -0400, Jeremy Maitin-Shepard wrote:
> "Rafael J. Wysocki" <[email protected]> writes:
>
> >> But kernel threads also rely on userspace, due to e.g. fuse and usermode
> >> helpers.
>
> > Yes, I know that and I think these issues are solvable within the current
> > approach.
>
> It seems like it would be very hard to get writing of an image to a
> fuse filesystem working under the current scheme.
>
> Trying to image a system while it is running seems fundamentally broken.
> As another example, I believe currently although devices are "quiesced"
> or stopped while the atomic snapshot is made, they are all then started
> again afterward while the image is written to disk. As a result, the
> network drivers will continue acking TCP packets that are received after
> the snapshot, but these packets will be lost.

Trying to image a system to a fuse filesystem is indeed fundamentally
broken. The problem is really that we have to make choices about what we
will and won't support.

We can have suspending to fuse filesystems, but only if we have running
userspace (which in turn implies either limiting the image to half of
memory or compressing a larger image as it's copied so that it fits in
the remaining space). We could have fuse from kexec, but then setting it
up will be... interesting.

We can have suspending to a network, but yes, we will want/need to be
selective about how network connections are handled.

I agree that the best solution seems to be selective resuming of devices
for writing the atomic copy. I had a patch to do that long ago, but it
wasn't a popular idea at the time. Since then I've focused more on
minimising the Suspend2 patch, so it's been dropped.

> You might claim then that the solution is to simply keep the network
> driver quiesced or stopped. But then it is impossible to write the
> image over the network. The way to get around this problem is to write
> the image over the network using a fresh network stack.

Or teach the driver stack about the difference/reset it. Remember that
even if you get a fresh network stack, you'll still be getting packets
for the old stack. Getting a new ip (assuming one is available) won't
stop other connections getting killed, either because we send resets
from the kexec'd kernel, or because they timeout looking for the old ip.

I can see that kexec does provide a nice, clean separation of context
from that of the kernel being hibernated. But it also deprives us of the
ability to easily use context in the hibernating kernel such as
encrypted devices and network connections & configuration. Do you have
some way in mind that could be utilised to overcome these limitations?

[..]

> Some people get away with it, but fundamentally it is broken to do so.
> (The fact that the current software suspend implementations tell the
> filesystems to sync to disk increases its chances of working.) You are
> accessing a filesystem that is in an unknown state. Consider that the
> user might make a change to grub.conf, but the kernel caches the write.
> If the filesystem containing grub.conf is left mounted, the write might
> never reach disk before the system is hibernated. As a result, when
> grub attempts to read it, it doesn't get the expected data.
>
> >> >> This shouldn't be a significant problem in practice.
> >>
> >> > I don't agree here.
> >>
> >> I think hibernate-script already includes support for modifying grub's
> >> configuration.
>
> > Yes. It does that _before_ the hibernation begins. ;-)
>
> Either way, it doesn't make much difference. Inside of
> hibernate-script, you need logic like:
>
> if /boot is not mounted: mount /boot
> make change
> umount /boot
>
> If you do it from the "save kernel", you need logic like:
> mount /dev/boot-device /boot (no fstab on "save kernel", most likely)
> make change
> umount /boot.

Doesn't the unmount do everything required to sync the data?

> [snip]
>
> >> As far as I understand it, the swsusp resume path involves the boot
> >> kernel loading the entire image from disk to available memory, then
> >> shutting down all the devices, and copying the memory into place, and
> >> then jumping to the original kernel, which reinitializes devices and
> >> starts tasks running. This isn't very different from what I was
> >> proposing as the alternative anyway, except that: memory is copied once,
> >> which is pretty fast, but means that only up to half of the total memory
> >> can be saved.
>
> > No that's not correct. Actually, during the restore we _can_ load much more
> > than 50% of RAM, everything needed for that is already in place. :-)
>
> I suppose you do that by using more sophisticated logic to atomically
> copy the pages to their final location after loading them from disk. In
> particular, I suppose you must order the page copies carefully to avoid
> clobbering pages that have not yet been copied. Seems reasonable. In
> that case, there is indeed probably no reason to not use that approach
> for resuming.

For Suspend2, I do something similar but simpler. If a page can be
loaded directly to the final address, do so. The only pages that need to
be loaded to another address and then restored are those that are used
by the loading kernel. We don't have to worry about copying pages back
in a particular order.

> [snip]
>
> >> The whole reason to want to checkpoint filesystems was so that the
> >> original kernel would remain a fully-functional system with a
> >> fully-functional userspace that can continue to access the filesystems
> >> while the hibernate image is being written. In addition to the lack of
> >> checkpoint support, however, there are a number of other issues that
> >> this would create: Even if you can checkpoint filesystems, you can't
> >> checkpoint the entire world. The kernel will keep acking network
> >> packets, and userspace as well will send any normal replies. If a
> >> document was sent off to be printed right before the checkpoint, it
> >> might end up printing while the image is being saved, and then printed
> >> again when the system resumes.
>
> > That's correct.
>
> >> Fundamentally, I don't think checkpointing is the right answer. What is
> >> desired is a fully functional system with a fully functional userspace
> >> during the image writing. But we don't want this to be the _same_
> >> system that is actually being imaged.
> >>
> >> That is why I think the kexec solution is the elegant solution.
>
> > Frankly, I think it's tricky. ;-)
>
> To me, it seems a lot easier to get right than the current approaches.

But you can't get what you said you wanted - a fully functional system
with a fully functional userspace isn't possible. You're running a
different kernel and can't safely mount filesystems that were mounted by
the first kernel. You'll have to set up a limited userspace that runs
from some sort of initrd/ramfs and will end up (so far as I can see now)
with similar restrictions to what we have now with uswsusp or suspend2's
userui. (Reads more... oh, I see you said that below :>)

> > Moreover, I think it would require some problems that we don't even
> > anticipate to be solved.
>
> Possibly. The alternative, though, seems to be to add hack after hack
> to get certain functionality to work.

As I argued above, both systems involve some degree of 'hack'. Kexec
only seems clean until you realise that you wanted some of the context
you just switched away from.

Regards,

Nigel

2007-06-11 15:03:47

by Jeremy Maitin-Shepard

Subject: Re: A kexec approach to hibernation

Nigel Cunningham writes:

[snip]

> Trying to image a system to a fuse filesystem is indeed fundamentally
> broken. The problem is really that we have to make choices about what we
> will and won't support.

> We can have suspending to fuse filesystems, but only if we have
> running userspace (which in turn implies either limiting the image to
> half of memory or compressing a larger image as it's copied so that it
> fits in the remaining space).

> We could have fuse from kexec, but then setting it
> up will be... interesting.

> We can have suspending to a network, but yes, we will want/need to be
> selective about how network connections are handled.

> I agree that the best solution seems to be selective resuming of devices
> for writing the atomic copy. I had a patch to do that long ago, but it
> wasn't a popular idea at the time.

I'd argue that the kexec approach does provide a fairly clean way to
selectively load device drivers --- simply leave out or keep as unloaded
modules the drivers that you don't want to load under the "save" and
"load" kernels.

>> You might claim then that the solution is to simply keep the network
>> driver quiesced or stopped. But then it is impossible to write the
>> image over the network. The way to get around this problem is to write
>> the image over the network using a fresh network stack.

> Or teach the driver stack about the difference/reset it. Remember that
> even if you get a fresh network stack, you'll still be getting packets
> for the old stack. Getting a new ip (assuming one is available) won't
> stop other connections getting killed, either because we send resets
> from the kexec'd kernel, or because they timeout looking for the old
> ip.

I could be mistaken, but I think that bringing up the network interface
with a different IP address would prevent it from resetting existing TCP
connections, because it would never receive the packets for those
existing connections.

> I can see that kexec does provide a nice, clean separation of context
> from that of the kernel being hibernated. But it also deprives us of the
> ability to easily use context in the hibernating kernel such as
> encrypted devices and network connections & configuration. Do you have
> some way in mind that could be utilised to overcome these limitations?

The reason I don't think this need to "re-setup" the context for
suspending should be a significant problem in practice is that the setup
required under the "save kernel" should be exactly the same as that
required under the "load kernel". In particular, it should likely be
possible to re-use exactly the same code (in the initrd/initramfs) to
locate the desired device, and/or perform any necessary device mapper
commands to create the necessary devices. In the more complex case,
this "setup" might require setting up a network connection and/or
mounting a fuse filesystem.

> [snip]

>> if /boot is not mounted: mount /boot
>> make change
>> umount /boot
>>
>> If you do it from the "save kernel", you need logic like:
>> mount /dev/boot-device /boot (no fstab on "save kernel", most likely)
>> make change
>> umount /boot.

> Doesn't the unmount do everything required to sync the data?

Yes it does. The issue is that some people might not have /boot as a
separate partition, and have it as part of the root filesystem instead,
for instance. In that case, grub is effectively accessing a dirty
mounted filesystem. In practice, sync basically takes care of it, but
in theory it shouldn't really be done.
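
To spell out the two cases, here is a small Python sketch that only builds the
command list and does not run anything. The function name is made up,
/dev/boot-device is the placeholder from the pseudocode above, and
"<make change>" stands for whatever edits grub's configuration.

```python
def plan_grub_update(boot_is_separate_fs, boot_mounted, device="/dev/boot-device"):
    """Return (commands, clean_for_grub) without executing anything."""
    change = ["<make change>"]  # placeholder, not a real command
    if boot_is_separate_fs:
        commands = []
        if not boot_mounted:
            commands.append(["mount", device, "/boot"])
        commands += [change, ["umount", "/boot"]]
        # umount writes everything out, so grub reads a clean filesystem.
        return commands, True
    # /boot is part of the root filesystem, which stays mounted: sync is
    # the best we can do, and grub then reads a dirty mounted filesystem.
    return [change, ["sync"]], False


print(plan_grub_update(True, False))
print(plan_grub_update(False, True))
```

The second return value is False exactly in the case I am worried about.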

> [snip]

>> I suppose you do that by using more sophisticated logic to atomically
>> copy the pages to their final location after loading them from disk. In
>> particular, I suppose you must order the page copies carefully to avoid
>> clobbering pages that have not yet been copied. Seems reasonable. In
>> that case, there is indeed probably no reason to not use that approach
>> for resuming.

> For Suspend2, I do something similar but simpler. If a page can be
> loaded directly to the final address, do so. The only pages that need
> to be loaded to another address and then restored are those that are
> used by the loading kernel. We don't have to worry about copying
> pages back in a particular order.

What about the pages that couldn't be loaded back to their final address
because their final address is used by another page that couldn't be
loaded to its final address? Maybe you have some way to prevent this from
happening; it is just something that occurred to me. (It isn't
important anyway though.)
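
For what it's worth, one way to avoid it would be to pick the temporary
locations only from frames that are neither used by the loading kernel nor the
final home of any saved page. A sketch in Python under that assumption (I
don't know whether Suspend2 does anything like this; the function and the flat
frame-number model are hypothetical):

```python
def plan_restore(image_frames, loader_frames, total_frames):
    """Split saved pages into direct loads and staged loads.

    image_frames:  frame numbers the saved pages must end up in
    loader_frames: frame numbers the loading kernel is using
    total_frames:  number of physical frames (flat toy model)
    """
    image_frames = set(image_frames)
    loader_frames = set(loader_frames)
    # Pages whose final frame is free can be read straight into place.
    direct = sorted(image_frames - loader_frames)
    # Pages whose final frame is occupied by the loader must be staged.
    conflicting = sorted(image_frames & loader_frames)
    # Scratch frames are never a final destination, so a staged page can
    # never sit where another saved page needs to go.
    scratch = [f for f in range(total_frames)
               if f not in loader_frames and f not in image_frames]
    if len(scratch) < len(conflicting):
        raise MemoryError("not enough scratch frames for staged pages")
    staged = dict(zip(conflicting, scratch))  # final frame -> temporary frame
    return direct, staged


direct, staged = plan_restore({0, 1, 2, 5}, {0, 1}, 8)
print(direct, staged)
```

With this choice the staged pages can be copied back in any order, at the cost
of failing when there are too few such scratch frames.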

I suppose in any case, we can see that resuming would be essentially the
same under the kexec approach as under the current approach.

> [snip]

>> To me, it seems a lot easier to get right than the current approaches.

> But you can't get what you said you wanted - a fully functional system
> with a fully functional userspace isn't possible. You're running a
> different kernel and can't safely mount filesystems that were mounted by
> the first kernel. You'll have to set up a limited userspace that runs
> from some sort of initrd/ramfs and will end up (so far as I can see now)
> with similar restrictions to what we have now with uswsusp or suspend2's
> userui. (Reads more... oh, I see you said that below :>)

Well, it is fully functional in the sense that everything works as
advertised. I don't know exactly how uswsusp works, but the kexec
approach would have the advantage that you don't have to follow any
special rules like:

- better not write to the mounted filesystems, or you'll corrupt things

- better not try to talk to any other processes, because they're frozen
and you'll just hang

- better not fork any other processes, because only specially listed
processes get to run (maybe this isn't the case, I don't know).

Essentially, with the current approaches, you end up with two
independent userspaces anyway, but you just try to run them under a
single kernel (and really it would be preferable to have two independent
kernel spaces as well in the case of certain device drivers, but of
course this cannot be done under one kernel, hence the reason for
kexec).

>> > Moreover, I think it would require some problems that we don't even
>> > anticipate to be solved.
>>
>> Possibly. The alternative, though, seems to be to add hack after hack
>> to get certain functionality to work.

> As I argued above, both systems involve some degree of 'hack'. Kexec
> only seems clean until you realise that you wanted some of the context
> you just switched away from.

(Perhaps see my comments above.)

Also, perhaps see the reply to Pavel about the need to reserve memory,
which I'm about to write. ;)

Please don't take my comments in this thread too harshly. I'm not
trying to undermine the work that you and the other hibernate
developers have done. I just think this kexec approach is an
interesting idea, and I brought it up so that it might get explored. I
still don't know if it actually makes sense (although I've managed to
mostly convince myself), and discussing it with you and the other
hibernate developers helps in figuring that out. If I didn't strongly
advocate it, it wouldn't get any thought.

--
Jeremy Maitin-Shepard

2007-06-11 15:18:17

by Jeremy Maitin-Shepard

Subject: Re: A kexec approach to hibernation

Pavel Machek writes:

[snip]

>> > If _I_ were willing to add some runtime overhead to make hibernation
>> > simpler, I'd just use some virtualization to do that... with added
>> > advantage of "hibernate here, resume on different hw".
>>
>> I don't believe there is going to be any runtime overhead.

> 64MB less memory seems like runtime overhead to me. If you know how
> to do kexec without pre-reserving memory, I believe kexec/kdump team
> will be interested.

The main reason kdump needs to reserve memory at boot is that it needs
to preload the crashdump kernel into memory so that it will be available
on panic (and however much memory the crashdump kernel will need to run
will also need to be available at all times, since a panic can occur at
any time). In addition, no attempt is made to shut down devices on
panic, so devices may clobber existing memory with ongoing DMA, and the
crashdump kernel must therefore use a reserved area of memory.

For hibernate via kexec, however, these issues do not exist. The
simplest solution would be to back up the first, say, 16MB or 64MB
(or however much is desired for the "save" kernel to have) of memory
into free pages just before copying the "save" kernel into the desired
position and jumping to it.

Due to the speed of memory copying, this should not add any significant
overhead.
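
A toy model of that sequence in Python, with memory as a flat bytearray. The
function names are hypothetical, the backup goes to one contiguous free area
rather than scattered free pages to keep it short, and the bandwidth passed to
the estimate is an assumed figure, not a measurement.

```python
def swap_in_save_kernel(memory, region_size, kernel_image, free_offset):
    """Back up memory[0:region_size] to a free area, then place the kernel at 0."""
    if len(kernel_image) > region_size:
        raise ValueError("kernel image does not fit in the region")
    if free_offset < region_size or free_offset + region_size > len(memory):
        raise ValueError("backup area overlaps the region or exceeds memory")
    # Step 1: copy the low region into the free area.
    memory[free_offset:free_offset + region_size] = memory[0:region_size]
    # Step 2: copy the "save" kernel into the now backed-up low region.
    memory[0:len(kernel_image)] = kernel_image
    return free_offset


def restore_region(memory, region_size, backup_offset):
    """Undo the swap by copying the backup over the low region again."""
    memory[0:region_size] = memory[backup_offset:backup_offset + region_size]


def copy_time_ms(region_mib, bandwidth_mib_per_s):
    """Time for one copy of the region at the given (assumed) bandwidth."""
    return 1000.0 * region_mib / bandwidth_mib_per_s


mem = bytearray(b"ABCDEFGH" + bytes(8))
off = swap_in_save_kernel(mem, 4, b"kk", 8)
restore_region(mem, 4, off)
print(bytes(mem[:8]), copy_time_ms(64, 1024))
```

With an assumed copy rate of 1 GiB/s, copy_time_ms(64, 1024) gives 62.5 ms for
a 64MB region.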

[snip]

--
Jeremy Maitin-Shepard

2007-06-11 15:45:46

by Xavier Bestel

Subject: Re: A kexec approach to hibernation

On Mon, 2007-06-11 at 11:01 -0400, Jeremy Maitin-Shepard wrote:
> >> You might claim then that the solution is to simply keep the network
> >> driver quiesced or stopped. But then it is impossible to write the
> >> image over the network. The way to get around this problem is to write
> >> the image over the network using a fresh network stack.
>
> > Or teach the driver stack about the difference/reset it. Remember that
> > even if you get a fresh network stack, you'll still be getting packets
> > for the old stack. Getting a new ip (assuming one is available) won't
> > stop other connections getting killed, either because we send resets
> > from the kexec'd kernel, or because they timeout looking for the old
> > ip.
>
> I could be mistaken, but I think that bringing up the network interface
> with a different IP address would prevent it from resetting existing TCP
> connections, because it would never receive the packets for those
> existing connections.

That can't work. There are networks where the client must have a fixed
IP, or must accept the address given by dhcp in order to talk to
fileservers. And you still have the same mac address, which may cause
problems.

Xav

2007-06-11 15:51:39

by Jeremy Maitin-Shepard

Subject: Re: A kexec approach to hibernation

Xavier Bestel writes:

> On Mon, 2007-06-11 at 11:01 -0400, Jeremy Maitin-Shepard wrote:
>> >> You might claim then that the solution is to simply keep the network
>> >> driver quiesced or stopped. But then it is impossible to write the
>> >> image over the network. The way to get around this problem is to write
>> >> the image over the network using a fresh network stack.
>>
>> > Or teach the driver stack about the difference/reset it. Remember that
>> > even if you get a fresh network stack, you'll still be getting packets
>> > for the old stack. Getting a new ip (assuming one is available) won't
>> > stop other connections getting killed, either because we send resets
>> > from the kexec'd kernel, or because they timeout looking for the old
>> > ip.
>>
>> I could be mistaken, but I think that bringing up the network interface
>> with a different IP address would prevent it from resetting existing TCP
>> connections, because it would never receive the packets for those
>> existing connections.

> That can't work. There are networks where the client must have a fixed
> IP, or must accept the address given by dhcp in order to talk to
> fileservers. And you still have the same mac address, which may cause
> problems.

I wasn't suggesting that using a different IP address would be a general
solution. It might be a solution for a few people.

In general, I'd imagine that most people would not bring up the network
interface at all, and most of the people that do would bring it up with
the same IP address, causing some existing TCP connections to possibly
be reset.

I think that causing connections to be reset is, however, far better
than acking packets that are then silently thrown away.

--
Jeremy Maitin-Shepard

2007-06-11 16:03:53

by Xavier Bestel

Subject: Re: A kexec approach to hibernation

On Mon, 2007-06-11 at 11:51 -0400, Jeremy Maitin-Shepard wrote:
> Xavier Bestel writes:
>
> > On Mon, 2007-06-11 at 11:01 -0400, Jeremy Maitin-Shepard wrote:
> >> >> You might claim then that the solution is to simply keep the network
> >> >> driver quiesced or stopped. But then it is impossible to write the
> >> >> image over the network. The way to get around this problem is to write
> >> >> the image over the network using a fresh network stack.
> >>
> >> > Or teach the driver stack about the difference/reset it. Remember that
> >> > even if you get a fresh network stack, you'll still be getting packets
> >> > for the old stack. Getting a new ip (assuming one is available) won't
> >> > stop other connections getting killed, either because we send resets
> >> > from the kexec'd kernel, or because they timeout looking for the old
> >> > ip.
> >>
> >> I could be mistaken, but I think that bringing up the network interface
> >> with a different IP address would prevent it from resetting existing TCP
> >> connections, because it would never receive the packets for those
> >> existing connections.
>
> > That can't work. There are networks where the client must have a fixed
> > IP, or must accept the address given by dhcp in order to talk to
> > fileservers. And you still have the same mac address, which may cause
> > problems.
>
> I wasn't suggesting that using a different IP address would be a general
> solution. It might be a solution for a few people.
>
> In general, I'd imagine that most people would not bring up the network
> interface at all, and most of the people that do would bring it up with
> the same IP address, causing some existing TCP connections to possibly
> be reset.
>
> I think that causing connections to be reset is, however, far better
> than acking packets that are then silently thrown away.

If I were helping you code, I'd suggest concentrating only on getting
your project to work on standard filesystems, and then when it works maybe
think about suspending on crypto-over-loop-over-fuse-over-vpn-over-wifi.
But talk is cheap so I'm shutting up. Right now. :)

Xav

2007-06-11 17:29:07

by Jeremy Maitin-Shepard

Subject: Re: A kexec approach to hibernation

Xavier Bestel writes:

[snip]

> If I were helping you code, I'd suggest concentrating only on getting
> your project to work on standard filesystems, and then when it works maybe
> think about suspending on crypto-over-loop-over-fuse-over-vpn-over-wifi.
> But talk is cheap so I'm shutting up. Right now. :)

Well, the whole idea of the kexec approach is that the hibernate system
doesn't need to know anything at all about filesystems or any particular
device. So if it works at all, it will work for
crypto-over-loop-over-fuse-over-vpn-over-wifi
-over-pigeon-carrier-protocol-over-printer-and-scanner.
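
To illustrate: in the "save" kernel's userspace the writer could be as dumb as
the following Python sketch. The record format (frame number, length, data)
and the function name are invented here; the point is only that `out` can be
any writable binary stream, however it is backed.

```python
import io
import struct


def write_image(pages, out):
    """Write (frame_number, data) records to any writable binary stream."""
    count = 0
    for frame, data in pages:
        # Hypothetical record header: 8-byte frame number, 4-byte length.
        out.write(struct.pack("<QI", frame, len(data)))
        out.write(data)
        count += 1
    return count


# An in-memory buffer stands in for a disk, a socket, or a fuse file.
buf = io.BytesIO()
written = write_image([(3, b"abcd"), (7, b"xy")], buf)
print(written, len(buf.getvalue()))
```

Nothing in the writer depends on what sits behind the stream.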

--
Jeremy Maitin-Shepard

2007-06-11 22:44:35

by Nigel Cunningham

Subject: Re: A kexec approach to hibernation

Hi.

On Sun, 2007-06-10 at 19:02 -0700, H. Peter Anvin wrote:
> Matthew Garrett wrote:
> > No, it only supports ext2 (and reading ext3 as if it's ext2). Right now,
> > the assumption that syncing during suspend will cause data to hit
> > something grub can read isn't a safe one.
>
> I brought this issue up quite a few years ago at an OLS BOF. We pretty
> much need a "supersync" system call; you can do this by bmapping any
> file on ext3, but having something supported across filesystems would be
> good.

Sounds like a good idea to me.

Regards,

Nigel
