2010-06-04 22:05:21

by Pavel Machek

Subject: [SUSPECTED SPAM] Re: [linux-pm] Proposal for a new algorithm for reading & writing a hibernation image.



"Nigel Cunningham" <[email protected]> wrote:

>Hi.
>
>On 30/05/10 15:25, Pavel Machek wrote:
>> Hi!
>>
>>> 2. Prior to writing any of the image, also set up new 4k page tables
>>> such that an attempt to make a change to any of the pages we're about to
>>> write to disk will result in a page fault, giving us an opportunity to
>>> flag the page as needing an atomic copy later. Once this is done, write
>>> protection for the page can be disabled and the write that caused the
>>> fault allowed to proceed.
>>
>> Tricky.
>>
>> page faulting code touches memory, too...
>
>Yeah. I realise we'd need to make the pages that are used to record the
>faults be unprotected themselves. I'm imagining a bitmap for that.
>
>Do you see any reason that it could be inherently impossible? That's
>what I really want to know before (potentially) wasting time trying it.

I'm not sure it is impossible, but it certainly seems way too complex to be
practical.

2MB pages will probably present a problem, as will BAT mappings on PowerPC.
--
Sent from my Android phone with K-9. Please excuse my brevity.


2010-06-04 23:39:43

by Maxim Levitsky

Subject: Re: [SUSPECTED SPAM] Re: [linux-pm] Proposal for a new algorithm for reading & writing a hibernation image.

On Thu, 2010-06-03 at 16:50 +0200, Pavel Machek wrote:
>
> "Nigel Cunningham" <[email protected]> wrote:
>
> >Hi.
> >
> >On 30/05/10 15:25, Pavel Machek wrote:
> >> Hi!
> >>
> >>> 2. Prior to writing any of the image, also set up new 4k page tables
> >>> such that an attempt to make a change to any of the pages we're about to
> >>> write to disk will result in a page fault, giving us an opportunity to
> >>> flag the page as needing an atomic copy later. Once this is done, write
> >>> protection for the page can be disabled and the write that caused the
> >>> fault allowed to proceed.
> >>
> >> Tricky.
> >>
> >> page faulting code touches memory, too...
> >
> >Yeah. I realise we'd need to make the pages that are used to record the
> >faults be unprotected themselves. I'm imagining a bitmap for that.
> >
> >Do you see any reason that it could be inherently impossible? That's
> >what I really want to know before (potentially) wasting time trying it.
>
> I'm not sure it is impossible, but it certainly seems way too complex to be
> practical.
>
> 2mb pages will probably present a problem, as will bat mappings on powerpc.


Some time ago, after TuxOnIce twice caused moderate fs corruption on my
root filesystem (the superblock gone, for example), I too was thinking
about how to make it safe to save the whole of memory.
Your TuxOnIce is so fast that it resembles suspend to RAM.


I have a radically different proposal.


Let's create a kind of self-contained, very small operating system that
knows how to do just one thing: write memory to disk.
From now on I am calling this OS a suspend module.
Physically, its code can be contained in the Linux kernel or loaded as a
module.


Let's see how things would work first:

1. Linux loads the suspend module into memory (if it is inside the
kernel image, that becomes unnecessary).

At that point, it's even possible to add user plug-ins to the module,
for example to draw a splash screen. Of course, all such plug-ins must
be root-approved.


2. Linux turns off all devices except the hard disk.
Drivers for hard drives will register for this exception.


3. Linux creates a list of memory areas to save (or to exclude from the
save, it doesn't matter).

4. Linux creates a list of hard disk sectors that will contain the
image.
This ensures support for swap partitions and swap files alike.

5. Linux allocates a small 'scratch space'.
Of course, if memory is very tight, some swapping can happen, but that
isn't significant.


6. Linux creates new page tables that map the suspend module, both of
the above lists, the scratch space, and (optionally) the framebuffer
read-write, and the rest of memory read-only.

7. Linux switches to the new page tables and passes control to the
module. Even if the module wanted to, it wouldn't be able to change
system memory. It won't even know how to do so.

8. The module optionally encrypts and/or compresses the data (saving the
result to the scratch space).

9. The module uses very simplified disk drivers to write the memory to
disk. These drivers can even omit using interrupts, because there is
nothing else to do.
It can also draw a progress bar on the framebuffer using an optional
plug-in.

10. The module passes control back to Linux, which just shuts the system
off.
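
A rough sketch of what the Linux-to-module handoff could look like;
every name below is hypothetical, invented purely to illustrate steps
3-10, and nothing like this interface exists today:

/* Hypothetical descriptor Linux would fill in (steps 3-6) before
 * jumping into the suspend module (step 7). Illustrative only. */
struct suspend_module_handoff {
        /* Step 3: physical memory ranges to write out. */
        struct save_range {
                unsigned long start_pfn;
                unsigned long nr_pages;
        } *ranges;
        unsigned int nr_ranges;

        /* Step 4: pre-allocated disk sectors that will receive the
         * image, so swap partitions and swap files both work. */
        unsigned long long *sectors;
        unsigned long nr_sectors;

        /* Step 5: the only memory the module may write to (plus,
         * optionally, the framebuffer, for a progress bar). */
        void *scratch;
        unsigned long scratch_bytes;

        /* Step 9: minimal polling-mode entry point supplied by
         * whichever simplified disk driver claimed the device. */
        int (*write_sectors)(unsigned long long first_sector,
                             const void *buf, unsigned int nr);
};

/* Step 7: called with the restricted page tables already loaded;
 * returns to Linux, which then powers the system off (step 10). */
int suspend_module_entry(struct suspend_module_handoff *handoff);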

Now, what code will be in the module:

1. Optional compression & encryption - easy.
2. Drawing modules, also optional and easy.


3. New disk drivers.
This is the hard part, but if we cover libata and AHCI, we will cover
the common case.
Other cases can be handled by the existing code that saves 1/2 of RAM.


4. Arch-specific code. Since it deals with neither interrupts nor memory
management, it won't be a lot of code.
Again, standard swsusp can be used on arches the module hasn't been
ported to.

Anyone who has dreamed of writing a new (useful) OS interested?


Best regards,
Maxim Levitsky

2010-06-04 23:58:17

by Nigel Cunningham

Subject: Re: [SUSPECTED SPAM] Re: [linux-pm] Proposal for a new algorithm for reading & writing a hibernation image.

Hi Maxim.

On 05/06/10 09:39, Maxim Levitsky wrote:
> On Thu, 2010-06-03 at 16:50 +0200, Pavel Machek wrote:
>>
>> "Nigel Cunningham"<[email protected]> wrote:
>>
>>> Hi.
>>>
>>> On 30/05/10 15:25, Pavel Machek wrote:
>>>> Hi!
>>>>
>>>>> 2. Prior to writing any of the image, also set up new 4k page tables
>>>>> such that an attempt to make a change to any of the pages we're about to
>>>>> write to disk will result in a page fault, giving us an opportunity to
>>>>> flag the page as needing an atomic copy later. Once this is done, write
>>>>> protection for the page can be disabled and the write that caused the
>>>>> fault allowed to proceed.
>>>>
>>>> Tricky.
>>>>
>>>> page faulting code touches memory, too...
>>>
>>> Yeah. I realise we'd need to make the pages that are used to record the
>>> faults be unprotected themselves. I'm imagining a bitmap for that.
>>>
>>> Do you see any reason that it could be inherently impossible? That's
>>> what I really want to know before (potentially) wasting time trying it.
>>
>> I'm not sure it is impossible, but it certainly seems way too complex to be
>> practical.
>>
>> 2mb pages will probably present a problem, as will bat mappings on powerpc.
>
>
> Some time ago, after tuxonce caused medium fs corruption twice on my
> root filesystem (superblock gone for example), I was thinking too about
> how to make it safe to save whole memory.

I'd be asking why you got the corruption. On the odd occasion where it
has been reported, it's usually been because the person didn't set up
their initramfs correctly (resumed after mounting filesystems). Is there
any chance that you did that?

> Your tuxonice is so fast that it resembles suspend to ram.

That depends on hard drive speed and CPU speed. I've just gotten a new
SSD drive, and can understand your statement now, but I wouldn't have
said the same beforehand.

> I have radically different proposal.
>
>
> Lets create a kind of self-contained very small operation system that
> will know to do just one thing, write the memory to disk.
> From now on I am calling this OS, a suspend module.
> Physically its code can be contained in linux kernel, or loaded as a
> module.
>
>
> Let see how things will work first:
>
> 1. Linux loads the suspend module to memory (if it is inside kernel
> image, that becomes unnecessary)
>
> At that point, its even possible to add some user plug-ins to that
> module for example to draw splash screen. Of course all such plug-ins
> must be root approved.
>
>
> 2. Linux turns off all devices, but hard disk.
> Drivers for hard drives will register for this exception.
>
>
> 3. Linux creates a list of memory areas to save (or exclude from save,
> doesn't matter)
>
> 4. Linux creates a list of hard disk sectors that will contain the
> image.
> This ensures support for swap partition and swap files as well.
>
> 5. Linux allocates small 'scratch space'
> Of course if memory is very tight, some swapping can happen, but that
> isn't significant.
>
>
> 6. Linux creates new page tables that cover: the suspend module, both of
> above lists, scratch space, and (optionally) the framebuffer RW,
> and rest of memory RO.
>
> 7. Linux switches to new page table, and passes control to that module.
> Even if the module wanted to it won't be able to change system memory.
> It won't even know how to do so.
>
> 8. Module optionally encrypts and/or compresses (and saves result to
> scratch page)
>
> 9. Module uses very simplified disk drivers to write the memory to disk.
> These drivers can even omit using interrupts because there is nothing
> else to do.
> It can also draw progress bar on framebuffer using optional plugin
>
> 10. Module passes control back to linux, which just shuts system off.

Sounds a lot like the kexec-based hibernation that was suggested a year
or two back. Have you thought about resuming, too? That's the trickier
part of the process.

> Now what code will be in the module:
>
> 1. Optional compression& encryption - easy
> 2. Draw modules, also optional and easy
>
>
> 3. New disk drivers.
> This is the hard part, but if we cover libata and ahci, we will cover
> the common case.
> Other cases can be handled by existing code that saved 1/2 of ram.

To my mind, supporting only some hardware isn't an option.

> 4. Arch specific code. Since it doesn't deal with interrupts nor memory
> managment, it won't be lot of code.
> Again standard swsusp can be used for arches that that module wasn't
> ported to.

Perhaps I'm being a pessimist, but it sounds to me like this is going to
be a way bigger project than you're allowing for.

> Anyone who had a dream to write a new (useful) OS is interested?

:)

Nigel

2010-06-05 00:05:21

by Nigel Cunningham

Subject: Re: [SUSPECTED SPAM] Re: [linux-pm] Proposal for a new algorithm for reading & writing a hibernation image.

Hi.

On 04/06/10 00:50, Pavel Machek wrote:
> "Nigel Cunningham"<[email protected]> wrote:
>> On 30/05/10 15:25, Pavel Machek wrote:
>>> Hi!
>>>
>>>> 2. Prior to writing any of the image, also set up new 4k page tables
>>>> such that an attempt to make a change to any of the pages we're about to
>>>> write to disk will result in a page fault, giving us an opportunity to
>>>> flag the page as needing an atomic copy later. Once this is done, write
>>>> protection for the page can be disabled and the write that caused the
>>>> fault allowed to proceed.
>>>
>>> Tricky.
>>>
>>> page faulting code touches memory, too...
>>
>> Yeah. I realise we'd need to make the pages that are used to record the
>> faults be unprotected themselves. I'm imagining a bitmap for that.
>>
>> Do you see any reason that it could be inherently impossible? That's
>> what I really want to know before (potentially) wasting time trying it.
>
> I'm not sure it is impossible, but it certainly seems way too complex to be
> practical.

Oh. I thought this bit would actually be quite simple if it was
technically possible. I'm more concerned about the potential for
difficulties with restoring the state successfully.

> 2mb pages will probably present a problem, as will bat mappings on powerpc.

I have the idea that 2MB pages are only used for the kernel text and
read-only data. Is that right? If so, perhaps they can just be
unconditionally copied (so that we can restore the image if a different
kernel is booted) and wouldn't need any page protection. Does that sound
right?

From the small bit I've read about BAT mappings on PowerPC, it
looks like they could be replaced with normal PTEs while doing
hibernation. More than willing to be told I don't understand what's
going on there :)

Nigel

2010-06-05 00:21:09

by Nigel Cunningham

Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

Hi again.

As I think about this more, I reckon we could run into problems at
resume time with reloading the image. Even if some bits aren't modified
as we're writing the image, they still might need to be atomically
restored. If we make the atomic restore part too small, we might not be
able to do that.

So perhaps the best thing would be to stick with the way TuxOnIce splits
the image at the moment (page cache / process pages vs 'rest'), but
using this faulting mechanism to ensure we do get all the pages that are
changed while writing the first part of the image.
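
In outline, with purely hypothetical helper names (none of this is
actual TuxOnIce code), the save would then run as:

/* All four helpers are hypothetical, named only for this outline. */
extern void write_protect_part1_pages(void);    /* page cache / process pages */
extern void write_part1_to_disk(void);          /* faults flag pages that change */
extern void atomic_copy_rest_and_flagged(void); /* 'rest' + flagged pages */
extern void write_part2_to_disk(void);          /* save the atomic copy */

/* Illustrative outline of the two-part image write. */
static int toi_write_image(void)
{
        write_protect_part1_pages();
        write_part1_to_disk();
        atomic_copy_rest_and_flagged();
        write_part2_to_disk();
        return 0;
}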

Regards,

Nigel

2010-06-05 00:36:17

by Maxim Levitsky

Subject: Re: [SUSPECTED SPAM] Re: [linux-pm] Proposal for a new algorithm for reading & writing a hibernation image.

On Sat, 2010-06-05 at 09:58 +1000, Nigel Cunningham wrote:
> Hi Maxim.
>
> On 05/06/10 09:39, Maxim Levitsky wrote:
> > On Thu, 2010-06-03 at 16:50 +0200, Pavel Machek wrote:
> >>
> >> "Nigel Cunningham"<[email protected]> wrote:
> >>
> >>> Hi.
> >>>
> >>> On 30/05/10 15:25, Pavel Machek wrote:
> >>>> Hi!
> >>>>
> >>>>> 2. Prior to writing any of the image, also set up new 4k page tables
> >>>>> such that an attempt to make a change to any of the pages we're about to
> >>>>> write to disk will result in a page fault, giving us an opportunity to
> >>>>> flag the page as needing an atomic copy later. Once this is done, write
> >>>>> protection for the page can be disabled and the write that caused the
> >>>>> fault allowed to proceed.
> >>>>
> >>>> Tricky.
> >>>>
> >>>> page faulting code touches memory, too...
> >>>
> >>> Yeah. I realise we'd need to make the pages that are used to record the
> >>> faults be unprotected themselves. I'm imagining a bitmap for that.
> >>>
> >>> Do you see any reason that it could be inherently impossible? That's
> >>> what I really want to know before (potentially) wasting time trying it.
> >>
> >> I'm not sure it is impossible, but it certainly seems way too complex to be
> >> practical.
> >>
> >> 2mb pages will probably present a problem, as will bat mappings on powerpc.
> >
> >
> > Some time ago, after tuxonce caused medium fs corruption twice on my
> > root filesystem (superblock gone for example), I was thinking too about
> > how to make it safe to save whole memory.
>
> I'd be asking why you got the corruption. On the odd occasion where it
> has been reported, it's usually been because the person didn't set up
> their initramfs correctly (resumed after mounting filesystems). Is there
> any chance that you did that?
>
> > Your tuxonice is so fast that it resembles suspend to ram.
>
> That depends on hard drive speed and CPU speed. I've just gotten a new
> SSD drive, and can understand your statement now, but I wouldn't have
> said the same beforehand.
Nope, I have a slow laptop drive.

>
> > I have radically different proposal.
> >
> >
> > Lets create a kind of self-contained very small operation system that
> > will know to do just one thing, write the memory to disk.
> > From now on I am calling this OS, a suspend module.
> > Physically its code can be contained in linux kernel, or loaded as a
> > module.
> >
> >
> > Let see how things will work first:
> >
> > 1. Linux loads the suspend module to memory (if it is inside kernel
> > image, that becomes unnecessary)
> >
> > At that point, its even possible to add some user plug-ins to that
> > module for example to draw splash screen. Of course all such plug-ins
> > must be root approved.
> >
> >
> > 2. Linux turns off all devices, but hard disk.
> > Drivers for hard drives will register for this exception.
> >
> >
> > 3. Linux creates a list of memory areas to save (or exclude from save,
> > doesn't matter)
> >
> > 4. Linux creates a list of hard disk sectors that will contain the
> > image.
> > This ensures support for swap partition and swap files as well.
> >
> > 5. Linux allocates small 'scratch space'
> > Of course if memory is very tight, some swapping can happen, but that
> > isn't significant.
> >
> >
> > 6. Linux creates new page tables that cover: the suspend module, both of
> > above lists, scratch space, and (optionally) the framebuffer RW,
> > and rest of memory RO.
> >
> > 7. Linux switches to new page table, and passes control to that module.
> > Even if the module wanted to it won't be able to change system memory.
> > It won't even know how to do so.
> >
> > 8. Module optionally encrypts and/or compresses (and saves result to
> > scratch page)
> >
> > 9. Module uses very simplified disk drivers to write the memory to disk.
> > These drivers can even omit using interrupts because there is nothing
> > else to do.
> > It can also draw progress bar on framebuffer using optional plugin
> >
> > 10. Module passes control back to linux, which just shuts system off.
>
> Sounds a lot like kexec based hibernation that was suggested a year or
> two back. Have you thought about resuming, too? That's the trickier part
> of the process.
Why is it tricky?

We can just reserve, say, 25 MB of memory and make the resuming kernel
use only that for all its needs.




>
> > Now what code will be in the module:
> >
> > 1. Optional compression& encryption - easy
> > 2. Draw modules, also optional and easy
> >
> >
> > 3. New disk drivers.
> > This is the hard part, but if we cover libata and ahci, we will cover
> > the common case.
> > Other cases can be handled by existing code that saved 1/2 of ram.
>
> To my mind, supporting only some hardware isn't an option.


>
> > 4. Arch specific code. Since it doesn't deal with interrupts nor memory
> > managment, it won't be lot of code.
> > Again standard swsusp can be used for arches that that module wasn't
> > ported to.
>
> Perhaps I'm being a pessimist, but it sounds to me like this is going to
> be a way bigger project than you're allowing for.
I also think so. This is just an idea.


To add a comment on your idea:

I think it is possible to use page faults to see which memory regions
changed. Actually, it's a very interesting idea.

You just need to install your own page fault handler, and make sure it
doesn't touch any memory.
Of course, the sucky part will be how to edit the page tables.
You might need to write your own code to do so, to be sure.
And this has to be arch-specific.

Since userspace is frozen, you can be sure that faults can only be
caused by writes to write-protected memory, or by kernel bugs.
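
To make 'edit the page tables yourself' concrete, here is roughly the
shape such code would take - a very rough, x86-flavoured, untested
sketch with an invented name; a real version would live in arch code
and would also have to cope with the 2MB mappings Pavel mentioned:

#include <linux/mm.h>
#include <asm/pgtable.h>
#include <asm/tlbflush.h>

/* Illustrative only: walk the page tables and clear the write bit
 * in the 4k PTE mapping 'addr', so the next write to it faults. */
static void hib_write_protect_page(struct mm_struct *mm, unsigned long addr)
{
        pgd_t *pgd = pgd_offset(mm, addr);
        pud_t *pud;
        pmd_t *pmd;
        pte_t *pte;

        if (pgd_none(*pgd))
                return;
        pud = pud_offset(pgd, addr);
        if (pud_none(*pud))
                return;
        pmd = pmd_offset(pud, addr);
        if (pmd_none(*pmd) || pmd_large(*pmd))
                return; /* 2MB mapping: the problem case noted earlier */
        pte = pte_offset_kernel(pmd, addr);
        set_pte(pte, pte_wrprotect(*pte));
        __flush_tlb_one(addr); /* make the protection visible now */
}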


Best regards,
Maxim Levitsky

2010-06-05 00:45:27

by Maxim Levitsky

Subject: Re: [SUSPECTED SPAM] Re: [linux-pm] Proposal for a new algorithm for reading & writing a hibernation image.

On Sat, 2010-06-05 at 03:36 +0300, Maxim Levitsky wrote:
> On Sat, 2010-06-05 at 09:58 +1000, Nigel Cunningham wrote:
> > Hi Maxim.
> >
> > On 05/06/10 09:39, Maxim Levitsky wrote:
> > > On Thu, 2010-06-03 at 16:50 +0200, Pavel Machek wrote:
> > >>
> > >> "Nigel Cunningham"<[email protected]> wrote:
> > >>
> > >>> Hi.
> > >>>
> > >>> On 30/05/10 15:25, Pavel Machek wrote:
> > >>>> Hi!
> > >>>>
> > >>>>> 2. Prior to writing any of the image, also set up new 4k page tables
> > >>>>> such that an attempt to make a change to any of the pages we're about to
> > >>>>> write to disk will result in a page fault, giving us an opportunity to
> > >>>>> flag the page as needing an atomic copy later. Once this is done, write
> > >>>>> protection for the page can be disabled and the write that caused the
> > >>>>> fault allowed to proceed.
> > >>>>
> > >>>> Tricky.
> > >>>>
> > >>>> page faulting code touches memory, too...
> > >>>
> > >>> Yeah. I realise we'd need to make the pages that are used to record the
> > >>> faults be unprotected themselves. I'm imagining a bitmap for that.
> > >>>
> > >>> Do you see any reason that it could be inherently impossible? That's
> > >>> what I really want to know before (potentially) wasting time trying it.
> > >>
> > >> I'm not sure it is impossible, but it certainly seems way too complex to be
> > >> practical.
> > >>
> > >> 2mb pages will probably present a problem, as will bat mappings on powerpc.
> > >
> > >
> > > Some time ago, after tuxonce caused medium fs corruption twice on my
> > > root filesystem (superblock gone for example), I was thinking too about
> > > how to make it safe to save whole memory.
> >
> > I'd be asking why you got the corruption. On the odd occasion where it
> > has been reported, it's usually been because the person didn't set up
> > their initramfs correctly (resumed after mounting filesystems). Is there
> > any chance that you did that?
I didn't use any initramfs.
I did use kernel modesetting and nouveau.
I used ext4.
The corruption happened after a normal suspend.
I replaced swsusp with TuxOnIce.

Anyway, some more or less verified method must be used to save memory,
because fs corruption is too scary a thing to have.

I can't say it scared me that much, 'cause I had dealt with worse
corruption before, but being thrown to "grub rescue>" on boot is not a
pleasant thing to see.

Best regards,
Maxim Levitsky

2010-06-05 00:47:19

by Nigel Cunningham

Subject: Re: [SUSPECTED SPAM] Re: [linux-pm] Proposal for a new algorithm for reading & writing a hibernation image.

Hi.

On 05/06/10 10:36, Maxim Levitsky wrote:
> On Sat, 2010-06-05 at 09:58 +1000, Nigel Cunningham wrote:
>> On 05/06/10 09:39, Maxim Levitsky wrote:
>>> On Thu, 2010-06-03 at 16:50 +0200, Pavel Machek wrote:
>>>> "Nigel Cunningham"<[email protected]> wrote:
>>>>> On 30/05/10 15:25, Pavel Machek wrote:
>>>>>> Hi!
>>>>>>
>>>>>>> 2. Prior to writing any of the image, also set up new 4k page tables
>>>>>>> such that an attempt to make a change to any of the pages we're about to
>>>>>>> write to disk will result in a page fault, giving us an opportunity to
>>>>>>> flag the page as needing an atomic copy later. Once this is done, write
>>>>>>> protection for the page can be disabled and the write that caused the
>>>>>>> fault allowed to proceed.
>>>>>>
>>>>>> Tricky.
>>>>>>
>>>>>> page faulting code touches memory, too...
>>>>>
>>>>> Yeah. I realise we'd need to make the pages that are used to record the
>>>>> faults be unprotected themselves. I'm imagining a bitmap for that.
>>>>>
>>>>> Do you see any reason that it could be inherently impossible? That's
>>>>> what I really want to know before (potentially) wasting time trying it.
>>>>
>>>> I'm not sure it is impossible, but it certainly seems way too complex to be
>>>> practical.
>>>>
>>>> 2mb pages will probably present a problem, as will bat mappings on powerpc.
>>>
>>>
>>> Some time ago, after tuxonce caused medium fs corruption twice on my
>>> root filesystem (superblock gone for example), I was thinking too about
>>> how to make it safe to save whole memory.
>>
>> I'd be asking why you got the corruption. On the odd occasion where it
>> has been reported, it's usually been because the person didn't set up
>> their initramfs correctly (resumed after mounting filesystems). Is there
>> any chance that you did that?
>>
>>> Your tuxonice is so fast that it resembles suspend to ram.
>>
>> That depends on hard drive speed and CPU speed. I've just gotten a new
>> SSD drive, and can understand your statement now, but I wouldn't have
>> said the same beforehand.
> Nope, I have a slow laptop drive.

Oh, okay. Not much RAM then? I would have thought that in most cases -
and especially with a slow laptop drive - suspend to RAM would be waaay
faster. Ah well, there is a huge variation in specs.

>>> I have radically different proposal.
>>>
>>>
>>> Lets create a kind of self-contained very small operation system that
>>> will know to do just one thing, write the memory to disk.
>>> From now on I am calling this OS, a suspend module.
>>> Physically its code can be contained in linux kernel, or loaded as a
>>> module.
>>>
>>>
>>> Let see how things will work first:
>>>
>>> 1. Linux loads the suspend module to memory (if it is inside kernel
>>> image, that becomes unnecessary)
>>>
>>> At that point, its even possible to add some user plug-ins to that
>>> module for example to draw splash screen. Of course all such plug-ins
>>> must be root approved.
>>>
>>>
>>> 2. Linux turns off all devices, but hard disk.
>>> Drivers for hard drives will register for this exception.
>>>
>>>
>>> 3. Linux creates a list of memory areas to save (or exclude from save,
>>> doesn't matter)
>>>
>>> 4. Linux creates a list of hard disk sectors that will contain the
>>> image.
>>> This ensures support for swap partition and swap files as well.
>>>
>>> 5. Linux allocates small 'scratch space'
>>> Of course if memory is very tight, some swapping can happen, but that
>>> isn't significant.
>>>
>>>
>>> 6. Linux creates new page tables that cover: the suspend module, both of
>>> above lists, scratch space, and (optionally) the framebuffer RW,
>>> and rest of memory RO.
>>>
>>> 7. Linux switches to new page table, and passes control to that module.
>>> Even if the module wanted to it won't be able to change system memory.
>>> It won't even know how to do so.
>>>
>>> 8. Module optionally encrypts and/or compresses (and saves result to
>>> scratch page)
>>>
>>> 9. Module uses very simplified disk drivers to write the memory to disk.
>>> These drivers can even omit using interrupts because there is nothing
>>> else to do.
>>> It can also draw progress bar on framebuffer using optional plugin
>>>
>>> 10. Module passes control back to linux, which just shuts system off.
>>
>> Sounds a lot like kexec based hibernation that was suggested a year or
>> two back. Have you thought about resuming, too? That's the trickier part
>> of the process.
> Why its tricky?
>
> We can just reseve say 25 MB of memory and make resuming kernel only use
> it for all its needs.

Well, I suppose in this scenario, you can do it all atomically. I was
thinking of where we do a two-part restore (still trying to maximise
image size, but without a separate kernel).

>>> Now what code will be in the module:
>>>
>>> 1. Optional compression& encryption - easy
>>> 2. Draw modules, also optional and easy
>>>
>>>
>>> 3. New disk drivers.
>>> This is the hard part, but if we cover libata and ahci, we will cover
>>> the common case.
>>> Other cases can be handled by existing code that saved 1/2 of ram.
>>
>> To my mind, supporting only some hardware isn't an option.
>
>
>>
>>> 4. Arch specific code. Since it doesn't deal with interrupts nor memory
>>> managment, it won't be lot of code.
>>> Again standard swsusp can be used for arches that that module wasn't
>>> ported to.
>>
>> Perhaps I'm being a pessimist, but it sounds to me like this is going to
>> be a way bigger project than you're allowing for.
> I also thinks so. This is just an idea.
>
>
> To add a comment on your idea.
>
> I think is is possible to use page faults to see which memory regions
> changed. Actually its is very interesting idea.
>
> You just need to install your own page fault handler, and make sure it
> doesn't touch any memory.

If the memory it writes to isn't protected, there'll be no recursive
page fault and no problem, right? I'm imagining this page fault handler
will only set a flag to record that the page needs to be atomically
copied, copy the original contents to a page previously prepared for the
purpose, remove the write protection for the page and allow the write to
continue. That should be okay, right?
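
A minimal sketch of that handler behaviour, with every name invented
here for illustration and the unprotect helper simply assumed to exist
(it would be the reverse of whatever set the protection up):

#include <linux/mm.h>
#include <linux/bitops.h>
#include <linux/errno.h>
#include <asm/atomic.h>
#include <asm/page.h>

#define TOI_NR_COPY_PAGES 1024  /* arbitrary pool size for the sketch */

/* Both of these live in pages that are never write-protected. */
static unsigned long *toi_faulted_bitmap;
static void *toi_copy_pool[TOI_NR_COPY_PAGES];
static atomic_t toi_copy_next = ATOMIC_INIT(0);

/* Assumed to exist: clears the PTE write protection for 'addr'. */
extern void hib_write_unprotect_page(unsigned long addr);

/* Called from the fault path for a write to a protected page. */
static int toi_wp_fault(unsigned long addr)
{
        unsigned long pfn = __pa(addr) >> PAGE_SHIFT;
        int slot;

        /* Flag the page as needing an atomic copy later. */
        if (!test_and_set_bit(pfn, toi_faulted_bitmap)) {
                slot = atomic_inc_return(&toi_copy_next) - 1;
                if (slot >= TOI_NR_COPY_PAGES)
                        return -ENOMEM; /* pool exhausted: abort, retry */
                /* Preserve the original contents before the write lands. */
                copy_page(toi_copy_pool[slot],
                          (void *)(addr & PAGE_MASK));
        }

        /* Drop the protection and let the faulting write proceed. */
        hib_write_unprotect_page(addr);
        return 0;
}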

> Of course the sucky part will be how to edit the page tables.
> You might need to write your own code to do so to be sure.
> And this has to be arch specific.

Yeah. I wondered whether the code that's already used for creating page
tables for the atomic restore could be reused, at least in part.

> Since userspace is frozen, you can be sure that faults can only be
> caused by access to WO memory or kernel bugs.

Userspace helpers or uswsusp shouldn't be forgotten.

Regards,

Nigel

2010-06-05 01:16:16

by Maxim Levitsky

Subject: Re: [SUSPECTED SPAM] Re: [linux-pm] Proposal for a new algorithm for reading & writing a hibernation image.


> If the memory it writes to isn't protected, there'll be no recursive
> page fault and no problem, right? I'm imagining this page fault handler
> will only set a flag to record that the page needs to be atomically
> copied, copy the original contents to a page previously prepared for the
> purpose, remove the write protection for the page and allow the write to
> continue. That should be okay, right?
I think so, although I don't yet have the experience to comment on such
things. Despite that, I think you might run out of 'pages previously
prepared for the purpose'.
However, you can adopt a retry process, like you do today in TuxOnIce.
Just abort the suspend, and do it again.

>
> > Of course the sucky part will be how to edit the page tables.
> > You might need to write your own code to do so to be sure.
> > And this has to be arch specific.
>
> Yeah. I wondered whether the code that's already used for creating page
> tables for the atomic restore could be reused, at least in part.
This is very dangerous.
The code might work now, and tomorrow somebody will add code that does
memory writes.
The point is that you must be sure that there are no recursive faults,
or somehow deal with them (this is probably too dangerous to even think
of).


>
> > Since userspace is frozen, you can be sure that faults can only be
> > caused by access to WO memory or kernel bugs.
>
> Userspace helpers or uswsusp shouldn't be forgotten.
This is especially bad, because a fault in userspace will mean swapping.
You won't get away with a custom page fault handler for this.
You could ensure before suspend that all relevant userspace is not
swapped out, or forget about userspace, because it's a minor thing
compared to the speed increase of a full memory write.


Best regards,
Maxim Levitsky

2010-06-05 03:17:49

by Nigel Cunningham

Subject: Re: [SUSPECTED SPAM] Re: [linux-pm] Proposal for a new algorithm for reading & writing a hibernation image.

Hi.

On 05/06/10 11:16, Maxim Levitsky wrote:
>
>> If the memory it writes to isn't protected, there'll be no recursive
>> page fault and no problem, right? I'm imagining this page fault handler
>> will only set a flag to record that the page needs to be atomically
>> copied, copy the original contents to a page previously prepared for the
>> purpose, remove the write protection for the page and allow the write to
>> continue. That should be okay, right?
> I think so, although I have no experience yet to comment on such things.
> Despite that I think you might run out of 'page previously prepared for
> the purpose'
> However you can adopt a retrial process, like you do today in tuxonce.
> Just abort suspend, and do it again.

Yeah. I'm expecting that it will be reasonably predictable - at least
ballpark. I guess there's only one way to find out...

>>> Of course the sucky part will be how to edit the page tables.
>>> You might need to write your own code to do so to be sure.
>>> And this has to be arch specific.
>>
>> Yeah. I wondered whether the code that's already used for creating page
>> tables for the atomic restore could be reused, at least in part.
> This is very dangerous.
> The code might work now, and tomorrow somebody will add a code that does
> memory writes.
> The point is that you must be sure that there are no recursive faults,
> or somehow deal with them (this is probably too dangerous to even think
> of)

That shouldn't be too hard - after all, we're going to know what memory
we're using to record the info. As long as we don't do anything silly
like protecting our own data, we should be right.

>>> Since userspace is frozen, you can be sure that faults can only be
>>> caused by access to WO memory or kernel bugs.
>>
>> Userspace helpers or uswsusp shouldn't be forgotten.
> This is especially bad. because a fault in userspace will mean swapping.
> You won't get away with custom page fault for this.
> You could assure before suspend that all relevant userspace is not
> swapped, or forget about userspace, because its minor thing compared to
> speed increases of full memory write.

Mmm, but existing userspace helpers for TuxOnIce are carefully designed
so everything is in memory before we start work, and so that nothing is
done which will compromise the integrity of the image. I assume the same
applies to uswsusp. I'm more concerned just to make sure that we don't
forget to modify the page tables for these userspace processes (at least
the TuxOnIce ones) so that any modifications to kernel memory made while
in their process context are also caught.

Regards,

Nigel

2010-06-05 03:37:42

by Nigel Cunningham

Subject: Re: [SUSPECTED SPAM] Re: [linux-pm] Proposal for a new algorithm for reading & writing a hibernation image.

Hi.

On 05/06/10 10:45, Maxim Levitsky wrote:
> On Sat, 2010-06-05 at 03:36 +0300, Maxim Levitsky wrote:
>> On Sat, 2010-06-05 at 09:58 +1000, Nigel Cunningham wrote:
>>> Hi Maxim.
>>>
>>> On 05/06/10 09:39, Maxim Levitsky wrote:
>>>> On Thu, 2010-06-03 at 16:50 +0200, Pavel Machek wrote:
>>>>>
>>>>> "Nigel Cunningham"<[email protected]> wrote:
>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> On 30/05/10 15:25, Pavel Machek wrote:
>>>>>>> Hi!
>>>>>>>
>>>>>>>> 2. Prior to writing any of the image, also set up new 4k page tables
>>>>>>>> such that an attempt to make a change to any of the pages we're about to
>>>>>>>> write to disk will result in a page fault, giving us an opportunity to
>>>>>>>> flag the page as needing an atomic copy later. Once this is done, write
>>>>>>>> protection for the page can be disabled and the write that caused the
>>>>>>>> fault allowed to proceed.
>>>>>>>
>>>>>>> Tricky.
>>>>>>>
>>>>>>> page faulting code touches memory, too...
>>>>>>
>>>>>> Yeah. I realise we'd need to make the pages that are used to record the
>>>>>> faults be unprotected themselves. I'm imagining a bitmap for that.
>>>>>>
>>>>>> Do you see any reason that it could be inherently impossible? That's
>>>>>> what I really want to know before (potentially) wasting time trying it.
>>>>>
>>>>> I'm not sure it is impossible, but it certainly seems way too complex to be
>>>>> practical.
>>>>>
>>>>> 2mb pages will probably present a problem, as will bat mappings on powerpc.
>>>>
>>>>
>>>> Some time ago, after tuxonce caused medium fs corruption twice on my
>>>> root filesystem (superblock gone for example), I was thinking too about
>>>> how to make it safe to save whole memory.
>>>
>>> I'd be asking why you got the corruption. On the odd occasion where it
>>> has been reported, it's usually been because the person didn't set up
>>> their initramfs correctly (resumed after mounting filesystems). Is there
>>> any chance that you did that?
> I didn't use any initramfs.
> I did use kernel modesetting and nouveau.
> I used ext4.
> The corruption happened after normal suspend.

What's 'normal suspend'?

> I replaces swsusp with tuxonice.
>
> Anyway, some more or less verified method must be used to save memory
> because fs corruption is too scary thing to have.

Agreed.

> I can't say it scared me that much 'cause I had dealt with worse
> corruptions before, but being thrown to "grub rescue>" on boot is not
> pleasant thing to see.

Oh, I agree and don't want anyone to ever experience corruption because
of TuxOnIce. Unfortunately my wishes don't just happen :)

Nigel

2010-06-05 13:00:04

by Theodore Ts'o

Subject: Re: [TuxOnIce-devel] [SUSPECTED SPAM] Re: [linux-pm] Proposal for a new algorithm for reading & writing a hibernation image.


On Jun 4, 2010, at 8:05 PM, Nigel Cunningham wrote:
>> 2mb pages will probably present a problem, as will bat mappings on powerpc.
>
> I have the idea that 2MB pages are only used for the kernel text and read only data. Is that right? If so, perhaps they can just be unconditionally copied (so that we can restore the image if a different kernel is booted) and wouldn't need any page protection. Does that sound right?

No, hugepages are available for use by userspace programs.

See: https://lwn.net/Articles/374424/

- Ted

2010-06-05 18:44:10

by Rafael J. Wysocki

Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On Saturday 05 June 2010, Nigel Cunningham wrote:
> Hi again.
>
> As I think about this more, I reckon we could run into problems at
> resume time with reloading the image. Even if some bits aren't modified
> as we're writing the image, they still might need to be atomically
> restored. If we make the atomic restore part too small, we might not be
> able to do that.
>
> So perhaps the best thing would be to stick with the way TuxOnIce splits
> the image at the moment (page cache / process pages vs 'rest'), but
> using this faulting mechanism to ensure we do get all the pages that are
> changed while writing the first part of the image.

I still don't quite understand why you insist on saving the page cache data
upfront and re-using the memory occupied by them for another purpose. If you
dropped that requirement, I'd really have much less of a problem with the
TuxOnIce's approach.

Rafael

2010-06-05 19:10:35

by Maxim Levitsky

Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On Sat, 2010-06-05 at 20:45 +0200, Rafael J. Wysocki wrote:
> On Saturday 05 June 2010, Nigel Cunningham wrote:
> > Hi again.
> >
> > As I think about this more, I reckon we could run into problems at
> > resume time with reloading the image. Even if some bits aren't modified
> > as we're writing the image, they still might need to be atomically
> > restored. If we make the atomic restore part too small, we might not be
> > able to do that.
> >
> > So perhaps the best thing would be to stick with the way TuxOnIce splits
> > the image at the moment (page cache / process pages vs 'rest'), but
> > using this faulting mechanism to ensure we do get all the pages that are
> > changed while writing the first part of the image.
>
> I still don't quite understand why you insist on saving the page cache data
> upfront and re-using the memory occupied by them for another purpose. If you
> dropped that requirement, I'd really have much less of a problem with the
> TuxOnIce's approach.
Because it's the biggest advantage?
Really, saving the whole of memory makes a huge difference.


Best regards,
Maxim Levitsky

2010-06-05 19:20:34

by Rafael J. Wysocki

Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On Saturday 05 June 2010, Maxim Levitsky wrote:
> On Sat, 2010-06-05 at 20:45 +0200, Rafael J. Wysocki wrote:
> > On Saturday 05 June 2010, Nigel Cunningham wrote:
> > > Hi again.
> > >
> > > As I think about this more, I reckon we could run into problems at
> > > resume time with reloading the image. Even if some bits aren't modified
> > > as we're writing the image, they still might need to be atomically
> > > restored. If we make the atomic restore part too small, we might not be
> > > able to do that.
> > >
> > > So perhaps the best thing would be to stick with the way TuxOnIce splits
> > > the image at the moment (page cache / process pages vs 'rest'), but
> > > using this faulting mechanism to ensure we do get all the pages that are
> > > changed while writing the first part of the image.
> >
> > I still don't quite understand why you insist on saving the page cache data
> > upfront and re-using the memory occupied by them for another purpose. If you
> > dropped that requirement, I'd really have much less of a problem with the
> > TuxOnIce's approach.
> Because its the biggest advantage?

It isn't in fact.

> Really saving whole memory makes huge difference.

You don't have to save the _whole_ memory to get the same speed (you don't
do that anyway, but the amount of data you don't put into the image with
TuxOnIce is smaller). Something like 80% would be just sufficient IMO and
then (a) the level of complications involved would drop significantly and (2)
you'd be able to use the image-reading code already in the kernel without
any modifications. It really looks like a win-win to me, doesn't it?

Rafael

2010-06-05 22:55:03

by Nigel Cunningham

Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

Hi.

On 06/06/10 05:21, Rafael J. Wysocki wrote:
> On Saturday 05 June 2010, Maxim Levitsky wrote:
>> On Sat, 2010-06-05 at 20:45 +0200, Rafael J. Wysocki wrote:
>>> On Saturday 05 June 2010, Nigel Cunningham wrote:
>>>> Hi again.
>>>>
>>>> As I think about this more, I reckon we could run into problems at
>>>> resume time with reloading the image. Even if some bits aren't modified
>>>> as we're writing the image, they still might need to be atomically
>>>> restored. If we make the atomic restore part too small, we might not be
>>>> able to do that.
>>>>
>>>> So perhaps the best thing would be to stick with the way TuxOnIce splits
>>>> the image at the moment (page cache / process pages vs 'rest'), but
>>>> using this faulting mechanism to ensure we do get all the pages that are
>>>> changed while writing the first part of the image.
>>>
>>> I still don't quite understand why you insist on saving the page cache data
>>> upfront and re-using the memory occupied by them for another purpose. If you
>>> dropped that requirement, I'd really have much less of a problem with the
>>> TuxOnIce's approach.
>> Because its the biggest advantage?
>
> It isn't in fact.

Because saving a complete image of memory gives you a much more
responsive system, post-resume - especially if (as is likely) you're
going to keep doing the same work post-resume that you were doing
pre-hibernate. Saving a complete image means it's for all intents and
purposes just as if you'd never done the hibernation. Dropping the page
cache, on the other hand, slows things down post-resume because it has
to be repopulated - and the repopulation takes longer than reading the
pages as part of the image because they're not compressed and there's
extra work required to get the pages back in.

>> Really saving whole memory makes huge difference.
>
> You don't have to save the _whole_ memory to get the same speed (you don't
> do that anyway, but the amount of data you don't put into the image with
> TuxOnIce is smaller). Something like 80% would be just sufficient IMO and
> then (a) the level of complications involved would drop significantly and (2)
> you'd be able to use the image-reading code already in the kernel without
> any modifications. It really looks like a win-win to me, doesn't it?

It is certainly true that you'll notice the effect less if you save 80%
of memory instead of 40%, but how much you'll be affected is also
heavily influenced by your amount of memory and how you're using it. If
you're swapping heavily or don't have much memory (embedded), freeing
memory might not be an option.

At the end of the day, I would argue that the user knows best, and this
should be a tuneable. This is, in fact, the way TuxOnIce has done it for
years: the user can use a single sysfs entry to set a (soft) image size
limit in MB (values 1 and up), tell TuxOnIce to only free memory if
needed (0), abort if freeing memory is necessary (-1) or drop caches (-2).
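
In code, those semantics come out roughly as follows - an illustrative
sketch only, with invented names, not TuxOnIce's actual source:

/* Illustrative sketch of the image size limit semantics described
 * above; names and structure are invented for this example. */
enum toi_limit_action {
        TOI_DROP_CACHES,        /* -2: drop caches before saving */
        TOI_ABORT_IF_NEEDED,    /* -1: abort if freeing memory is necessary */
        TOI_FREE_IF_NEEDED,     /*  0: only free memory if needed */
        TOI_SOFT_CAP_MB,        /* >=1: soft image size limit, in MB */
};

static enum toi_limit_action toi_interpret_limit(int image_size_limit)
{
        switch (image_size_limit) {
        case -2:
                return TOI_DROP_CACHES;
        case -1:
                return TOI_ABORT_IF_NEEDED;
        case 0:
                return TOI_FREE_IF_NEEDED;
        default:
                return TOI_SOFT_CAP_MB;
        }
}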

I do agree that doing a single atomic copy and saving the result makes
for a simpler algorithm, but I've always been of the opinion that we're
writing code to satisfy real-world needs and desires, not our own desires
for simpler or easier to understand algorithms. Doing the bare minimum
isn't an option for me. That's why I started trying to improve swsusp in
the first place, and why I kept working on it even through the
difficulties I've had with Pavel and times when I've really just wanted
to drop the whole thing.

Saving the image in two parts isn't inherently unreliable, Rafael. Even
the recent KMS changes haven't broken TuxOnIce - the kernel bugzilla
report turned out to be KMS breakage, not TuxOnIce (I didn't change
anything in TuxOnIce, and it started working again in 2.6.34). Yes, this
isn't a guarantee that something in the future won't break TuxOnIce, but
it does show (and especially when you remember that it's worked this way
without issue for something like 8 or 9 years) that the basic concept
isn't inherently flawed. The page faulting idea is, I think, the last
piece of the puzzle to make it perfectly reliable, regardless of what
changes are made in the future.

Regards,

Nigel

2010-06-05 23:01:48

by Nigel Cunningham

Subject: Re: [TuxOnIce-devel] [SUSPECTED SPAM] Re: [linux-pm] Proposal for a new algorithm for reading & writing a hibernation image.

Hi.

On 05/06/10 22:59, Theodore Tso wrote:
>
> On Jun 4, 2010, at 8:05 PM, Nigel Cunningham wrote:
>>> 2mb pages will probably present a problem, as will bat mappings on powerpc.
>>
>> I have the idea that 2MB pages are only used for the kernel text and read only data. Is that right? If so, perhaps they can just be unconditionally copied (so that we can restore the image if a different kernel is booted) and wouldn't need any page protection. Does that sound right?
>
> No, hugepages are available for use by userspace programs.
>
> See: https://lwn.net/Articles/374424/

Ta!

I'll take a look.

Nigel

2010-06-05 23:19:50

by Rafael J. Wysocki

Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On Sunday 06 June 2010, Nigel Cunningham wrote:
> Hi.
>
> On 06/06/10 05:21, Rafael J. Wysocki wrote:
> > On Saturday 05 June 2010, Maxim Levitsky wrote:
> >> On Sat, 2010-06-05 at 20:45 +0200, Rafael J. Wysocki wrote:
> >>> On Saturday 05 June 2010, Nigel Cunningham wrote:
> >>>> Hi again.
> >>>>
> >>>> As I think about this more, I reckon we could run into problems at
> >>>> resume time with reloading the image. Even if some bits aren't modified
> >>>> as we're writing the image, they still might need to be atomically
> >>>> restored. If we make the atomic restore part too small, we might not be
> >>>> able to do that.
> >>>>
> >>>> So perhaps the best thing would be to stick with the way TuxOnIce splits
> >>>> the image at the moment (page cache / process pages vs 'rest'), but
> >>>> using this faulting mechanism to ensure we do get all the pages that are
> >>>> changed while writing the first part of the image.
> >>>
> >>> I still don't quite understand why you insist on saving the page cache data
> >>> upfront and re-using the memory occupied by them for another purpose. If you
> >>> dropped that requirement, I'd really have much less of a problem with the
> >>> TuxOnIce's approach.
> >> Because its the biggest advantage?
> >
> > It isn't in fact.
>
> Because saving a complete image of memory gives you a much more
> responsive system, post-resume - especially if (as is likely) you're
> going to keep doing the same work post-resume that you were doing
> pre-hibernate.

We've been over that argument (at least) 100 times already and I still claim
that the user won't see a difference between putting 80% and 95% of RAM
contents into the image (you don't save 100%, at least not every time).

> Saving a complete image means it's for all intents and
> purposes just as if you'd never done the hibernation. Dropping page
> cache, on the other hand, slows things down post-resume because it has
> to be repopulated - and the repopulation takes longer than reading the
> pages as part of the image because they're not compressed and there's
> extra work required to get the pages back in.

I'm not talking about dropping the page cache, but about keeping it in place
and saving it as part of the image - later. The part I think is too complicated
is the re-using of that memory for creating the "atomic" image. That in my
opinion really goes too far and causes things to be excessively fragile -
without a really good reason (it is like "we do that because we can" IMO).

> >> Really saving whole memory makes huge difference.
> >
> > You don't have to save the _whole_ memory to get the same speed (you don't
> > do that anyway, but the amount of data you don't put into the image with
> > TuxOnIce is smaller). Something like 80% would be just sufficient IMO and
> > then (a) the level of complications involved would drop significantly and (2)
> > you'd be able to use the image-reading code already in the kernel without
> > any modifications. It really looks like a win-win to me, doesn't it?
>
> It is certainly true that you'll notice the effect less if you save 80%
> of memory instead of 40%, but how much you'll be affected is also
> heavily influenced by your amount of memory and how you're using it. If
> you're swapping heavily or don't have much memory (embedded), freeing
> memory might not be an option.

I don't think you have any practical example of anything like this, do you?

> At the end of the day, I would argue that the user knows best, and this
> should be a tuneable. This is, in fact the way TuxOnIce has done it for
> years: the user can use a single sysfs entry to set a (soft) image size
> limit in MB (values 1 and up), tell TuxOnIce to only free memory if
> needed (0), abort if freeing memory is necessary (-1) or drop caches (-2).
>
> I do agree that doing a single atomic copy and saving the result makes
> for a simpler algorithm, but I've always been of the opinion that we're
> writing code to satisfy real work needs and desires, not our own desires
> for simpler or easier to understand algorithms. Doing the bare minimum
> isn't an option for me.

I'm not talking about that!

In short, if your observation that the page cache doesn't really change during
hibernation is correct, then it should be possible to avoid making an atomic
copy of it and to save it directly from its original locations. I think that
would allow us to save about 80% of memory in the majority of cases without
the entire complexity that makes things extremely fragile and depends heavily
on the current (undocumented) behavior of our mm subsystem that _happens_
to be favourable to TuxOnIce. HTH

Rafael

2010-06-06 00:40:36

by Maxim Levitsky

Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On Sat, 2010-06-05 at 21:21 +0200, Rafael J. Wysocki wrote:
> On Saturday 05 June 2010, Maxim Levitsky wrote:
> > On Sat, 2010-06-05 at 20:45 +0200, Rafael J. Wysocki wrote:
> > > On Saturday 05 June 2010, Nigel Cunningham wrote:
> > > > Hi again.
> > > >
> > > > As I think about this more, I reckon we could run into problems at
> > > > resume time with reloading the image. Even if some bits aren't modified
> > > > as we're writing the image, they still might need to be atomically
> > > > restored. If we make the atomic restore part too small, we might not be
> > > > able to do that.
> > > >
> > > > So perhaps the best thing would be to stick with the way TuxOnIce splits
> > > > the image at the moment (page cache / process pages vs 'rest'), but
> > > > using this faulting mechanism to ensure we do get all the pages that are
> > > > changed while writing the first part of the image.
> > >
> > > I still don't quite understand why you insist on saving the page cache data
> > > upfront and re-using the memory occupied by them for another purpose. If you
> > > dropped that requirement, I'd really have much less of a problem with the
> > > TuxOnIce's approach.
> > Because its the biggest advantage?
>
> It isn't in fact.
>
> > Really saving whole memory makes huge difference.
>
> You don't have to save the _whole_ memory to get the same speed (you don't
> do that anyway, but the amount of data you don't put into the image with
> TuxOnIce is smaller). Something like 80% would be just sufficient IMO and
> then (a) the level of complications involved would drop significantly and (2)
> you'd be able to use the image-reading code already in the kernel without
> any modifications. It really looks like a win-win to me, doesn't it?


Well, in fact, on modern systems it's not possible to save 100% of RAM
even if we save it all, because of video memory.
Look, I've got 256MB of video RAM, and when compiz is used I'd say most
of it is used, and it isn't going to be magically preserved during
suspend.
So the system still has to free about 256MB of memory before suspend
(which means around 80% of RAM is saved in the best case :-) )

Best regards,
Maxim Levitsky

2010-06-06 07:01:53

by Nigel Cunningham

Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

Hi Rafael.

On 06/06/10 09:20, Rafael J. Wysocki wrote:
> On Sunday 06 June 2010, Nigel Cunningham wrote:
>> On 06/06/10 05:21, Rafael J. Wysocki wrote:
>>> On Saturday 05 June 2010, Maxim Levitsky wrote:
>>>> On Sat, 2010-06-05 at 20:45 +0200, Rafael J. Wysocki wrote:
>>>>> On Saturday 05 June 2010, Nigel Cunningham wrote:
>>>>>> Hi again.
>>>>>>
>>>>>> As I think about this more, I reckon we could run into problems at
>>>>>> resume time with reloading the image. Even if some bits aren't modified
>>>>>> as we're writing the image, they still might need to be atomically
>>>>>> restored. If we make the atomic restore part too small, we might not be
>>>>>> able to do that.
>>>>>>
>>>>>> So perhaps the best thing would be to stick with the way TuxOnIce splits
>>>>>> the image at the moment (page cache / process pages vs 'rest'), but
>>>>>> using this faulting mechanism to ensure we do get all the pages that are
>>>>>> changed while writing the first part of the image.
>>>>>
>>>>> I still don't quite understand why you insist on saving the page cache data
>>>>> upfront and re-using the memory occupied by them for another purpose. If you
>>>>> dropped that requirement, I'd really have much less of a problem with the
>>>>> TuxOnIce's approach.
>>>> Because its the biggest advantage?
>>>
>>> It isn't in fact.
>>
>> Because saving a complete image of memory gives you a much more
>> responsive system, post-resume - especially if (as is likely) you're
>> going to keep doing the same work post-resume that you were doing
>> pre-hibernate.
>
> We've given that argument for (at least) 100 times already and I still claim
> that the user won't see a difference between putting 80% and 95% of RAM
> contents into the image (you don't save 100%, at least not every time).

On 64 bit operating systems, saving 100% of the image - even with full
RAM - is entirely possible and, in my experience, the norm. For the
last month or so, I've been running a 32 bit OS again on my 64 bit
laptop, and have been seeing it free memory more often because of the
constraints highmem imposes (I haven't gotten around to trying those
changes you made which might help in this regard).

Whether running 32 bit or 64, the part of the image that's saved prior
to the atomic copy usually accounts for around (going off progress bars)
80-95% of the image. This is why - for 64 bit at least - it's rare to
have to free memory. The atomically copied part easily fits in the
memory that's already been saved.

So the main reasons for not saving 100% of the image would be:

1) The user said they don't want 100% saved (image size limit sysfs entry)
2) Insufficient storage (user choice)
3) 32 bit OS with highmem constraints (which I'll hopefully deal with soon).

>> Saving a complete image means it's for all intents and
>> purposes just as if you'd never done the hibernation. Dropping page
>> cache, on the other hand, slows things down post-resume because it has
>> to be repopulated - and the repopulation takes longer than reading the
>> pages as part of the image because they're not compressed and there's
>> extra work required to get the pages back in.
>
> I'm not talking about dropping the page cache, but about keeping it in place
> and saving as a part of the image - later. The part I think is too complicated
> is the re-using of that memory for creating the "atomic" image. That in my
> opinion really goes too far and causes things to be excessively fragile -
> without a really good reason (it is like "we do that because we can" IMO).

First, it's not fragile. All it depends on is the freezer being
effective, just as the other parts of hibernation depend on the freezer
being effective. Prior to this page fault idea, checksumming was used to
confirm that the contents of memory hadn't changed. I can think of
examples where pages have been found to have changed, but they're few
and far between, and easily addressed by resaving the affected pages in
the atomic copy.
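
For illustration, that checksumming scheme could look something like
this - hash each page as it is written, re-check before the atomic
copy, resave on mismatch. The names are invented; the kmap_atomic and
crc32 calls are as found in kernels of this era:

#include <linux/highmem.h>
#include <linux/crc32.h>

static u32 *toi_page_csums; /* one entry per saved page, preallocated */

/* Record a page's checksum as it is written to the image. */
static void toi_record_csum(unsigned long index, struct page *page)
{
        void *addr = kmap_atomic(page, KM_USER0);

        toi_page_csums[index] = crc32(0, addr, PAGE_SIZE);
        kunmap_atomic(addr, KM_USER0);
}

/* Before the atomic copy: non-zero means the page changed and must
 * be resaved as part of the atomically copied portion of the image. */
static int toi_page_changed(unsigned long index, struct page *page)
{
        void *addr = kmap_atomic(page, KM_USER0);
        u32 now = crc32(0, addr, PAGE_SIZE);

        kunmap_atomic(addr, KM_USER0);
        return now != toi_page_csums[index];
}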

Second, it's not done without reason or simply because we can. It's done
because it's been proven to make it more likely for us to be able to
hibernate successfully in the first place AND gives us a more responsive
system post-resume.

We haven't mentioned the first part so far, so let me go into more
detail there. The problem with not doing things the TuxOnIce way is that
when you have more than (say) 80% of memory in use, you MUST free
memory. Depending upon your workload, that simply might not be possible.
In other cases, the only way to free memory might be to swap it out, but
you're then reducing the amount of storage available for the image,
which means you have to free more memory again, which means... For
maximum reliability, you need an algorithm wherein you can save the
contents of memory as they are at the start of the cycle.
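
If it helps, here's a toy model of that feedback loop, with made-up
numbers, where every megabyte you free by swapping is a megabyte of
image storage lost:

#include <stdio.h>

int main(void)
{
	long in_use = 1800;    /* MB of memory in use at hibernate time */
	long target = 1600;    /* MB the image is allowed to hold */
	long swap_free = 1700; /* MB of swap == storage for the image */

	while (in_use > target || in_use > swap_free) {
		long cap = target < swap_free ? target : swap_free;
		long excess = in_use - cap;

		in_use -= excess;    /* "free" memory by swapping it out... */
		swap_free -= excess; /* ...which shrinks the image storage */
		printf("swapped %ldMB: in_use=%ldMB swap_free=%ldMB\n",
		       excess, in_use, swap_free);
		if (swap_free <= 0) {
			puts("no storage left - hibernation fails");
			return 1;
		}
	}
	puts("image fits");
	return 0;
}

With these numbers it never converges: the gap between what's in use and
what will fit just gets chased down until the swap is exhausted.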

>>>> Really, saving the whole memory makes a huge difference.
>>>
>>> You don't have to save the _whole_ memory to get the same speed (you don't
>>> do that anyway, but the amount of data you don't put into the image with
>>> TuxOnIce is smaller). Something like 80% would be just sufficient IMO and
>>> then (a) the level of complications involved would drop significantly and (b)
>>> you'd be able to use the image-reading code already in the kernel without
>>> any modifications. It really looks like a win-win to me, doesn't it?
>>
>> It is certainly true that you'll notice the effect less if you save 80%
>> of memory instead of 40%, but how much you'll be affected is also
>> heavily influenced by your amount of memory and how you're using it. If
>> you're swapping heavily or don't have much memory (embedded), freeing
>> memory might not be an option.
>
> I don't think you have any practical example of anything like this, do you?

I don't have one right now that I can copy and paste, but it wouldn't be
hard at all to show the effect of eating more or less memory by running
a range of image size limits with timed kernel compiles afterwards. To
prove the second part of my statement, I'd have to boot with mem=. I
certainly don't see it with 4GB of memory, but I'm writing from
recollections of when I worked hard on reliability while I still had
a laptop with 1GB of RAM and users who often had less. Hmmm... I wonder
if I can find archived mailing list discussions from that period. If you
insist, I'll go looking :)

>> At the end of the day, I would argue that the user knows best, and this
>> should be a tuneable. This is, in fact, the way TuxOnIce has done it for
>> years: the user can use a single sysfs entry to set a (soft) image size
>> limit in MB (values 1 and up), tell TuxOnIce to only free memory if
>> needed (0), abort if freeing memory is necessary (-1) or drop caches (-2).
>>
>> I do agree that doing a single atomic copy and saving the result makes
>> for a simpler algorithm, but I've always been of the opinion that we're
>> writing code to satisfy real world needs and desires, not our own desires
>> for simpler or easier to understand algorithms. Doing the bare minimum
>> isn't an option for me.
>
> I'm not talking about that!
>
> In short, if your observation that the page cache doesn't really change during
> hibernation is correct, then it should be possible to avoid making an atomic
> copy of it and to save it directly from its original locations. I think that
> would allow us to save about 80% of memory in the majority of cases without
> the entire complexity that makes things extremely fragile and depends heavily
> on the current (undocumented) behavior of our mm subsystem that _happens_
> to be favourable to TuxOnIce. HTH

I'm not sure what this current undocumented behaviour is. All I'm
relying on is the freezer working and the mm subsystem not deciding to
free process pages or LRU for no good reason. Remember that kswapd is
also frozen.

Regards,

Nigel

2010-06-06 13:55:51

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On Sunday 06 June 2010, Maxim Levitsky wrote:
> On Sat, 2010-06-05 at 21:21 +0200, Rafael J. Wysocki wrote:
> > On Saturday 05 June 2010, Maxim Levitsky wrote:
> > > On Sat, 2010-06-05 at 20:45 +0200, Rafael J. Wysocki wrote:
> > > > On Saturday 05 June 2010, Nigel Cunningham wrote:
> > > > > Hi again.
> > > > >
> > > > > As I think about this more, I reckon we could run into problems at
> > > > > resume time with reloading the image. Even if some bits aren't modified
> > > > > as we're writing the image, they still might need to be atomically
> > > > > restored. If we make the atomic restore part too small, we might not be
> > > > > able to do that.
> > > > >
> > > > > So perhaps the best thing would be to stick with the way TuxOnIce splits
> > > > > the image at the moment (page cache / process pages vs 'rest'), but
> > > > > using this faulting mechanism to ensure we do get all the pages that are
> > > > > changed while writing the first part of the image.
> > > >
> > > > I still don't quite understand why you insist on saving the page cache data
> > > > upfront and re-using the memory occupied by them for another purpose. If you
> > > > dropped that requirement, I'd really have much less of a problem with the
> > > > TuxOnIce's approach.
> > > > Because it's the biggest advantage?
> >
> > It isn't in fact.
> >
> > > > Really, saving the whole memory makes a huge difference.
> >
> > You don't have to save the _whole_ memory to get the same speed (you don't
> > do that anyway, but the amount of data you don't put into the image with
> > TuxOnIce is smaller). Something like 80% would be just sufficient IMO and
> > then (a) the level of complications involved would drop significantly and (b)
> > you'd be able to use the image-reading code already in the kernel without
> > any modifications. It really looks like a win-win to me, doesn't it?
>
>
> Well, in fact on modern systems it's not possible to save 100% of RAM
> even if we save it all, because of video memory.
> Look, I've got 256MB of video RAM, and when compiz is used I'd say most of it
> is used, and it isn't going to be magically preserved during suspend.
> So the system still has to free about 256MB of memory before suspend (which
> means around 80% of RAM is saved in the best case :-) )

So how does TuxOnIce help here?

Rafael

2010-06-06 14:04:50

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On Sunday 06 June 2010, Nigel Cunningham wrote:
> Hi Rafael.

Hi,

> On 06/06/10 09:20, Rafael J. Wysocki wrote:
> > On Sunday 06 June 2010, Nigel Cunningham wrote:
...
> > I'm not talking about dropping the page cache, but about keeping it in place
> > and saving as a part of the image - later. The part I think is too complicated
> > is the re-using of that memory for creating the "atomic" image. That in my
> > opinion really goes too far and causes things to be excessively fragile -
> > without a really good reason (it is like "we do that because we can" IMO).
>
> First, it's not fragile.

Well, I obviously don't agree and I'm not convinced by the arguments below.

> All it depends on is the freezer being
> effective, just as the other parts of hibernation depend on the freezer
> being effective. Prior to this page-fault idea, checksumming was used to
> confirm that the contents of memory haven't changed. I can think of
> examples where pages have been found to have changed, but they're few
> and far between, and easily addressed by resaving the affected pages in
> the atomic copy.
>
> Second, it's not done without reason or simply because we can. It's done
> because it's been proven to make it more likely for us to be able to
> hibernate successfully in the first place AND gives us a more responsive
> system post-resume.
>
> We haven't mentioned the first part so far, so let me go into more
> detail there. The problem with not doing things the TuxOnIce way is that
> when you have more than (say) 80% of memory in use, you MUST free
> memory. Depending upon your workload, that simply might not be possible.
> In other cases, the only way to free memory might be to swap it out, but
> you're then reducing the amount of storage available for the image,
> which means you have to free more memory again, which means... For
> maximum reliability, you need an algorithm wherein you can save the
> contents of memory as they are at the start of the cycle.
>
...
> >> I do agree that doing a single atomic copy and saving the result makes
> >> for a simpler algorithm, but I've always been of the opinion that we're
> >> writing code to satisfy real world needs and desires, not our own desires
> >> for simpler or easier to understand algorithms. Doing the bare minimum
> >> isn't an option for me.
> >
> > I'm not talking about that!
> >
> > In short, if your observation that the page cache doesn't really change during
> > hibernation is correct, then it should be possible to avoid making an atomic
> > copy of it and to save it directly from its original locations. I think that
> > would allow us to save about 80% of memory in the majority of cases without
> the entire complexity that makes things extremely fragile and depends heavily
> > on the current (undocumented) behavior of our mm subsystem that _happens_
> > to be favourable to TuxOnIce. HTH
>
> I'm not sure what this current undocumented behaviour is.

Easy. The behavior that allows you to use the memory occupied by the page
cache during hibernation without the risk of it being overwritten in the
process. This is not documented anywhere and I don't think it ever will be.

> All I'm relying on is the freezer working and the mm subsystem not deciding to
> free process pages or LRU for no good reason.

It's more than just freeing them. In fact you need a guarantee that their
contents won't be modified over the entire hibernation in a way that you don't
control. There's no such guarantee at the moment that I know of, so you have to
assume that that won't happen, which is _exactly_ relying on undocumented
behavior that's not guaranteed to stay the same in the future.

> Remember that kswapd is also frozen.

But some day it may turn out that it would be better not to freeze it for some
reason. If we go the TuxOnIce route, that won't ever be possible I think.

Rafael

2010-06-06 15:54:17

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On Sun, 2010-06-06 at 15:57 +0200, Rafael J. Wysocki wrote:
> On Sunday 06 June 2010, Maxim Levitsky wrote:
> > On Sat, 2010-06-05 at 21:21 +0200, Rafael J. Wysocki wrote:
> > > On Saturday 05 June 2010, Maxim Levitsky wrote:
> > > > On Sat, 2010-06-05 at 20:45 +0200, Rafael J. Wysocki wrote:
> > > > > On Saturday 05 June 2010, Nigel Cunningham wrote:
> > > > > > Hi again.
> > > > > >
> > > > > > As I think about this more, I reckon we could run into problems at
> > > > > > resume time with reloading the image. Even if some bits aren't modified
> > > > > > as we're writing the image, they still might need to be atomically
> > > > > > restored. If we make the atomic restore part too small, we might not be
> > > > > > able to do that.
> > > > > >
> > > > > > So perhaps the best thing would be to stick with the way TuxOnIce splits
> > > > > > the image at the moment (page cache / process pages vs 'rest'), but
> > > > > > using this faulting mechanism to ensure we do get all the pages that are
> > > > > > changed while writing the first part of the image.
> > > > >
> > > > > I still don't quite understand why you insist on saving the page cache data
> > > > > upfront and re-using the memory occupied by them for another purpose. If you
> > > > > dropped that requirement, I'd really have much less of a problem with the
> > > > > TuxOnIce's approach.
> > > > Because it's the biggest advantage?
> > >
> > > It isn't in fact.
> > >
> > > > Really, saving the whole memory makes a huge difference.
> > >
> > > You don't have to save the _whole_ memory to get the same speed (you don't
> > > do that anyway, but the amount of data you don't put into the image with
> > > TuxOnIce is smaller). Something like 80% would be just sufficient IMO and
> > > then (a) the level of complications involved would drop significantly and (b)
> > > you'd be able to use the image-reading code already in the kernel without
> > > any modifications. It really looks like a win-win to me, doesn't it?
> >
> >
> > Well, in fact on modern systems it's not possible to save 100% of RAM
> > even if we save it all, because of video memory.
> > Look, I've got 256MB of video RAM, and when compiz is used I'd say most of it
> > is used, and it isn't going to be magically preserved during suspend.
> > So the system still has to free about 256MB of memory before suspend (which
> > means around 80% of RAM is saved in the best case :-) )
>
> So how does TuxOnIce help here?
Very simple.

With swsusp, I can save 750MB (memory) + 250MB (vram)
With full memory save I can save (1750 MB of memory) + 250 MB of
vram....

Of course, the vram save can surely be made non-atomic....
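
For what it's worth, the arithmetic behind those numbers, assuming 2GB
of RAM and the usual swsusp constraint that the image fits in at most
half of RAM (the exact cap is my assumption here):

#include <stdio.h>

int main(void)
{
	int ram = 2000, vram = 250; /* MB, rounded */
	int swsusp_cap = ram / 2;   /* atomic copy needs the other half free */

	printf("swsusp: %dMB memory + %dMB vram\n", swsusp_cap - vram, vram);
	printf("full:   %dMB memory + %dMB vram\n", ram - vram, vram);
	return 0;
}

Either way the vram stash eats 250MB out of whatever the image budget is.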


Best regards,
Maxim Levitsky

2010-06-06 19:03:26

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On Sunday 06 June 2010, Maxim Levitsky wrote:
> On Sun, 2010-06-06 at 15:57 +0200, Rafael J. Wysocki wrote:
> > On Sunday 06 June 2010, Maxim Levitsky wrote:
...
> > So how does TuxOnIce help here?
> Very simple.
>
> With swsusp, I can save 750MB (memory) + 250MB (vram)
> With full memory save I can save (1750 MB of memory) + 250 MB of
> vram....

So what about being able to save 1600 MB total instead of the 2 GB
(which is what we're talking about in case that's not clear)? Would it
be _that_ _much_ worse?

Rafael

2010-06-06 19:51:36

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On Sun, 2010-06-06 at 21:04 +0200, Rafael J. Wysocki wrote:
> On Sunday 06 June 2010, Maxim Levitsky wrote:
> > On Sun, 2010-06-06 at 15:57 +0200, Rafael J. Wysocki wrote:
> > > On Sunday 06 June 2010, Maxim Levitsky wrote:
> ...
> > > So how does TuxOnIce help here?
> > Very simple.
> >
> > With swsusp, I can save 750MB (memory) + 250MB (vram)
> > With full memory save I can save (1750 MB of memory) + 250 MB of
> > vram....
>
> So what about being able to save 1600 MB total instead of the 2 GB
> (which is what we're talking about in case that's not clear)? Would it
> be _that_ _much_ worse?

That I agree with you.

Best regards,
Maxim Levitsky

2010-06-06 21:55:35

by Pedro Ribeiro

[permalink] [raw]
Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On 6 June 2010 20:04, Rafael J. Wysocki <[email protected]> wrote:
> On Sunday 06 June 2010, Maxim Levitsky wrote:
>> On Sun, 2010-06-06 at 15:57 +0200, Rafael J. Wysocki wrote:
>> > On Sunday 06 June 2010, Maxim Levitsky wrote:
> ...
>> > So how does TuxOnIce help here?
>> Very simple.
>>
>> With swsusp, I can save 750MB (memory) + 250MB (vram)
>> With full memory save I can save (1750 MB of memory) + 250 MB of
>> vram....
>

I am completely unaware of the technical difficulties of saving the
whole memory vs 80% of it, but from my experience with TuxOnIce I
fully agree with Nigel on saving the whole memory.

The fact is that the whole system is much more responsive than when
using swsusp. No doubt about that, I can tell you that TuxOnIce really
changed the way I use my computer. I only used swsusp when I really
needed to, because of the lagginess at startup, and TuxOnIce changed that.

I have a laptop computer which I have to shut down every night and open
every morning. It has 4GB of RAM and I usually have 3.5 to 3.8GB in use
all the time. With TuxOnIce I can restart my work every morning in
under 25 seconds, exactly where I left it, without any delays or
lagginess.

It's kind of hard to express in words, but really, this gave me a
completely different view of how to use a computer. Of course you can
compare it to S2R, but this consumes energy, no matter how little -
this has an environmental and financial cost which will only increase
in the future.

> So what about being able to save 1600 MB total instead of the 2 GB
> (which is what we're talking about in case that's not clear)? Would it
> be _that_ _much_ worse?

No, it wouldn't be much worse. But there will still be some lagginess,
some delay, some sort of annoying disk activity compared to NO delay,
NO lagginess - in short, you have your computer _exactly_ the way you
left it when you hibernated. And the difference is
noticeable.

>
> Rafael


Sorry to jump into the thread, but I just wanted to give my end-user perspective.

Regards,
Pedro

2010-06-07 05:23:35

by Nigel Cunningham

[permalink] [raw]
Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

Hi.

On 07/06/10 00:06, Rafael J. Wysocki wrote:
> On Sunday 06 June 2010, Nigel Cunningham wrote:
>> On 06/06/10 09:20, Rafael J. Wysocki wrote:
>>> On Sunday 06 June 2010, Nigel Cunningham wrote:
> ...
>>> I'm not talking about dropping the page cache, but about keeping it in place
>>> and saving as a part of the image - later. The part I think is too complicated
>>> is the re-using of that memory for creating the "atomic" image. That in my
>>> opinion really goes too far and causes things to be excessively fragile -
>>> without a really good reason (it is like "we do that because we can" IMO).
>>
>> First, it's not fragile.
>
> Well, I obviously don't agree and I'm not convinced by the arguments below.

Okay. I'm going to assume you're not being unreasonable and ask: "What
do you find unconvincing in the arguments below? That is, what can I do
to help build a better case for you?"

>> All it depends on is the freezer being
>> effective, just as the other parts of hibernation depend on the freezer
>> being effective. Prior to this page-fault idea, checksumming was used to
>> confirm that the contents of memory haven't changed. I can think of
>> examples where pages have been found to have changed, but they're few
>> and far between, and easily addressed by resaving the affected pages in
>> the atomic copy.
>>
>> Second, it's not done without reason or simply because we can. It's done
>> because it's been proven to make it more likely for us to be able to
>> hibernate successfully in the first place AND gives us a more responsive
>> system post-resume.
>>
>> We haven't mentioned the first part so far, so let me go into more
>> detail there. The problem with not doing things the TuxOnIce way is that
>> when you have more than (say) 80% of memory in use, you MUST free
>> memory. Depending upon your workload, that simply might not be possible.
>> In other cases, the only way to free memory might be to swap it out, but
>> you're then reducing the amount of storage available for the image,
>> which means you have to free more memory again, which means... For
>> maximum reliability, you need an algorithm wherein you can save the
>> contents of memory as they are at the start of the cycle.
>>
> ...
>>>> I do agree that doing a single atomic copy and saving the result makes
>>>> for a simpler algorithm, but I've always been of the opinion that we're
>>>> writing code to satisfy real world needs and desires, not our own desires
>>>> for simpler or easier to understand algorithms. Doing the bare minimum
>>>> isn't an option for me.
>>>
>>> I'm not talking about that!
>>>
>>> In short, if your observation that the page cache doesn't really change during
>>> hibernation is correct, then it should be possible to avoid making an atomic
>>> copy of it and to save it directly from its original locations. I think that
>>> would allow us to save about 80% of memory in the majority of cases without
>>> the entire complexity that makes things extremely fragile and depends heavily
>>> on the current (undocumented) behavior of our mm subsystem that _happens_
>>> to be favourable to TuxOnIce. HTH
>>
>> I'm not sure what this current undocumented behaviour is.
>
> Easy. The behavior that allows you to use the memory occupied by the page
> cache during hibernation without the risk of it being overwritten in the
> process. This is not documented anywhere and I don't think it ever will be.
>
>> All I'm relying on is the freezer working and the mm subsystem not deciding to
>> free process pages or LRU for no good reason.
>
> It's more than just freeing them. In fact you need a guarantee that their
> contents won't be modified over the entire hibernation in a way that you don't
> control. There's no such guarantee at the moment that I know of, so you have to
> assume that that won't happen, which is _exactly_ relying on undocumented
> behavior that's not guaranteed to stay the same in the future.

I think it's rather unfair to talk about undocumented and unguaranteed
behaviour when you know I'm relying on the freezer, which is documented
and guaranteed to work.

I'm willing to modify things so we use this page-fault idea to make the
guarantee even more certain - would that satisfy you?
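
If it helps, this is the shape of the idea in userspace terms - a
minimal sketch where mprotect() and SIGSEGV stand in for the kernel's
write-protected page tables (the real thing would flip PTE bits and hook
the fault path, and the bitmap would live in pages that are never
write-protected):

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 16

static long pagesz;
static uint8_t *region;
static unsigned long touched; /* one bit per page: needs resaving */

static void on_fault(int sig, siginfo_t *si, void *uc)
{
	size_t page = ((uintptr_t)si->si_addr - (uintptr_t)region) / pagesz;

	(void)sig; (void)uc;
	touched |= 1UL << page; /* flag the page for the atomic copy */
	/* Drop the protection so the faulting write can complete.
	 * (mprotect() in a signal handler is fine for a demo.) */
	mprotect(region + page * pagesz, pagesz, PROT_READ | PROT_WRITE);
}

int main(void)
{
	struct sigaction sa = { 0 };

	pagesz = sysconf(_SC_PAGESIZE);
	region = mmap(NULL, NPAGES * pagesz, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	sa.sa_sigaction = on_fault;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGSEGV, &sa, NULL);

	/* "Start writing the image": write-protect everything first. */
	mprotect(region, NPAGES * pagesz, PROT_READ);

	region[5 * pagesz] = 1; /* faults once, gets flagged, completes */
	region[9 * pagesz] = 1;

	printf("pages to resave: 0x%lx\n", touched);
	return 0;
}

Each protected page takes exactly one fault on its first write; after
that it runs at full speed and we know to resave it.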

>> Remember that kswapd is also frozen.
>
> But some day it may turn out that it would be better not to freeze it for some
> reason. If we go the TuxOnIce route, that won't ever be possible I think.

It may also turn out some day that it's better not to freeze any
processes at all.

But seriously, what could possibly lead us to that decision? The only
reason we'd want to not freeze kswapd would be if we wanted it to be
able to free memory while we're hibernating. What demand would there be
for such memory apart from our own routines for writing the image? What
impetus would it have to do any freeing? After the atomic copy, any
other work is pointless - it's going to be thrown away when we power off.

Regards,

Nigel

2010-06-07 05:28:44

by Nigel Cunningham

[permalink] [raw]
Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

Hi.

On 06/06/10 23:57, Rafael J. Wysocki wrote:
> On Sunday 06 June 2010, Maxim Levitsky wrote:
>> On Sat, 2010-06-05 at 21:21 +0200, Rafael J. Wysocki wrote:
>>> On Saturday 05 June 2010, Maxim Levitsky wrote:
>>>> On Sat, 2010-06-05 at 20:45 +0200, Rafael J. Wysocki wrote:
>>>>> On Saturday 05 June 2010, Nigel Cunningham wrote:
>>>>>> Hi again.
>>>>>>
>>>>>> As I think about this more, I reckon we could run into problems at
>>>>>> resume time with reloading the image. Even if some bits aren't modified
>>>>>> as we're writing the image, they still might need to be atomically
>>>>>> restored. If we make the atomic restore part too small, we might not be
>>>>>> able to do that.
>>>>>>
>>>>>> So perhaps the best thing would be to stick with the way TuxOnIce splits
>>>>>> the image at the moment (page cache / process pages vs 'rest'), but
>>>>>> using this faulting mechanism to ensure we do get all the pages that are
>>>>>> changed while writing the first part of the image.
>>>>>
>>>>> I still don't quite understand why you insist on saving the page cache data
>>>>> upfront and re-using the memory occupied by them for another purpose. If you
>>>>> dropped that requirement, I'd really have much less of a problem with the
>>>>> TuxOnIce's approach.
>>>> Because it's the biggest advantage?
>>>
>>> It isn't in fact.
>>>
>>>> Really, saving the whole memory makes a huge difference.
>>>
>>> You don't have to save the _whole_ memory to get the same speed (you don't
>>> do that anyway, but the amount of data you don't put into the image with
>>> TuxOnIce is smaller). Something like 80% would be just sufficient IMO and
>>> then (a) the level of complications involved would drop significantly and (b)
>>> you'd be able to use the image-reading code already in the kernel without
>>> any modifications. It really looks like a win-win to me, doesn't it?
>>
>>
>> Well, in fact on modern systems it's not possible to save 100% of RAM
>> even if we save it all, because of video memory.
>> Look, I've got 256MB of video RAM, and when compiz is used I'd say most of it
>> is used, and it isn't going to be magically preserved during suspend.
>> So the system still has to free about 256MB of memory before suspend (which
>> means around 80% of RAM is saved in the best case :-) )
>
> So how does TuxOnIce help here?

The 256MB of video RAM is irrelevant, unless it's 'stolen', in which
case it will be saved.

2010-06-07 05:31:46

by Nigel Cunningham

[permalink] [raw]
Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

Hi.

On 07/06/10 05:04, Rafael J. Wysocki wrote:
> On Sunday 06 June 2010, Maxim Levitsky wrote:
>> On Sun, 2010-06-06 at 15:57 +0200, Rafael J. Wysocki wrote:
>>> On Sunday 06 June 2010, Maxim Levitsky wrote:
> ...
>>> So how does TuxOnIce help here?
>> Very simple.
>>
>> With swsusp, I can save 750MB (memory) + 250MB (vram)
>> With full memory save I can save (1750 MB of memory) + 250 MB of
>> vram....
>
> So what about being able to save 1600 MB total instead of the 2 GB
> (which is what we're talking about in case that's not clear)? Would it
> be _that_ _much_ worse?

That all depends on what is in the 400MB you discard.

The difference is "Just as if you'd never hibernated" vs something
closer to "Just as if you'd only just started up". We can't make
categorical statements because it really does depend upon what you
discard and what you want to do post-resume - that is, how useful the
memory you discard would have been. That's always going to vary from
case to case.

Regards,

Nigel

2010-06-07 08:39:16

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On Monday 07 June 2010, Nigel Cunningham wrote:
> Hi.
>
> On 07/06/10 00:06, Rafael J. Wysocki wrote:
> > On Sunday 06 June 2010, Nigel Cunningham wrote:
> >> On 06/06/10 09:20, Rafael J. Wysocki wrote:
> >>> On Sunday 06 June 2010, Nigel Cunningham wrote:
> > ...
> >>> I'm not talking about dropping the page cache, but about keeping it in place
> >>> and saving as a part of the image - later. The part I think is too complicated
> >>> is the re-using of that memory for creating the "atomic" image. That in my
> >>> opinion really goes too far and causes things to be excessively fragile -
> >>> without a really good reason (it is like "we do that because we can" IMO).
> >>
> >> First, it's not fragile.
> >
> > Well, I obviously don't agree and I'm not convinced by the arguments below.
>
> Okay. I'm going to assume you're not being unreasonable and ask: "What
> do you find unconvincing in the arguments below? That is, what can I do
> to help build a better case for you?"

First, the freezer really doesn't guarantee that things will work the way
you'd like, for the simple reason that not all processes are frozen. The
second paragraph below is simply wrong (it's not been _proven_, at least
not with respect to the case I'm talking about, where we save 80% of RAM,
and I don't believe the user will see a difference between systems where
80% and 90% or more of RAM is saved) and the third paragraph is just
hand waving.

> >> All it depends on is the freezer being
> >> effective, just as the other parts of hibernation depend on the freezer
> >> being effective. Prior to this page-fault idea, checksumming was used to
> >> confirm that the contents of memory haven't changed. I can think of
> >> examples where pages have been found to have changed, but they're few
> >> and far between, and easily addressed by resaving the affected pages in
> >> the atomic copy.
> >>
> >> Second, it's not done without reason or simply because we can. It's done
> >> because it's been proven to make it more likely for us to be able to
> >> hibernate successfully in the first place AND gives us a more responsive
> >> system post-resume.
> >>
> >> We haven't mentioned the first part so far, so let me go into more
> >> detail there. The problem with not doing things the TuxOnIce way is that
> >> when you have more than (say) 80% of memory in use, you MUST free
> >> memory. Depending upon your workload, that simply might not be possible.
> >> In other cases, the only way to free memory might be to swap it out, but
> >> you're then reducing the amount of storage available for the image,
> >> which means you have to free more memory again, which means... For
> >> maximum reliability, you need an algorithm wherein you can save the
> >> contents of memory as they are at the start of the cycle.
> >>
> > ...
> >>>> I do agree that doing a single atomic copy and saving the result makes
> >>>> for a simpler algorithm, but I've always been of the opinion that we're
> >>>> writing code to satisfy real world needs and desires, not our own desires
> >>>> for simpler or easier to understand algorithms. Doing the bare minimum
> >>>> isn't an option for me.
> >>>
> >>> I'm not talking about that!
> >>>
> >>> In short, if your observation that the page cache doesn't really change during
> >>> hibernation is correct, then it should be possible to avoid making an atomic
> >>> copy of it and to save it directly from its original locations. I think that
> >>> would allow us to save about 80% of memory in the majority of cases without
> >>> the entire complexity that makes things extremely fragile and depends heavily
> >>> on the current (undocumented) behavior of our mm subsystem that _happens_
> >>> to be favourable to TuxOnIce. HTH
> >>
> >> I'm not sure what this current undocumented behaviour is.
> >
> > Easy. The behavior that allows you to use the memory occupied by the page
> > cache during hibernation without the risk of it being overwritten in the
> > process. This is not documented anywhere and I don't think it ever will be.
> >
> >> All I'm relying on is the freezer working and the mm subsystem not deciding to
> >> free process pages or LRU for no good reason.
> >
> > It's more than just freeing them. In fact you need a guarantee that their
> > contents won't be modified over the entire hibernation in a way that you don't
> > control. There's no such guarantee at the moment that I know of, so you have to
> > assume that that won't happen, which is _exactly_ relying on undocumented
> > behavior that's not guaranteed to stay the same in the future.
>
> I think it's rather unfair to talk about undocumented and unguaranteed
> behaviour when you know I'm relying on the freezer, which is documented
> and guaranteed to work.

The freezer _doesn't_ give you the guarantee you need. It only guarantees
that user space will be frozen, which is _not_ _enough_.

> I'm willing to modify things so we use this page-fault idea to make the
> guarantee even more certain - would that satisfy you?

I said what I didn't like: re-using the page cache memory for another
purpose behind the back of the mm subsystem in the _hope_ that it won't
break. This is simply wrong IMO.

> >> Remember that kswapd is also frozen.
> >
> > But some day it may turn out that it would be better not to freeze it for some
> > reason. If we go the TuxOnIce route, that won't ever be possible I think.
>
> It may also turn out some day that it's better not to freeze any
> processes at all.

Well, I think we'll always need to freeze user space, more or less, but kernel
threads not necessarily.

> But seriously, what could possibly lead us to that decision? The only
> reason we'd want to not freeze kswapd would be if we wanted it to be
> able to free memory while we're hibernating.

And why would that be unreasonable?

> What demand would there be for such memory apart from our own routines
> for writing the image? What impetus would it have to do any freeing? After
> the atomic copy, any other work is pointless - it's going to be thrown away
> when we power off.

It may be useful for image saving or a progress meter or whatever is going on
while the image is being saved.

Rafael

2010-06-07 08:40:19

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On Sunday 06 June 2010, Pedro Ribeiro wrote:
> On 6 June 2010 20:04, Rafael J. Wysocki <[email protected]> wrote:
> > On Sunday 06 June 2010, Maxim Levitsky wrote:
...
> > So what about being able to save 1600 MB total instead of the 2 GB
> > (which is what we're talking about in case that's not clear)? Would it
> > be _that_ _much_ worse?
>
> No, it wouldn't be much worse. But there will still be some lagginess,
> some delay, some sort of annoying disk activity compared to NO delay,
> NO lagginess - in short, you have your computer _exactly_ the way you
> left it when you hibernated. And the difference is
> noticeable.

Well, have you actually tried that?

Rafael

2010-06-07 08:47:39

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On Monday 07 June 2010, Nigel Cunningham wrote:
> Hi.
>
> On 07/06/10 05:04, Rafael J. Wysocki wrote:
> > On Sunday 06 June 2010, Maxim Levitsky wrote:
> >> On Sun, 2010-06-06 at 15:57 +0200, Rafael J. Wysocki wrote:
> >>> On Sunday 06 June 2010, Maxim Levitsky wrote:
> > ...
> >>> So how does TuxOnIce help here?
> >> Very simple.
> >>
> >> With swsusp, I can save 750MB (memory) + 250MB (vram)
> >> With full memory save I can save (1750 MB of memory) + 250 MB of
> >> vram....
> >
> > So what about being able to save 1600 MB total instead of the 2 GB
> > (which is what we're talking about in case that's not clear)? Would it
> > be _that_ _much_ worse?
>
> That all depends on what is in the 400MB you discard.

Well, they are discarded following the LRU algorithm and it's very much
like loading a program that takes 20% of your memory upfront.

> The difference is "Just as if you'd never hibernated" vs something
> closer to "Just as if you'd only just started up". We can't make
> categorical statements because it really does depend upon what you
> discard and what you want to do post-resume - that is, how useful the
> memory you discard would have been. That's always going to vary from
> case to case.

Not so much.

Besides, it doesn't matter too much.

Let me reiterate, please. Doing serious memory management behind the back
of the mm subsystem (and trying to do that so it doesn't notice) is wrong, and
it only works by accident. As long as you do that, I have a problem
with TuxOnIce.

Rafael

2010-06-07 13:08:00

by Martin Steigerwald

[permalink] [raw]
Subject: Re: [TuxOnIce-devel] [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On Monday 07 June 2010, Nigel Cunningham wrote:
> Hi.

Hi Nigel and Rafael, hi everyone else involved,

> On 07/06/10 05:04, Rafael J. Wysocki wrote:
> > On Sunday 06 June 2010, Maxim Levitsky wrote:
> >> On Sun, 2010-06-06 at 15:57 +0200, Rafael J. Wysocki wrote:
> >>> On Sunday 06 June 2010, Maxim Levitsky wrote:
> > ...
> >
> >>> So how does TuxOnIce help here?
> >>
> >> Very simple.
> >>
> >> With swsusp, I can save 750MB (memory) + 250MB (vram)
> >> With full memory save I can save (1750 MB of memory) + 250 MB of
> >> vram....
> >
> > So what about being able to save 1600 MB total instead of the 2 GB
> > (which is what we're talking about in case that's not clear)? Would
> > it be _that_ _much_ worse?
>
> That all depends on what is in the 400MB you discard.
>
> The difference is "Just as if you'd never hibernated" vs something
> closer to "Just as if you'd only just started up". We can't make
> categorical statements because it really does depend upon what you
> discard and what you want to do post-resume - that is, how useful the
> memory you discard would have been. That's always going to vary from
> case to case.

Nigel and Rafael, how about just testing it?

What's needed to have 80% of the memory saved instead of 50%?

I think it's important to take the next steps towards a better snapshot in
the mainline kernel, even if you do not agree on the complete end result yet.

What about

- Rafael, you review the async write patches of Nigel. If they are good,
IMHO they should go in as soon as possible.

- Nigel and/or Rafael, you look at what's needed to save 80% instead of 50%
of the memory and develop a patch for it


?

Then this goes into one stable kernel series and gets tested in the wild.
And if that approach does not suffice to give a similar experience to
TuxOnIce, one could still look further. In that case I ask you, Rafael, to
at least listen open-mindedly to the practical experiences being reported
and to ideas for improving the situation.

I really want to see this make some progress instead of getting stuck in
discussion loops again. No offence meant - you do all the development
work! - but the time spent here IMHO is better spent on reviewing and
further refining the existing patches by Nigel and Jiri, and on developing a
patchset for the 80% solution, which should already help a lot.

Does that incremental approach sound acceptable for the time being?

IMHO *any* step forward helps!

Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7



2010-06-07 21:27:10

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [TuxOnIce-devel] [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On Monday 07 June 2010, Martin Steigerwald wrote:
> On Monday 07 June 2010, Nigel Cunningham wrote:
> > Hi.
>
> Hi Nigel and Rafael, hi everyone else involved,
>
> > On 07/06/10 05:04, Rafael J. Wysocki wrote:
> > > On Sunday 06 June 2010, Maxim Levitsky wrote:
> > >> On Sun, 2010-06-06 at 15:57 +0200, Rafael J. Wysocki wrote:
> > >>> On Sunday 06 June 2010, Maxim Levitsky wrote:
> > > ...
> > >
> > >>> So how does TuxOnIce help here?
> > >>
> > >> Very simple.
> > >>
> > >> With swsusp, I can save 750MB (memory) + 250MB (vram)
> > >> With full memory save I can save (1750 MB of memory) + 250 MB of
> > >> vram....
> > >
> > > So what about being able to save 1600 MB total instead of the 2 GB
> > > (which is what we're talking about in case that's not clear)? Would
> > > it be _that_ _much_ worse?
> >
> > That all depends on what is in the 400MB you discard.
> >
> > The difference is "Just as if you'd never hibernated" vs something
> > closer to "Just as if you'd only just started up". We can't make
> > categorical statements because it really does depend upon what you
> > discard and what you want to do post-resume - that is, how useful the
> > memory you discard would have been. That's always going to vary from
> > case to case.
>
> Nigel and Rafael, how about just testing it?

ISTR that can be done to some extent using TuxOnIce as is, because there is a
knob that you can use to limit the image size.
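
Something like this, from user space (the exact attribute path below is
my assumption about how the TuxOnIce sysfs tree is laid out - check the
documentation of whatever version you run):

#include <stdio.h>

#define KNOB "/sys/power/tuxonice/image_size_limit"

int main(void)
{
	FILE *f = fopen(KNOB, "w");

	if (!f) {
		perror(KNOB);
		return 1;
	}
	/* Semantics as Nigel described them earlier in the thread:
	 * >= 1 is a soft image size limit in MB, 0 frees memory only
	 * if needed, -1 aborts if freeing would be needed, -2 drops
	 * caches. */
	fprintf(f, "%d\n", 1024); /* e.g. cap the image at 1GB */
	return fclose(f) ? 1 : 0;
}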

> What's needed to have 80% of the memory saved instead of 50%?
>
> I think it's important to take the next steps towards a better snapshot in
> the mainline kernel, even if you do not agree on the complete end result yet.
>
> What about
>
> - Rafael, you review the async write patches of Nigel. If they are good,
> IMHO they should go in as soon as possible.

Yes, I'm going to do that.

> - Nigel and/or Rafael, you look at what's needed to save 80% instead of 50%
> of the memory and develop a patch for it

That would be my suggestion as well.

Thanks,
Rafael

2010-06-07 21:31:14

by Nigel Cunningham

[permalink] [raw]
Subject: Re: [TuxOnIce-devel] [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

Hi.

On 08/06/10 07:28, Rafael J. Wysocki wrote:
> On Monday 07 June 2010, Martin Steigerwald wrote:
>> On Monday 07 June 2010, Nigel Cunningham wrote:
>>> Hi.
>>
>> Hi Nigel and Rafael, hi everyone else involved,
>>
>>> On 07/06/10 05:04, Rafael J. Wysocki wrote:
>>>> On Sunday 06 June 2010, Maxim Levitsky wrote:
>>>>> On Sun, 2010-06-06 at 15:57 +0200, Rafael J. Wysocki wrote:
>>>>>> On Sunday 06 June 2010, Maxim Levitsky wrote:
>>>> ...
>>>>
>>>>>> So how does TuxOnIce help here?
>>>>>
>>>>> Very simple.
>>>>>
>>>>> With swsusp, I can save 750MB (memory) + 250MB (vram)
>>>>> With full memory save I can save (1750 MB of memory) + 250 MB of
>>>>> vram....
>>>>
>>>> So what about being able to save 1600 MB total instead of the 2 GB
>>>> (which is what we're talking about in case that's not clear)? Would
>>>> it be _that_ _much_ worse?
>>>
>>> That all depends on what is in the 400MB you discard.
>>>
>>> The difference is "Just as if you'd never hibernated" vs something
>>> closer to "Just as if you'd only just started up". We can't make
>>> categorical statements because it really does depend upon what you
>>> discard and what you want to do post-resume - that is, how useful the
>>> memory you discard would have been. That's always going to vary from
>>> case to case.
>>
>> Nigel and Rafael, how about just testing it?
>
> ISTR that can be done to some extent using TuxOnIce as is, because there is a
> knob that you can use to limit the image size.

Yes.

>> What's needed to have 80% of the memory saved instead of 50%?
>>
>> I think it's important to take the next steps towards a better snapshot in
>> the mainline kernel, even if you do not agree on the complete end result yet.
>>
>> What about
>>
>> - Rafael, you review the async write patches of Nigel. If they are good,
>> IMHO they should go in as soon as possible.
>
> Yes, I'm going to do that.

Great.

>> - Nigel and/or Rafael, you look at what's needed to save 80% instead of 50%
>> of the memory and develop a patch for it
>
> That would be my suggestion as well.

It would be no problem to merge most of the TuxOnIce code without even
thinking further about this two-part image issue, because TuxOnIce also
has a tuneable to disable the second part of the image. We could even
merge the two-part stuff and make it off by default, but I'm not sure
Rafael would accept that option.

Regards,

Nigel

2010-06-08 02:07:20

by Nigel Cunningham

[permalink] [raw]
Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

Hi Rafael.

On 07/06/10 18:49, Rafael J. Wysocki wrote:
> On Monday 07 June 2010, Nigel Cunningham wrote:
>> On 07/06/10 05:04, Rafael J. Wysocki wrote:
>>> On Sunday 06 June 2010, Maxim Levitsky wrote:
>>>> On Sun, 2010-06-06 at 15:57 +0200, Rafael J. Wysocki wrote:
>>>>> On Sunday 06 June 2010, Maxim Levitsky wrote:
>>>>> So how does TuxOnIce help here?
>>>> Very simple.
>>>>
>>>> With swsusp, I can save 750MB (memory) + 250MB (vram)
>>>> With full memory save I can save (1750 MB of memory) + 250 MB of
>>>> vram....
>>>
>>> So what about being able to save 1600 MB total instead of the 2 GB
>>> (which is what we're talking about in case that's not clear)? Would it
>>> be _that_ _much_ worse?
>>
>> That all depends on what is in the 400MB you discard.
>
> Well, they are discarded following the LRU algorithm and it's very much
> like loading a program that takes 20% of your memory upfront.
>
>> The difference is "Just as if you'd never hibernated" vs something
>> closer to "Just as if you'd only just started up". We can't make
>> categorical statements because it really does depend upon what you
>> discard and what you want to do post-resume - that is, how useful the
>> memory you discard would have been. That's always going to vary from
>> case to case.
>
> Not so much.
>
> Besides, it doesn't matter too much.
>
> Let me reiterate, please. Doing serious memory management behind the back
> of the mm subsystem (and trying to do that so it doesn't notice) is wrong, and
> it only works by accident. As long as you do that, I have a problem
> with TuxOnIce.

I know we're at a point where it doesn't matter what I say - you've made
up your mind and are not going to be persuaded by anything I say.
We're degenerating from a technical discussion into emotive language.

This is why I object to the way you're picturing things. TuxOnIce isn't
doing "serious memory management behind the back of the mm subsystem" or
working "by accident". It's an algorithm that has been designed to rely
on and use both the freezer and the existing mm subsystem to provide a
means wherein we can get more reliable hibernation and a fuller image of
memory.

May I suggest that we seek to get away from this point and focus on what
we can agree on. Do you have any objection to my work in the areas of:

- speed (async I/O, multithreaded I/O)
- flexibility (support for multiple swap devices, support for non swap,
UUID support)
- tuneability (sysfs interface)
- anything else I might have forgotten to mention

If so, perhaps we can deal with those issues before I get too carried
away preparing patches to get them merged.

Regards,

Nigel

2010-06-08 09:00:58

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [linux-pm] [SUSPECTED SPAM] Re: Proposal for a new algorithm for reading & writing a hibernation image.

On Tuesday 08 June 2010, Nigel Cunningham wrote:
> Hi Rafael.
>
> On 07/06/10 18:49, Rafael J. Wysocki wrote:
> > On Monday 07 June 2010, Nigel Cunningham wrote:
> >> On 07/06/10 05:04, Rafael J. Wysocki wrote:
> >>> On Sunday 06 June 2010, Maxim Levitsky wrote:
> >>>> On Sun, 2010-06-06 at 15:57 +0200, Rafael J. Wysocki wrote:
> >>>>> On Sunday 06 June 2010, Maxim Levitsky wrote:
> >>>>> So how does TuxOnIce help here?
> >>>> Very simple.
> >>>>
> >>>> With swsusp, I can save 750MB (memory) + 250MB (vram)
> >>>> With full memory save I can save (1750 MB of memory) + 250 MB of
> >>>> vram....
> >>>
> >>> So what about being able to save 1600 MB total instead of the 2 GB
> >>> (which is what we're talking about in case that's not clear)? Would it
> >>> be _that_ _much_ worse?
> >>
> >> That all depends on what is in the 400MB you discard.
> >
> > Well, they are discarded following the LRU algorithm and it's very much
> > like loading a program that takes 20% of your memory upfront.
> >
> >> The difference is "Just as if you'd never hibernated" vs something
> >> closer to "Just as if you'd only just started up". We can't make
> >> categorical statements because it really does depend upon what you
> >> discard and what you want to do post-resume - that is, how useful the
> >> memory you discard would have been. That's always going to vary from
> >> case to case.
> >
> > Not so much.
> >
> > Besides, it doesn't matter too much.
> >
> > Let me reiterate, please. Doing serious memory management behind the back
> > of the mm subsystem (and trying to do that so it doesn't notice) is wrong, and
> > it only works by accident. As long as you do that, I have a problem
> > with TuxOnIce.
>
> I know we're at a point where it doesn't matter what I say - you've made
> up your mind and are not going to be persuaded by anything I say.
> We're degenerating from a technical discussion into emotive language.
>
> This is why I object to the way you're picturing things. TuxOnIce isn't
> doing "serious memory management behind the back of the mm subsystem" or
> working "by accident". It's an algorithm that has been designed to rely
> on and use both the freezer and the existing mm subsystem to provide a
> means wherein we can get more reliable hibernation and a fuller image of
> memory.
>
> May I suggest that we seek to get away from this point and focus on what
> we can agree on.

Sure.

> Do you have any objection to my work in the areas of:
>
> - speed (async I/O, multithreaded I/O)
> - flexibility (support for multiple swap devices, support for non swap,
> UUID support)
> - tuneability (sysfs interface)
> - anything else I might have forgotten to mention

No, that's all fine, perhaps up to some details, but fundamentally I don't
have a problem with that.

Rafael