2006-03-10 17:04:06

by Jun OKAJIMA

Subject: Faster resuming of suspend technology.



As you might know, one of the key technologies for fast booting is suspend.
Resuming from a suspend image effectively boots fast, and the very good point is
that it brings up not only the desktop and daemons but the apps as well, all at once.
But it has one big fault --- the system is slow for a while after booting because of HDD thrashing.
(I use the term suspend generically, not referring only to Nigel Cunningham's implementation.)

One solution to the thrashing issue goes like this:
1. Log the disk access pattern after boot.
2. Analyze the log and find the common disk access pattern.
3. Re-order the suspend image using that pattern.
4. Read ahead from the image after boot.

So far so good; this is a common technique for reducing disk seeks.

The problem with this approach is: "Is there a common access pattern?"
I guess there would be.
The reason is that even though what the user does is always different, the set of
pages the system needs has a common pattern. For example, pages containing the glibc
or GTK libraries are always used, so reading these pages ahead is meaningful, I suppose.

What do you think? Your opinions are very welcome.


--- Okajima, Jun. Tokyo, Japan.
http://www.machboot.com




2006-03-11 07:24:42

by Nigel Cunningham

Subject: Re: Faster resuming of suspend technology.

Hi.

On Saturday 11 March 2006 03:04, Jun OKAJIMA wrote:
> As you might know, one of the key technology of fast booting is suspending.
> actually, using suspending does fast booting. And very good point is
> not only can do booting desktop and daemons, but apps at once.
> but one big fault --- it is slow for a while after booted because of HDD
> thrashing. (I mention a term suspend as generic one, not refering only to
> Nigel Cunningham's one)
>
> One of the solution of thrashing issue is like this.
> 1. log disk access pattern after booted.
> 2. analyze the log and find common disk access pattern.
> 2. re-order a suspend image using the pattern.
> 3. read-aheading the image after booted.
>
> so far is okay. this is common technique to reduce disk seek.
>
> The problem of above way is, "Is there common access pattern?".
> I guess there would be.
> The reason is that even what user does is always different, but what pages
> it needs has common pattern. For example, pages which contain glibc or gtk
> libs are always used. So, reading ahead these pages is meaningful, I
> suppose.
>
> What you think? Your opinion is very welcome.
>
>
> --- Okajima, Jun. Tokyo, Japan.
> http://www.machboot.com

My version doesn't have this problem by default, because it saves a full image
of memory unless the user explicitly sets a (soft) upper limit on the image
size. The image is stored as contiguously as available storage allows, so
rereading it quickly isn't so much of an issue (and far less of an issue than
discarding the memory before suspending and faulting it back in from all over
the place afterwards).

That said, work has already been done along the lines that you're describing.
You might, for example, look at the OLS papers from last year. There was a
paper there describing work on almost exactly what you're describing.

Hope that helps.

Nigel



2006-03-11 12:17:50

by Jun OKAJIMA

Subject: Re: Faster resuming of suspend technology.


>
>My version doesn't have this problem by default, because it saves a full image
>of memory unless the user explicitly sets a (soft) upper limit on the image
>size. The image is stored as contiguously as available storage allows, so
>rereading it quickly isn't so much of an issue (and far less of an issue than
>discarding the memory before suspending and faulting it back in from all over
>the place afterwards).
>

Yes, right. Your way avoids thrashing, but it slows down booting.
I mean, there is a trade-off between boot time and performance after boot.
But what people want is always both, not one or the other.
In particular, your way is a problem if you boot (resume) not from an HDD
but, for example, from an NFS server, a CD-R, or even over the Internet.



>
>That said, work has already been done along the lines that you're describing.
>You might, for example, look at the OLS papers from last year. There was a
>paper there describing work on almost exactly what you're describing.
>

Could you give me the URL or title of the paper?

--- Okajima, Jun. Tokyo, Japan.


2006-03-11 12:49:17

by Nigel Cunningham

Subject: Re: Faster resuming of suspend technology.

Hi.

On Saturday 11 March 2006 22:17, Jun OKAJIMA wrote:
> >My version doesn't have this problem by default, because it saves a full
> > image of memory unless the user explicitly sets a (soft) upper limit on
> > the image size. The image is stored as contiguously as available storage
> > allows, so rereading it quickly isn't so much of an issue (and far less
> > of an issue than discarding the memory before suspending and faulting it
> > back in from all over the place afterwards).
>
> Yes, right. In your way, there is no thrashing. but it slows booting.
> I mean, there is a trade-off between booting and after booted.
> But, what people would want is always both, not either.

I don't understand what you're saying. In particular, I'm not sure why/how you
think suspend functionality slows booting or what the tradeoff is "between
booting and after booted".

> Especially, your way has problem if you boot( resume ) not from HDD
> but for example, from NFS server or CD-R or even from Internet.

Resuming from the internet? Scary. Anyway, I hope I'll understand better what
you're getting at after your next reply.

> >That said, work has already been done along the lines that you're
> > describing. You might, for example, look at the OLS papers from last
> > year. There was a paper there describing work on almost exactly what
> > you're describing.
>
> Could I have URL or title of the paper?

http://www.linuxsymposium.org/2005/. I don't recall the title now, sorry, and
can't tell you whether it's in volume 1 or 2 of the proceedings, but I'm sure
it will stick out like a sore thumb.

Regards,

Nigel



2006-03-12 09:26:25

by Jun OKAJIMA

Subject: Re: Faster resuming of suspend technology.

>>
>> Yes, right. In your way, there is no thrashing. but it slows booting.
>> I mean, there is a trade-off between booting and after booted.
>> But, what people would want is always both, not either.
>
>I don't understand what you're saying. In particular, I'm not sure why/how you
>think suspend functionality slows booting or what the tradeoff is "between
>booting and after booted".
>

Sorry, I used those words in an unusual way.
By "booting" I meant just resuming, and by "after booted" I meant "after resuming".
In other words, I treat swsusp2 not as the equivalent of a notebook PC's hibernation,
but simply as a faster-booting technology.
So what I wanted to say was:

--- Reading the whole image in advance (your way) slows the resume itself.
--- Reading pages on demand (e.g. VMware) slows the apps after resume.

I hope my English is understandable...


>> Especially, your way has problem if you boot( resume ) not from HDD
>> but for example, from NFS server or CD-R or even from Internet.
>
>Resuming from the internet? Scary. Anyway, I hope I'll understand better what
>you're getting at after your next reply.
>

In Japan, it is not so scary.
We have 100 Mbps symmetric FTTH (optical Fiber To The Home);
more than a million homes have it, and the price is about 30 USD/month.
With this, you can theoretically download a 600 MB ISO image in one minute,
and in practice you can download a 100 MB suspend image within 30 seconds.
So not just click-to-run (e.g. Java applets) but "click to resume" is not a dream
but rather feasible. Do you still think it is scary under these conditions?

>> >That said, work has already been done along the lines that you're
>> > describing. You might, for example, look at the OLS papers from last
>> > year. There was a paper there describing work on almost exactly what
>> > you're describing.
>>
>> Could I have URL or title of the paper?
>
>http://www.linuxsymposium.org/2005/. I don't recall the title now, sorry, and
>can't tell you whether it's in volume 1 or 2 of the proceedings, but I'm sure
>it will stick out like a sore thumb.
>
>

I checked the URL but could not find the paper
using the keywords "Cunningham", "swsusp", or "suspend".
Could you tell me a keyword that would find it?

--- Okajima, Jun. Tokyo, Japan.

2006-03-12 17:54:55

by Jim Crilly

Subject: Re: Faster resuming of suspend technology.

On 03/12/06 06:26:17PM +0900, Jun OKAJIMA wrote:
> >>
> >> Yes, right. In your way, there is no thrashing. but it slows booting.
> >> I mean, there is a trade-off between booting and after booted.
> >> But, what people would want is always both, not either.
> >
> >I don't understand what you're saying. In particular, I'm not sure why/how you
> >think suspend functionality slows booting or what the tradeoff is "between
> >booting and after booted".
> >
>
> Sorry, I used words in not usual way.
> I refer "booting" as just resuming. And "after booted" means "after resumed".
> In other words, I treat swsusp2 as not note PC's hibernation equivalent,
> but just for faster booting technology.
> So, What I wanted to say was,
>
> --- Reading all image in advance ( your way) slows resuming itself.
> --- Reading pages on demand ( e.g. VMware) slows apps after resumed.
>
> Hope my English is understandable one...
>

But you have to read all of the pages at some point so the hard disk is
going to be the bottleneck no matter what you do. And since Suspend2
currently saves the cache as a contiguous stream, possibly compressed, it
should be a good bit faster than seeking around the disk loading the files
from the filesystem.

>
> >> Especially, your way has problem if you boot( resume ) not from HDD
> >> but for example, from NFS server or CD-R or even from Internet.
> >
> >Resuming from the internet? Scary. Anyway, I hope I'll understand better what
> >you're getting at after your next reply.
> >
>
> In Japan, it is not so scary.
> We have 100Mbps symmetric FTTH ( optical Fiber To The Home), and
> more than 1M homes have it, and price is about 30USD/month.
> With this, theoretically you can download 600MB ISO image in one min,
> and actually you can download 100MBytes suspend image within 30sec.
> So, not click to run (e.g. Java applet) but "click to resume" is not dreaming
> but rather feasible. You still think it is scary on this situation?
>

I don't think the scary part is speed, but security. I for one wouldn't
want to resume from an image hosted on a remote machine unless I had some
way to be sure it wasn't tampered with, like gpg signing or something.

> >> >That said, work has already been done along the lines that you're
> >> > describing. You might, for example, look at the OLS papers from last
> >> > year. There was a paper there describing work on almost exactly what
> >> > you're describing.
> >>
> >> Could I have URL or title of the paper?
> >
> >http://www.linuxsymposium.org/2005/. I don't recall the title now, sorry, and
> >can't tell you whether it's in volume 1 or 2 of the proceedings, but I'm sure
> >it will stick out like a sore thumb.
> >
> >
>
> I checked the URL but could not find the paper,
> with keywords of "Cunningham" or "swsusp" or "suspend".
> Could you tell me any keyword to find it?
>

I took a quick look at the PDFs and I believe the section Nigel is talking
about is called "On faster application startup times: Cache stuffing, seek
profiling, adaptive preloading" in volume 1.

Jim.

2006-03-12 21:32:35

by Andreas Mohr

Subject: Re: Faster resuming of suspend technology.

Hi,

[CC'd -ck list]

On Sat, Mar 11, 2006 at 02:04:10AM +0900, Jun OKAJIMA wrote:
>
>
> As you might know, one of the key technology of fast booting is suspending.
> actually, using suspending does fast booting. And very good point is
> not only can do booting desktop and daemons, but apps at once.
> but one big fault --- it is slow for a while after booted because of HDD thrashing.
> (I mention a term suspend as generic one, not refering only to Nigel Cunningham's one)
I think that is the case since swsusp AFAIR forces as many pages
as possible into swap and then appends some non-pageable parts
before shutting down.
Thus the system will resume with all processes fully residing
in swap space and the apps getting back to main memory
on demand only.

And... well... this sounds to me exactly like a prime task
for the newish swap prefetch work, no need for any other
special solutions here, I think.
We probably want a new flag for swap prefetch to let it know
that we just resumed from software suspend and thus need
prefetching to happen *much* faster than under normal
conditions for a short while, though (most likely by
enabling prefetching on a *non-idle* system for a minute).

Andreas Mohr

2006-03-12 22:30:27

by Con Kolivas

Subject: Re: [ck] Re: Faster resuming of suspend technology.

On Monday 13 March 2006 08:32, Andreas Mohr wrote:
> And... well... this sounds to me exactly like a prime task
> for the newish swap prefetch work, no need for any other
> special solutions here, I think.
> We probably want a new flag for swap prefetch to let it know
> that we just resumed from software suspend and thus need
> prefetching to happen *much* faster than under normal
> conditions for a short while, though (most likely by
> enabling prefetching on a *non-idle* system for a minute).

Adding a resume_swap_prefetch() called just before the resume finishes that
aggressively prefetches from swap would be easy. Please tell me if you think
adding such a function would be worthwhile.

Cheers,
Con

2006-03-12 23:09:34

by Nigel Cunningham

Subject: Re: Faster resuming of suspend technology.

Hi.

On Monday 13 March 2006 03:54, Jim Crilly wrote:
> On 03/12/06 06:26:17PM +0900, Jun OKAJIMA wrote:
> > >> Yes, right. In your way, there is no thrashing. but it slows booting.
> > >> I mean, there is a trade-off between booting and after booted.
> > >> But, what people would want is always both, not either.
> > >
> > >I don't understand what you're saying. In particular, I'm not sure
> > > why/how you think suspend functionality slows booting or what the
> > > tradeoff is "between booting and after booted".
> >
> > Sorry, I used words in not usual way.
> > I refer "booting" as just resuming. And "after booted" means "after
> > resumed". In other words, I treat swsusp2 as not note PC's hibernation
> > equivalent, but just for faster booting technology.
> > So, What I wanted to say was,
> >
> > --- Reading all image in advance ( your way) slows resuming itself.
> > --- Reading pages on demand ( e.g. VMware) slows apps after resumed.
> >
> > Hope my English is understandable one...
>
> But you have to read all of the pages at some point so the hard disk is
> going to be the bottleneck no matter what you do. And since Suspend2
> currently saves the cache as a contiguous stream, possibly compressed, it
> should be a good bit faster than seeking around the disk loading the files
> from the filesystem.

Agreed.

> > >> Especially, your way has problem if you boot( resume ) not from HDD
> > >> but for example, from NFS server or CD-R or even from Internet.
> > >
> > >Resuming from the internet? Scary. Anyway, I hope I'll understand better
> > > what you're getting at after your next reply.
> >
> > In Japan, it is not so scary.
> > We have 100Mbps symmetric FTTH ( optical Fiber To The Home), and
> > more than 1M homes have it, and price is about 30USD/month.
> > With this, theoretically you can download 600MB ISO image in one min,
> > and actually you can download 100MBytes suspend image within 30sec.
> > So, not click to run (e.g. Java applet) but "click to resume" is not
> > dreaming but rather feasible. You still think it is scary on this
> > situation?
>
> I don't think the scary part is speed, but security. I for one wouldn't
> want to resume from an image hosted on a remote machine unless I had some
> way to be sure it wasn't tampered with, like gpg signing or something.

Another issue is that at the moment, hotplugging is work in progress. In
order to resume, you currently need the same kernel build you're booting
with, and the same hardware configuration in the resumed system. As hotplug
matures, this restriction might relax, and we could probably come up with a
way around the former restriction, but at the moment, it really only makes
sense to try to resume an image you created using the same machine.

> > >> >That said, work has already been done along the lines that you're
> > >> > describing. You might, for example, look at the OLS papers from last
> > >> > year. There was a paper there describing work on almost exactly what
> > >> > you're describing.
> > >>
> > >> Could I have URL or title of the paper?
> > >
> > >http://www.linuxsymposium.org/2005/. I don't recall the title now,
> > > sorry, and can't tell you whether it's in volume 1 or 2 of the
> > > proceedings, but I'm sure it will stick out like a sore thumb.
> >
> > I checked the URL but could not find the paper,
> > with keywords of "Cunningham" or "swsusp" or "suspend".
> > Could you tell me any keyword to find it?
>
> I took a quick look at the PDFs and I believe the section Nigel is talking
> about is called "On faster application startup times: Cache stuffing, seek
> profiling, adaptive preloading" in volume 1.

Yes, that's the one. Sorry - I didn't mean to give the impression that I wrote
it. I know about it because I attended the talk.

Regards,

Nigel



2006-03-13 01:46:25

by Nigel Cunningham

Subject: Re: [ck] Re: Faster resuming of suspend technology.

Hi.

On Monday 13 March 2006 08:30, Con Kolivas wrote:
> On Monday 13 March 2006 08:32, Andreas Mohr wrote:
> > And... well... this sounds to me exactly like a prime task
> > for the newish swap prefetch work, no need for any other
> > special solutions here, I think.
> > We probably want a new flag for swap prefetch to let it know
> > that we just resumed from software suspend and thus need
> > prefetching to happen *much* faster than under normal
> > conditions for a short while, though (most likely by
> > enabling prefetching on a *non-idle* system for a minute).
>
> Adding a resume_swap_prefetch() called just before the resume finishes that
> aggressively prefetches from swap would be easy. Please tell me if you
> think adding such a function would be worthwhile.

My 2c would be that swsusp is broken in a number of ways in discarding those
pages in the first place:

- Forcing pages out to swap by vm pressure is an inefficient way of writing
the pages.
- It doesn't get the pages compressed, and so makes inefficient use of the
storage and forces more pages to be discarded than would otherwise be
necessary.
- Bringing the pages back in by swap prefetching or swapoffing or whatever is
equally inefficient (I was going to say 'particularly in low memory
situations', but immediately ate my words as I remembered that if you've just
swsusp'd, you've freed at least half of memory anyway).
- This technique doesn't guarantee that the pages you end up with in memory
are the pages that you're actually most likely to want. The vast majority of
what you really want will simply have been discarded rather than swapped.

Having said that, Rafael is making some progress in these areas, such that
swsusp is eating less memory than it used to, so that swap prefetching will
be less important at resume time than it has been in the past.

Hope this helps.

Nigel



2006-03-13 10:06:37

by Pavel Machek

Subject: Re: Faster resuming of suspend technology.

On Ne 12-03-06 22:32:28, Andreas Mohr wrote:
> Hi,
>
> [CC'd -ck list]
>
> On Sat, Mar 11, 2006 at 02:04:10AM +0900, Jun OKAJIMA wrote:
> >
> >
> > As you might know, one of the key technology of fast booting is suspending.
> > actually, using suspending does fast booting. And very good point is
> > not only can do booting desktop and daemons, but apps at once.
> > but one big fault --- it is slow for a while after booted because of HDD thrashing.
> > (I mention a term suspend as generic one, not refering only to Nigel Cunningham's one)
> I think that is the case since swsusp AFAIR forces as many pages
> as possible into swap and then appends some non-pageable parts
> before shutting down.
> Thus the system will resume with all processes fully residing
> in swap space and the apps getting back to main memory
> on demand only.

Actually... not any more, see /sys/power/image_size.

> And... well... this sounds to me exactly like a prime task
> for the newish swap prefetch work, no need for any other
> special solutions here, I think.
> We probably want a new flag for swap prefetch to let it know
> that we just resumed from software suspend and thus need
> prefetching to happen *much* faster than under normal
> conditions for a short while, though (most likely by
> enabling prefetching on a *non-idle* system for a minute).

Yep, that would be nice. We are actually able to save up to half of the
pagecache, so the situation is not as bad as it used to be.
Pavel

2006-03-13 10:12:50

by Pavel Machek

Subject: Re: [ck] Re: Faster resuming of suspend technology.

On Po 13-03-06 11:43:55, Nigel Cunningham wrote:
> Hi.
>
> On Monday 13 March 2006 08:30, Con Kolivas wrote:
> > On Monday 13 March 2006 08:32, Andreas Mohr wrote:
> > > And... well... this sounds to me exactly like a prime task
> > > for the newish swap prefetch work, no need for any other
> > > special solutions here, I think.
> > > We probably want a new flag for swap prefetch to let it know
> > > that we just resumed from software suspend and thus need
> > > prefetching to happen *much* faster than under normal
> > > conditions for a short while, though (most likely by
> > > enabling prefetching on a *non-idle* system for a minute).
> >
> > Adding a resume_swap_prefetch() called just before the resume finishes that
> > aggressively prefetches from swap would be easy. Please tell me if you
> > think adding such a function would be worthwhile.
>
> My 2c would be that swsusp is broken in a number of ways in discarding those
> pages in the first place:

Yep, feel free to submit a patch.

> - Forcing pages out to swap by vm pressure is an inefficient way of writing
> the pages.

Really? The VM subsystem is supposed to be efficient.

> - It doesn't get the pages compressed, and so makes inefficient use of the
> storage and forces more pages to be discarded that would otherwise be
> necessary.

"more pages to be discarded" is untrue. If you want to argue that swap
needs to be compressed, feel free to submit patches for swap
compression.

(Compression is actually not as important as you paint it. Rafael
implemented it, only to find that it gives a 20 percent speedup in
common cases -- and gzip actually slows things down.)

> - Bringing the pages back in by swap prefetching or swapoffing or whatever is
> equally inefficient (I was going to say 'particularly in low memory
> situations', but immediately ate my words as I remembered that if you've just
> swsusp'd, you've freed at least half of memory anyway).

...but it allows you to use the machine immediately after resume, which
is what people want, as you have just seen.

> - This technique doesn't guarantee that the pages you end up with in memory
> are the pages that you're actually most likely to want. The vast majority of
> what you really want will simply have been discarded rather than swapped.
>
> Having said that, Rafael is making some progress in these areas, such that
> swsusp is eating less memory than it used to, so that swap prefetching will
> be less important at resume time than it has been in the past.
>
> Hope this helps.
>
> Nigel




2006-03-13 10:36:27

by Con Kolivas

Subject: Re: [ck] Re: Faster resuming of suspend technology.

On Monday 13 March 2006 21:06, Pavel Machek wrote:
> On Ne 12-03-06 22:32:28, Andreas Mohr wrote:
> > And... well... this sounds to me exactly like a prime task
> > for the newish swap prefetch work, no need for any other
> > special solutions here, I think.
> > We probably want a new flag for swap prefetch to let it know
> > that we just resumed from software suspend and thus need
> > prefetching to happen *much* faster than under normal
> > conditions for a short while, though (most likely by
> > enabling prefetching on a *non-idle* system for a minute).
>
> Yep, that would be nice. We are actually able to save up-to half of
> pagecache, so situation is not as bad as it used to be.

I would be happy to extend swap prefetch's capabilities to improve resume. It
wouldn't be too hard to add a special post_resume_swap_prefetch() which
aggressively prefetches for a while. Excuse my ignorance, though, as I know
little about swsusp. Are there pages still on swap space after a resume
cycle?

Cheers,
Con

2006-03-13 10:43:39

by Pavel Machek

Subject: Re: [ck] Re: Faster resuming of suspend technology.

On Po 13-03-06 21:35:59, Con Kolivas wrote:
> On Monday 13 March 2006 21:06, Pavel Machek wrote:
> > On Ne 12-03-06 22:32:28, Andreas Mohr wrote:
> > > And... well... this sounds to me exactly like a prime task
> > > for the newish swap prefetch work, no need for any other
> > > special solutions here, I think.
> > > We probably want a new flag for swap prefetch to let it know
> > > that we just resumed from software suspend and thus need
> > > prefetching to happen *much* faster than under normal
> > > conditions for a short while, though (most likely by
> > > enabling prefetching on a *non-idle* system for a minute).
> >
> > Yep, that would be nice. We are actually able to save up-to half of
> > pagecache, so situation is not as bad as it used to be.
>
> I would be happy to extend swap prefetch's capabilities to improve
> resume. It

That would be nice.

> wouldn't be too hard to add a special post_resume_swap_prefetch() which
> aggressively prefetches for a while. Excuse my ignorance, though, as I know
> little about swsusp. Are there pages still on swap space after a resume
> cycle?

Yes, there are, most of the time. Let me explain:

swsusp needs half of memory free. So it shrinks the caches (by emulating
memory pressure) until half of memory is free (and optionally shrinks
them some more). Pages are pushed into swap by this process.

Now, that works perfectly okay for me (on a 1.5 GB machine). I can
imagine that on a 128 MB machine, shrinking the caches to 64 MB could hurt a
bit. I guess we'll need to find someone interested with a small-memory
machine (if there are no such people, we can happily ignore the issue
:-).
Pavel

2006-03-13 11:13:28

by Andreas Mohr

Subject: Re: [ck] Re: Faster resuming of suspend technology.

Hi,

On Mon, Mar 13, 2006 at 11:43:15AM +0100, Pavel Machek wrote:
> On Po 13-03-06 21:35:59, Con Kolivas wrote:
> > wouldn't be too hard to add a special post_resume_swap_prefetch() which
> > aggressively prefetches for a while. Excuse my ignorance, though, as I know
> > little about swsusp. Are there pages still on swap space after a resume
> > cycle?
>
> Yes, there are, most of the time. Let me explain:
>
> swsusp needs half of memory free. So it shrinks caches (by emulating
> memory pressure) so that half of memory if free (and optionaly shrinks
> them some more). Pages are pushed into swap by this process.
>
> Now, that works perfectly okay for me (with 1.5GB machine). I can
> imagine that on 128MB machine, shrinking caches to 64MB could hurt a
> bit. I guess we'll need to find someone interested with small memory
> machine (if there are no such people, we can happily ignore the issue
> :-).

Why not simply use the mem= boot parameter?
Or is that impossible for some reason in this specific case?

I have a P3/450 256M machine where I could do some tests if really needed.

Andreas Mohr

2006-03-13 11:37:48

by Pavel Machek

Subject: does swsusp suck after resume for you? [was Re: [ck] Re: Faster resuming of suspend technology.]

On Po 13-03-06 12:13:26, Andreas Mohr wrote:
> Hi,
>
> On Mon, Mar 13, 2006 at 11:43:15AM +0100, Pavel Machek wrote:
> > On Po 13-03-06 21:35:59, Con Kolivas wrote:
> > > wouldn't be too hard to add a special post_resume_swap_prefetch() which
> > > aggressively prefetches for a while. Excuse my ignorance, though, as I know
> > > little about swsusp. Are there pages still on swap space after a resume
> > > cycle?
> >
> > Yes, there are, most of the time. Let me explain:
> >
> > swsusp needs half of memory free. So it shrinks caches (by emulating
> > memory pressure) so that half of memory if free (and optionaly shrinks
> > them some more). Pages are pushed into swap by this process.
> >
> > Now, that works perfectly okay for me (with 1.5GB machine). I can
> > imagine that on 128MB machine, shrinking caches to 64MB could hurt a
> > bit. I guess we'll need to find someone interested with small memory
> > machine (if there are no such people, we can happily ignore the issue
> > :-).
>
> Why not simply use the mem= boot parameter?
> Or is that impossible for some reason in this specific case?
>
> I have a P3/450 256M machine where I could do some tests if really needed.

Yes, I can do mem=128M... but then, I'd prefer not to code workarounds
for machines no one uses any more.

So, I'm looking for a volunteer:

1) Does swsusp work for you? (no => bugzilla, but not interesting
here)

2) Does interactivity suck after resume? (no => you are not the right
person)

3) Does it still suck after setting image_size to a high value? (no =>
good, we have a simple fix)

[If nobody gets this far, I'll just assume the problem is solved or
hits too few people to be interesting.]

4) Congratulations, you are the right person to help. Could you test
whether Con's patches help?
Pavel

2006-03-13 12:03:40

by Con Kolivas

Subject: Re: does swsusp suck after resume for you? [was Re: Faster resuming of suspend technology.]

On Monday 13 March 2006 22:36, Pavel Machek wrote:
> 4) Congratulations, you are right person to help. Could you test if
> Con's patches help?

OK, this patch is compile-tested only, but it is reasonably straightforward
(I have no hardware to test it on atm). It relies on the previous 4 patches I
sent out that update swap prefetch. To make it easier, here is a single rolled-up
patch that goes on top of 2.6.16-rc6-mm1:

http://ck.kolivas.org/patches/swap-prefetch/2.6.16-rc6-mm1-swap_prefetch_suspend_test.patch

Otherwise the incremental patch is below.

The usual blowing-up warnings apply to this sort of patch. If it works well,
/proc/meminfo should show a very large SwapCached value after resume.
Cheers,
Con
---
include/linux/swap-prefetch.h | 5 ++
kernel/power/swsusp.c | 2
mm/swap_prefetch.c | 91 +++++++++++++++++++++++++-----------------
3 files changed, 62 insertions(+), 36 deletions(-)

Index: linux-2.6.16-rc6-mm1/include/linux/swap-prefetch.h
===================================================================
--- linux-2.6.16-rc6-mm1.orig/include/linux/swap-prefetch.h 2006-03-13 10:05:05.000000000 +1100
+++ linux-2.6.16-rc6-mm1/include/linux/swap-prefetch.h 2006-03-13 22:41:07.000000000 +1100
@@ -33,6 +33,7 @@ extern void add_to_swapped_list(struct p
extern void remove_from_swapped_list(const unsigned long index);
extern void delay_swap_prefetch(void);
extern void prepare_swap_prefetch(void);
+extern void post_resume_swap_prefetch(void);

#else /* CONFIG_SWAP_PREFETCH */
static inline void add_to_swapped_list(struct page *__unused)
@@ -50,6 +51,10 @@ static inline void remove_from_swapped_l
static inline void delay_swap_prefetch(void)
{
}
+
+static inline void post_resume_swap_prefetch(void)
+{
+}
#endif /* CONFIG_SWAP_PREFETCH */

#endif /* SWAP_PREFETCH_H_INCLUDED */
Index: linux-2.6.16-rc6-mm1/mm/swap_prefetch.c
===================================================================
--- linux-2.6.16-rc6-mm1.orig/mm/swap_prefetch.c 2006-03-13 20:12:24.000000000 +1100
+++ linux-2.6.16-rc6-mm1/mm/swap_prefetch.c 2006-03-13 22:44:30.000000000 +1100
@@ -291,43 +291,17 @@ static void examine_free_limits(void)
}

/*
- * We want to be absolutely certain it's ok to start prefetching.
+ * Have some hysteresis between where page reclaiming and prefetching
+ * will occur to prevent ping-ponging between them.
*/
-static int prefetch_suitable(void)
+static void set_suitable_nodes(void)
{
- unsigned long limit;
struct zone *z;
- int node, ret = 0, test_pagestate = 0;
-
- /* Purposefully racy */
- if (test_bit(0, &swapped.busy)) {
- __clear_bit(0, &swapped.busy);
- goto out;
- }
-
- /*
- * get_page_state and above_background_load are expensive so we only
- * perform them every SWAP_CLUSTER_MAX prefetched_pages.
- * We test to see if we're above_background_load as disk activity
- * even at low priority can cause interrupt induced scheduling
- * latencies.
- */
- if (!(sp_stat.prefetched_pages % SWAP_CLUSTER_MAX)) {
- if (above_background_load())
- goto out;
- test_pagestate = 1;
- }
-
- clear_current_prefetch_free();

- /*
- * Have some hysteresis between where page reclaiming and prefetching
- * will occur to prevent ping-ponging between them.
- */
for_each_zone(z) {
struct node_stats *ns;
unsigned long free;
- int idx;
+ int node, idx;

if (!populated_zone(z))
continue;
@@ -349,6 +323,45 @@ static int prefetch_suitable(void)
}
ns->current_free += free;
}
+}
+
+/*
+ * We want to be absolutely certain it's ok to start prefetching.
+ */
+static int prefetch_suitable(const int resume)
+{
+ unsigned long limit;
+ int node, ret = 0, test_pagestate = 0;
+
+ if (unlikely(resume)) {
+ clear_current_prefetch_free();
+ set_suitable_nodes();
+ if (!nodes_empty(sp_stat.prefetch_nodes))
+ ret = 1;
+ goto out;
+ }
+
+ /* Purposefully racy */
+ if (test_bit(0, &swapped.busy)) {
+ __clear_bit(0, &swapped.busy);
+ goto out;
+ }
+
+ /*
+ * get_page_state and above_background_load are expensive so we only
+ * perform them every SWAP_CLUSTER_MAX prefetched_pages.
+ * We test to see if we're above_background_load as disk activity
+ * even at low priority can cause interrupt induced scheduling
+ * latencies.
+ */
+ if (!(sp_stat.prefetched_pages % SWAP_CLUSTER_MAX)) {
+ if (above_background_load())
+ goto out;
+ test_pagestate = 1;
+ }
+
+ clear_current_prefetch_free();
+ set_suitable_nodes();

/*
* We iterate over each node testing to see if it is suitable for
@@ -429,7 +442,7 @@ static inline struct swapped_entry *prev
* vm is busy, we prefetch to the watermark, or the list is empty or we have
* iterated over all entries
*/
-static enum trickle_return trickle_swap(void)
+static enum trickle_return trickle_swap(const int resume)
{
enum trickle_return ret = TRICKLE_DELAY;
struct swapped_entry *entry;
@@ -438,7 +451,7 @@ static enum trickle_return trickle_swap(
* If laptop_mode is enabled don't prefetch to avoid hard drives
* doing unnecessary spin-ups
*/
- if (!swap_prefetch || laptop_mode)
+ if (!swap_prefetch || (laptop_mode && !resume))
return ret;

examine_free_limits();
@@ -448,7 +461,7 @@ static enum trickle_return trickle_swap(
swp_entry_t swp_entry;
int node;

- if (!prefetch_suitable())
+ if (!prefetch_suitable(resume))
break;

spin_lock(&swapped.lock);
@@ -491,8 +504,9 @@ static enum trickle_return trickle_swap(
entry = prev_swapped_entry(entry);
spin_unlock(&swapped.lock);

- if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY)
- break;
+ if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY &&
+ !resume)
+ break;
}

if (sp_stat.prefetched_pages) {
@@ -502,6 +516,11 @@ static enum trickle_return trickle_swap(
return ret;
}

+void post_resume_swap_prefetch(void)
+{
+ trickle_swap(1);
+}
+
static int kprefetchd(void *__unused)
{
set_user_nice(current, 19);
@@ -515,7 +534,7 @@ static int kprefetchd(void *__unused)
* TRICKLE_FAILED implies no entries left - we do not schedule
* a wakeup, and further delay the next one.
*/
- if (trickle_swap() == TRICKLE_FAILED) {
+ if (trickle_swap(0) == TRICKLE_FAILED) {
set_current_state(TASK_INTERRUPTIBLE);
schedule();
}
Index: linux-2.6.16-rc6-mm1/kernel/power/swsusp.c
===================================================================
--- linux-2.6.16-rc6-mm1.orig/kernel/power/swsusp.c 2006-03-13 10:05:05.000000000 +1100
+++ linux-2.6.16-rc6-mm1/kernel/power/swsusp.c 2006-03-13 22:42:52.000000000 +1100
@@ -49,6 +49,7 @@
#include <linux/bootmem.h>
#include <linux/syscalls.h>
#include <linux/highmem.h>
+#include <linux/swap-prefetch.h>

#include "power.h"

@@ -269,5 +270,6 @@ int swsusp_resume(void)
touch_softlockup_watchdog();
device_power_up();
local_irq_enable();
+ post_resume_swap_prefetch();
return error;
}

2006-03-13 22:51:58

by Nigel Cunningham

[permalink] [raw]
Subject: Re: [ck] Re: Faster resuming of suspend technology.

Hi.

On Monday 13 March 2006 20:12, Pavel Machek wrote:
> On Po 13-03-06 11:43:55, Nigel Cunningham wrote:
> > Hi.
> >
> > On Monday 13 March 2006 08:30, Con Kolivas wrote:
> > > On Monday 13 March 2006 08:32, Andreas Mohr wrote:
> > > > And... well... this sounds to me exactly like a prime task
> > > > for the newish swap prefetch work, no need for any other
> > > > special solutions here, I think.
> > > > We probably want a new flag for swap prefetch to let it know
> > > > that we just resumed from software suspend and thus need
> > > > prefetching to happen *much* faster than under normal
> > > > conditions for a short while, though (most likely by
> > > > enabling prefetching on a *non-idle* system for a minute).
> > >
> > > Adding a resume_swap_prefetch() called just before the resume finishes
> > > that aggressively prefetches from swap would be easy. Please tell me if
> > > you think adding such a function would be worthwhile.
> >
> > My 2c would be that swsusp is broken in a number of ways in discarding
> > those pages in the first place:
>
> Yep, feel free to submit a patch.
>
> > - Forcing pages out to swap by vm pressure is an inefficient way of
> > writing the pages.
>
> Really? VM subsystem is supposed to be effective.

All the scanning of pages and so on when freeing memory takes time. Likewise
at resume, figuring out what to fault back in or swap back in takes time.
Then you have to wait for the read to get processed, perhaps while other
processes are running, demanding disk to swap back in the file-backed pages
that were just discarded. The alternative is not discarding them, taking a
little longer to write them to disk and read them back in at resume, but
having fewer calculations to do and less thrashing of the disk because the
data is stored as contiguously as storage allows.

> > - It doesn't get the pages compressed, and so makes inefficient use of
> > the storage and forces more pages to be discarded that would otherwise be
> > necessary.
>
> "more pages to be discarded" is untrue. If you want to argue that swap
> needs to be compressed, feel free to submit patches for swap
> compression.

If I'm trying to store an image of 5000 pages and I have 3000 pages of storage
available, I can compress them with LZF and put all 5000 on disk (assuming
the common 50% compression), or discard 2000. More pages are discarded without
compression.

> (Compression is actually not as important as you paint it. Rafael
> implemented it, only to find out that it is 20 percent speedup in
> common cases -- and your gzip actually slows things down.)

I don't use gzip. I agree it slows things down. But 20%? What algorithm did
you use? It will also depend on the speed of your cpu and drive. (If the cpu
is fast but the drive is slow or you're still only using synchronous I/O,
yes, the improvement might only be 20%).

> > - Bringing the pages back in by swap prefetching or swapoffing or
> > whatever is equally inefficient (I was going to say 'particularly in low
> > memory situations', but immediately ate my words as I remembered that if
> > you've just swsusp'd, you've freed at least half of memory anyway).
>
> ...but allows you to use machine immediately after resume, which
> people want, as you have just seen.

Just?

> > - This technique doesn't guarantee that the pages you end up with in
> > memory are the pages that you're actually most likely to want. The vast
> > majority of what you really want will simply have been discarded rather
> > than swapped.
> >
> > Having said that, Rafael is making some progress in these areas, such
> > that swsusp is eating less memory than it used to, so that swap
> > prefetching will be less important at resume time than it has been in the
> > past.

Regards,

Nigel



2006-03-14 05:12:53

by Con Kolivas

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you? [was Re: Faster resuming of suspend technology.]

On Mon, 13 Mar 2006 11:03 pm, Con Kolivas wrote:
> On Monday 13 March 2006 22:36, Pavel Machek wrote:
> > 4) Congratulations, you are right person to help. Could you test if
> > Con's patches help?
>
> Ok, this patch is only compile tested but is reasonably
> straightforward. (I have no hardware to test it on atm). It relies on the previous
> 4 patches I sent out that update swap prefetch. To make it easier here is a
> single rolled up patch that goes on top of 2.6.16-rc6-mm1:
>
> http://ck.kolivas.org/patches/swap-prefetch/2.6.16-rc6-mm1-swap_prefetch_su
>spend_test.patch

Since my warning probably scared everyone off actually trying this patch I've
given it a thorough working over on my own laptop, booting with mem=128M. The
patch works fine and basically with the patch after resuming from disk I have
25MB more memory in use with pages prefetched from swap. This makes a
noticeable difference to me. That's a pretty artificial workload, so if
someone who actually has lousy wakeup after resume could test the patch it
would be appreciated.

Cheers,
Con

2006-03-14 08:24:24

by Andreas Mohr

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you? [was Re: Faster resuming of suspend technology.]

Hi,

On Tue, Mar 14, 2006 at 04:13:10PM +1100, Con Kolivas wrote:
> Since my warning probably scared anyone from actually trying this patch I've
> given it a thorough working over on my own laptop, booting with mem=128M. The
> patch works fine and basically with the patch after resuming from disk I have
> 25MB more memory in use with pages prefetched from swap. This makes a
> noticeable difference to me. That's a pretty artificial workload, so if
> someone who actually has lousy wakeup after resume could test the patch it
> would be appreciated.

I did try it, but ran into weird unrelated compile failures multiple times
(sorry, no log).

Andreas Mohr

2006-03-14 10:33:03

by Pavel Machek

[permalink] [raw]
Subject: Re: [ck] Re: Faster resuming of suspend technology.

Hi!

> > > - It doesn't get the pages compressed, and so makes inefficient use of
> > > the storage and forces more pages to be discarded that would otherwise be
> > > necessary.
> >
> > "more pages to be discarded" is untrue. If you want to argue that swap
> > needs to be compressed, feel free to submit patches for swap
> > compression.
>
> If I'm trying to store an image of 5000 pages and I have 3000 pages of storage
> available, I can compress them with LZF and put all 5000 on disk (assuming
> the common 50% compression), or discard 2000. More pages are discarded without
> compression.

Ok, you are right; if the user is low on swap space, that's what will
happen. It is an uncommon case, so I forgot about it.

> > (Compression is actually not as important as you paint it. Rafael
> > implemented it, only to find out that it is 20 percent speedup in
> > common cases -- and your gzip actually slows things down.)
>
> I don't use gzip. I agree it slows things down. But 20%? What algorithm did
> you use? It will also depend on the speed of your cpu and drive. (If the cpu
> is fast but the drive is slow or you're still only using synchronous I/O,
> yes, the improvement might only be 20%).

LZF. The problem is not the disk/compression speed; the problem is that
other stuff takes way too long, like the copy of memory (I have 1.5G here)
and the preparing of drivers... I think that takes about as long as actually
writing it to disk. Then there's the system boot included in
resume... that takes ages.

I'll probably have to figure out which drivers take long to suspend :-(.

> > > - Bringing the pages back in by swap prefetching or swapoffing or
> > > whatever is equally inefficient (I was going to say 'particularly in low
> > > memory situations', but immediately ate my words as I remembered that if
> > > you've just swsusp'd, you've freed at least half of memory anyway).
> >
> > ...but allows you to use machine immediately after resume, which
> > people want, as you have just seen.
>
> Just?

Well, in the beginning of this thread someone wanted fast resume *and* a
responsive system after it.

Old swsusp is fast resume (little data loaded), unresponsive system.
suspend 2 is slower resume (more data), responsive system.
swsusp + Con's patch should give:

fast resume to prompt (little data loaded)
unresponsive system at the very beginning, but becoming okay as the
background thread pulls back swapped pages.

I like his solution:
1) It is good for the user: seeing prompt early means user can start
typing commands etc.
2) It is simple enough for me :-).
Pavel

2006-03-14 11:52:14

by Pavel Machek

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you? [was Re: Faster resuming of suspend technology.]

Hi!

> > > 4) Congratulations, you are right person to help. Could you test if
> > > Con's patches help?
> >
> > Ok, this patch is only compile tested but is reasonably
> > straightforward. (I have no hardware to test it on atm). It relies on the previous
> > 4 patches I sent out that update swap prefetch. To make it easier here is a
> > single rolled up patch that goes on top of 2.6.16-rc6-mm1:
> >
> > http://ck.kolivas.org/patches/swap-prefetch/2.6.16-rc6-mm1-swap_prefetch_su
> >spend_test.patch
>
> Since my warning probably scared anyone from actually trying this patch I've
> given it a thorough working over on my own laptop, booting with mem=128M. The
> patch works fine and basically with the patch after resuming from disk I have
> 25MB more memory in use with pages prefetched from swap. This makes a
> noticeable difference to me. That's a pretty artificial workload, so if
> someone who actually has lousy wakeup after resume could test the patch it
> would be appreciated.

Thanks for the patch...

BTW.. if you want this maximally useful, it would be nice to have a
userspace interface for this. swsusp is done from userspace these days
(-mm kernel), and I guess it would be useful for "we have just
finished a big&ugly memory trashing job, can we get our interactivity
back"? Like I'd probably cron-schedule it just after updatedb, and
add it to scripts after particular linguistic experiments...
Pavel

2006-03-14 12:33:37

by Con Kolivas

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you? [was Re: Faster resuming of suspend technology.]

On Tuesday 14 March 2006 22:51, Pavel Machek wrote:
> > Since my warning probably scared anyone from actually trying this patch
> > I've given it a thorough working over on my own laptop, booting with
> > mem=128M. The patch works fine and basically with the patch after
> > resuming from disk I have 25MB more memory in use with pages prefetched
> > from swap. This makes a noticeable difference to me. That's a pretty
> > artificial workload, so if someone who actually has lousy wakeup after
> > resume could test the patch it would be appreciated.
>
> Thanks for the patch...
>
> BTW.. if you want this maximally useful, it would be nice to have
> userspace interface for this.

What sort of interface is suitable? There's a swap_prefetch tunable that is a
boolean but I could make that to be off=0, on=1, aggressive_prefetch=2 or
something. Or did you have something else in mind?

> swsusp is done from userspace these days
> (-mm kernel), and I guess it would be useful for "we have just
> finished big&ugly memory trashing job, can we get our interactivity
> back"? Like I'd probably cron-scheduled it just after updatedb, and
> added it to scripts after particular lingvistic experiments...

It may not be immediately obvious, but free RAM is required for swap prefetch
to do anything, so as not to have any negative effects on normal vm function.
After updatedb runs, RAM is full, so swap prefetch unfortunately does
nothing. You could certainly do some pointless "ram touching" to free
up the cached RAM in an updatedb script, but that could evict
something useful, so I wouldn't recommend it.

Cheers,
Con

2006-03-14 12:44:19

by Pavel Machek

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you? [was Re: Faster resuming of suspend technology.]

On Út 14-03-06 23:33:12, Con Kolivas wrote:
> On Tuesday 14 March 2006 22:51, Pavel Machek wrote:
> > > Since my warning probably scared anyone from actually trying this patch
> > > I've given it a thorough working over on my own laptop, booting with
> > > mem=128M. The patch works fine and basically with the patch after
> > > resuming from disk I have 25MB more memory in use with pages prefetched
> > > from swap. This makes a noticeable difference to me. That's a pretty
> > > artificial workload, so if someone who actually has lousy wakeup after
> > > resume could test the patch it would be appreciated.
> >
> > Thanks for the patch...
> >
> > BTW.. if you want this maximally useful, it would be nice to have
> > userspace interface for this.
>
> What sort of interface is suitable? There's a swap_prefetch tunable that is a
> boolean but I could make that to be off=0, on=1, aggressive_prefetch=2 or
> something.

That sounds nice.
Pavel

2006-03-14 17:36:56

by Lee Revell

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you? [was Re: Faster resuming of suspend technology.]

On Tue, 2006-03-14 at 12:51 +0100, Pavel Machek wrote:
> "we have just
> finished big&ugly memory trashing job, can we get our interactivity
> back"? Like I'd probably cron-scheduled it just after updatedb

The updatedb problem is STILL not solved? I remember someone proposed
years ago to have it use fcntl() or fadvise() to tell the kernel that we
are about to read every file on the system and to please not wipe the
cache - I guess this was never done?

Lee

2006-03-14 18:07:49

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you? [was Re: Faster resuming of suspend technology.]

On Monday 13 March 2006 13:03, Con Kolivas wrote:
> On Monday 13 March 2006 22:36, Pavel Machek wrote:
> > 4) Congratulations, you are right person to help. Could you test if
> > Con's patches help?
>
> Ok, this patch is only compile tested but is reasonably straightforward.
> (I have no hardware to test it on atm). It relies on the previous 4 patches I
> sent out that update swap prefetch. To make it easier here is a single rolled
> up patch that goes on top of 2.6.16-rc6-mm1:
>
> http://ck.kolivas.org/patches/swap-prefetch/2.6.16-rc6-mm1-swap_prefetch_suspend_test.patch
>
> Otherwise the incremental patch is below.
>
> Usual blowing up warnings apply with this sort of patch. If it works well then
> /proc/meminfo should show a very large SwapCached value after resume.
>
}-- snip --{
> Index: linux-2.6.16-rc6-mm1/kernel/power/swsusp.c
> ===================================================================
> --- linux-2.6.16-rc6-mm1.orig/kernel/power/swsusp.c 2006-03-13 10:05:05.000000000 +1100
> +++ linux-2.6.16-rc6-mm1/kernel/power/swsusp.c 2006-03-13 22:42:52.000000000 +1100
> @@ -49,6 +49,7 @@
> #include <linux/bootmem.h>
> #include <linux/syscalls.h>
> #include <linux/highmem.h>
> +#include <linux/swap-prefetch.h>
>
> #include "power.h"
>
> @@ -269,5 +270,6 @@ int swsusp_resume(void)
> touch_softlockup_watchdog();
> device_power_up();
> local_irq_enable();
> + post_resume_swap_prefetch();
> return error;
> }

Hm, this code is only executed if there's an error during resume. You should
have placed the post_resume_swap_prefetch() call in swsusp_suspend(). :-)

Greetings,
Rafael

2006-03-14 21:35:17

by Con Kolivas

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you? [was Re: Faster resuming of suspend technology.]

On Wednesday 15 March 2006 04:36, Lee Revell wrote:
> On Tue, 2006-03-14 at 12:51 +0100, Pavel Machek wrote:
> > "we have just
> > finished big&ugly memory trashing job, can we get our interactivity
> > back"? Like I'd probably cron-scheduled it just after updatedb
>
> The updatedb problem is STILL not solved? I remember someone proposed
> years ago to have it use fcntl() or fadvise() to tell the kernel that we
> are about to read every file on the system and to please not wipe the
> cache - I guess this was never done?

There is a POSIX_FADV_DONTNEED (I think that's the one), but userspace needs
updating to actually use it.

Cheers,
Con

2006-03-14 21:46:00

by Con Kolivas

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you? [was Re: Faster resuming of suspend technology.]

On Wednesday 15 March 2006 05:06, Rafael J. Wysocki wrote:
> On Monday 13 March 2006 13:03, Con Kolivas wrote:
> > @@ -269,5 +270,6 @@ int swsusp_resume(void)
> > touch_softlockup_watchdog();
> > device_power_up();
> > local_irq_enable();
> > + post_resume_swap_prefetch();
> > return error;
> > }
>
> Hm, this code is only executed if there's an error during resume. You
> should have placed the post_resume_swap_prefetch() call in
> swsusp_suspend(). :-)

Gee you guys are fussy. You want the code to actually do what it's advertised
to do?

Anyway perhaps it was ordinary swap prefetch that was making the difference
after all. I think I'll let the current swap prefetch code settle for a while
before touching this just yet.

Cheers,
Con

2006-03-15 13:00:14

by Stefan Seyfried

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you? [was Re: Re: Faster resuming of suspend technology.]

On Mon, Mar 13, 2006 at 12:36:31PM +0100, Pavel Machek wrote:

> Yes, I can do mem=128M... but then, I'd prefer not to code workarounds
> for machines noone uses any more.

I have machines that cannot be upgraded to more than 192MB and would
like to continue using them :-)

> 3) Does it still suck after setting image_size to high value (no =>
> good, we have simple fix)

No matter how high you set image_size, it will never be bigger than
~64MB on a 128MB machine, or I have gotten something seriously wrong.
--
Stefan Seyfried \ "I didn't want to write for pay. I
QA / R&D Team Mobile Devices \ wanted to be paid for what I write."
SUSE LINUX Products GmbH, Nürnberg \ -- Leonard Cohen

2006-03-15 18:08:36

by Pavel Machek

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you? [was Re: Re: Faster resuming of suspend technology.]


On Wed 15-03-06 11:37:11, Stefan Seyfried wrote:
> On Mon, Mar 13, 2006 at 12:36:31PM +0100, Pavel Machek wrote:
>
> > Yes, I can do mem=128M... but then, I'd prefer not to code workarounds
> > for machines noone uses any more.
>
> I have machines that cannot be upgraded to more than 192MB and would
> like to continue using them :-)

Good :-).

> > 3) Does it still suck after setting image_size to high value (no =>
> > good, we have simple fix)
>
> no matter how high you set image_size, it will never be bigger than
> ~64MB on a 128MB machine, or i have gotten something seriously wrong.

No, you are right, but maybe a 64MB image is enough to get acceptable
interactivity after resume? I'd like you to check it.

(It will probably suck. In that case, testing Con's patch would be
nice -- after the trivial fix Rafael pointed out).
Pavel
--
Thanks, Sharp!

2006-03-15 21:34:57

by Nigel Cunningham

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you? [was Re: Re: Faster resuming of suspend technology.]

Hi.

On Thursday 16 March 2006 03:59, Pavel Machek wrote:
> On Wed 15-03-06 11:37:11, Stefan Seyfried wrote:
> > On Mon, Mar 13, 2006 at 12:36:31PM +0100, Pavel Machek wrote:
> > > Yes, I can do mem=128M... but then, I'd prefer not to code workarounds
> > > for machines noone uses any more.
> >
> > I have machines that cannot be upgraded to more than 192MB and would
> > like to continue using them :-)
>
> Good :-).
>
> > > 3) Does it still suck after setting image_size to high value (no =>
> > > good, we have simple fix)
> >
> > no matter how high you set image_size, it will never be bigger than
> > ~64MB on a 128MB machine, or i have gotten something seriously wrong.
>
> No, you are right, but maybe 64MB image is enough to get acceptable
> interactivity after resume? I'd like you to check it.
>
> (It will probably suck. In such case, testing Con's patch would be
> nice -- after trivial fix rafael pointed out).

If you could also test suspend2, that would be good. I've gained some renewed
motivation for getting it merged, and hearing that it still does better than
swsusp + extras would be helpful in building the case for it.

Regards,

Nigel



2006-03-16 02:20:07

by Con Kolivas

[permalink] [raw]
Subject: swsusp_suspend continues?

Hi Pavel

I've been playing with hooking the post-resume swap prefetch code into
swsusp_suspend and just started noticing this on 2.6.16-rc6-mm1:
during the _suspend_ to disk cycle on this machine, the swsusp_suspend function
appears to continue beyond swsusp_arch_suspend, as I get the same messages
that I would normally get during a resume cycle, such as this:

Suspending device platform
swsusp: Need to copy 14852 pages
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0
and...
eth1: Coming out of suspend...
and so on

but then it manages to write to disk and power down anyway. Is this correct?

If I put post_resume_swap_prefetch at the end of swsusp_suspend it hits that
function on both resume _and_ suspend cycles. Am I missing something?

Cheers,
Con

2006-03-16 09:19:57

by Pavel Machek

[permalink] [raw]
Subject: Re: swsusp_suspend continues?

On Čt 16-03-06 13:20:35, Con Kolivas wrote:
> Hi Pavel
>
> I've been playing with hooking in the post resume swap prefetch code into
> swsusp_suspend and just started noting this on 2.6.16-rc6-mm1:
> During the _suspend_ to disk cycle on this machine the swsusp_suspend function
> appears to continue beyond swsusp_arch_suspend as I get the same messages
> that I would normally get during a resume cycle such as this:
>
> Suspending device platform
> swsusp: Need to copy 14852 pages
> Intel machine check architecture supported.
> Intel machine check reporting enabled on CPU#0
> and...
> eth1: Coming out of suspend...
> and so on
>
> but then it manages to write to disk and power down anyway. Is this correct?

Yes. We need our hardware enabled for image write (disk would be
enough), so we resume it (and we resume all of it, because that was
easier to code).

> If I put post_resume_swap_prefetch at the end of swsusp_suspend it hits that
> function on both resume _and_ suspend cycles. Am I missing something?

No. That's just the way it is. See

/* Restore control flow magically appears here */

and

/* Code below is only ever reached in case of failure. Otherwise
* execution continues at place where swsusp_arch_suspend was called
*/
BUG_ON(!error);

Yes, I agree it is confusing, and feel free to suggest comment cleanups.

I'd suggest you hook at disk.c:pm_suspend_disk.

Or just include that /sys interface, and trigger it from userspace
just after resume. Actually I like that best. It is optional, it can
be triggered from userspace, and you will not have to deal with
suspend internals.

(And it will be useful to uswsusp, too, which avoids big chunks of
in-kernel suspend code).

Pavel

2006-03-16 10:34:17

by Con Kolivas

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you?

On Thursday 16 March 2006 04:59, Pavel Machek wrote:
> (It will probably suck. In such case, testing Con's patch would be
> nice -- after trivial fix rafael pointed out).

Ok here's a patch I've booted and tested with a modification to swap prefetch
that others might find useful, not just swsusp.

The tunable in /proc/sys/vm/swap_prefetch is now bitwise ORed:
1 = Normal background swap prefetching when load is light
2 = Aggressively swap prefetch as much as possible

And once the "aggressive" bit is set it will prefetch as much as it can and
then disable the aggressive bit. Thus if you set this value to 3 it will
prefetch aggressively and then drop back to the default of 1. This makes it
easy to simply set the aggressive flag once and forget about it. I've booted
and tested this feature and it's working nicely. Where exactly you'd set this
in your resume scripts I'm not sure. A rolled up patch against 2.6.16-rc6-mm1
is here for simplicity:
http://ck.kolivas.org/patches/swap-prefetch/2.6.16-rc6-mm1-swap_prefetch_suspend_test.patch

and the incremental on top of the 4 patches pending for the next -mm is below.

Comments and testers most welcome.

Cheers,
Con
---
Documentation/sysctl/vm.txt | 9 +++
mm/swap_prefetch.c | 119 +++++++++++++++++++++++++++++---------------
2 files changed, 90 insertions(+), 38 deletions(-)

Index: linux-2.6.16-rc6-mm1/mm/swap_prefetch.c
===================================================================
--- linux-2.6.16-rc6-mm1.orig/mm/swap_prefetch.c 2006-03-16 20:26:45.000000000 +1100
+++ linux-2.6.16-rc6-mm1/mm/swap_prefetch.c 2006-03-16 21:06:50.000000000 +1100
@@ -27,8 +27,18 @@
*/
#define PREFETCH_DELAY (HZ * 5)

-/* sysctl - enable/disable swap prefetching */
-int swap_prefetch __read_mostly = 1;
+#define PREFETCH_NORMAL (1 << 0)
+#define PREFETCH_AGGRESSIVE (1 << 1)
+/*
+ * sysctl - enable/disable swap prefetching bits
+ * This is composed of the bitflags PREFETCH_NORMAL and PREFETCH_AGGRESSIVE.
+ * Once PREFETCH_AGGRESSIVE is set, swap prefetching will be peformed as much
+ * as possible irrespective of load conditions and then the
+ * PREFETCH_AGGRESSIVE bit will be unset.
+ */
+int swap_prefetch __read_mostly = PREFETCH_NORMAL;
+
+#define aggressive_prefetch (unlikely(swap_prefetch & PREFETCH_AGGRESSIVE))

struct swapped_root {
unsigned long busy; /* vm busy */
@@ -291,43 +301,17 @@ static void examine_free_limits(void)
}

/*
- * We want to be absolutely certain it's ok to start prefetching.
+ * Have some hysteresis between where page reclaiming and prefetching
+ * will occur to prevent ping-ponging between them.
*/
-static int prefetch_suitable(void)
+static void set_suitable_nodes(void)
{
- unsigned long limit;
struct zone *z;
- int node, ret = 0, test_pagestate = 0;
-
- /* Purposefully racy */
- if (test_bit(0, &swapped.busy)) {
- __clear_bit(0, &swapped.busy);
- goto out;
- }
-
- /*
- * get_page_state and above_background_load are expensive so we only
- * perform them every SWAP_CLUSTER_MAX prefetched_pages.
- * We test to see if we're above_background_load as disk activity
- * even at low priority can cause interrupt induced scheduling
- * latencies.
- */
- if (!(sp_stat.prefetched_pages % SWAP_CLUSTER_MAX)) {
- if (above_background_load())
- goto out;
- test_pagestate = 1;
- }

- clear_current_prefetch_free();
-
- /*
- * Have some hysteresis between where page reclaiming and prefetching
- * will occur to prevent ping-ponging between them.
- */
for_each_zone(z) {
struct node_stats *ns;
unsigned long free;
- int idx;
+ int node, idx;

if (!populated_zone(z))
continue;
@@ -349,6 +333,45 @@ static int prefetch_suitable(void)
}
ns->current_free += free;
}
+}
+
+/*
+ * We want to be absolutely certain it's ok to start prefetching.
+ */
+static int prefetch_suitable(void)
+{
+ unsigned long limit;
+ int node, ret = 0, test_pagestate = 0;
+
+ if (aggressive_prefetch) {
+ clear_current_prefetch_free();
+ set_suitable_nodes();
+ if (!nodes_empty(sp_stat.prefetch_nodes))
+ ret = 1;
+ goto out;
+ }
+
+ /* Purposefully racy */
+ if (test_bit(0, &swapped.busy)) {
+ __clear_bit(0, &swapped.busy);
+ goto out;
+ }
+
+ /*
+ * get_page_state and above_background_load are expensive so we only
+ * perform them every SWAP_CLUSTER_MAX prefetched_pages.
+ * We test to see if we're above_background_load as disk activity
+ * even at low priority can cause interrupt induced scheduling
+ * latencies.
+ */
+ if (!(sp_stat.prefetched_pages % SWAP_CLUSTER_MAX)) {
+ if (above_background_load())
+ goto out;
+ test_pagestate = 1;
+ }
+
+ clear_current_prefetch_free();
+ set_suitable_nodes();

/*
* We iterate over each node testing to see if it is suitable for
@@ -421,6 +444,17 @@ static inline struct swapped_entry *prev
struct swapped_entry, swapped_list);
}

+static unsigned long pages_prefetched(void)
+{
+ unsigned long pages = sp_stat.prefetched_pages;
+
+ if (pages) {
+ lru_add_drain();
+ sp_stat.prefetched_pages = 0;
+ }
+ return pages;
+}
+
/*
* trickle_swap is the main function that initiates the swap prefetching. It
* first checks to see if the busy flag is set, and does not prefetch if it
@@ -438,7 +472,7 @@ static enum trickle_return trickle_swap(
* If laptop_mode is enabled don't prefetch to avoid hard drives
* doing unnecessary spin-ups
*/
- if (!swap_prefetch || laptop_mode)
+ if (!swap_prefetch || (laptop_mode && !aggressive_prefetch))
return ret;

examine_free_limits();
@@ -474,6 +508,14 @@ static enum trickle_return trickle_swap(
* delay attempting further prefetching.
*/
spin_unlock(&swapped.lock);
+ if (aggressive_prefetch) {
+ /*
+ * If we're prefetching aggressively and
+ * making progress then don't give up.
+ */
+ if (pages_prefetched())
+ continue;
+ }
break;
}

@@ -491,14 +533,15 @@ static enum trickle_return trickle_swap(
entry = prev_swapped_entry(entry);
spin_unlock(&swapped.lock);

- if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY)
+ if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY &&
+ !aggressive_prefetch)
break;
}

- if (sp_stat.prefetched_pages) {
- lru_add_drain();
- sp_stat.prefetched_pages = 0;
- }
+ /* Return value of pages_prefetched irrelevant here */
+ pages_prefetched();
+ if (aggressive_prefetch)
+ swap_prefetch &= ~PREFETCH_AGGRESSIVE;
return ret;
}

Index: linux-2.6.16-rc6-mm1/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.16-rc6-mm1.orig/Documentation/sysctl/vm.txt 2006-03-13 10:04:51.000000000 +1100
+++ linux-2.6.16-rc6-mm1/Documentation/sysctl/vm.txt 2006-03-16 21:10:42.000000000 +1100
@@ -188,4 +188,13 @@ memory subsystem has been extremely idle
copying back pages from swap into the swapcache and keep a copy in swap. In
practice it can take many minutes before the vm is idle enough.

+This is value ORed together of
+1 = Normal background swap prefetching when load is light
+2 = Aggressively swap prefetch as much as possible
+
+When 2 is set, after the maximum amount possible has been prefetched, this bit
+is unset. ie Setting the value to 3 will prefetch aggressively then drop to 1.
+This is useful for doing aggressive prefetching for short periods in scripts
+such as after resuming from software suspend.
+
The default value is 1.
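The bitmask semantics the hunk above documents can be sketched with a small toy model (illustrative only, not kernel code; the constant names mirror the patch, and `after_aggressive_burst` is a hypothetical stand-in for the bit-clearing done at the end of `trickle_swap()`):

```python
# Toy model of the /proc/sys/vm/swap_prefetch bitmask from the patch above.
# PREFETCH_AGGRESSIVE is self-clearing: once the aggressive burst has
# prefetched everything it can, the kernel unsets that bit, leaving only
# whatever normal-mode setting was ORed in with it.
PREFETCH_NORMAL = 1 << 0      # background prefetch when load is light
PREFETCH_AGGRESSIVE = 1 << 1  # one-shot: prefetch as much as possible

def after_aggressive_burst(swap_prefetch):
    """Mimic trickle_swap() clearing the aggressive bit when it finishes."""
    if swap_prefetch & PREFETCH_AGGRESSIVE:
        swap_prefetch &= ~PREFETCH_AGGRESSIVE
    return swap_prefetch

# "echo 3 > /proc/sys/vm/swap_prefetch" in a resume script:
print(after_aggressive_burst(PREFETCH_NORMAL | PREFETCH_AGGRESSIVE))  # 1
# "echo 2": one aggressive burst, then prefetching is disabled entirely:
print(after_aggressive_burst(PREFETCH_AGGRESSIVE))  # 0
```

This is why writing 3 gives "prefetch aggressively now, then fall back to normal", while writing 2 gives "prefetch aggressively once, then stop" — the distinction Pavel asks about below.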

2006-03-16 10:46:32

by Pavel Machek

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you?

> On Thursday 16 March 2006 04:59, Pavel Machek wrote:
> > (It will probably suck. In such case, testing Con's patch would be
> > nice -- after trivial fix rafael pointed out).
>
> Ok here's a patch I've booted and tested with a modification to swap prefetch
> that others might find useful, not just swsusp.
>
> The tunable in /proc/sys/vm/swap_prefetch is now bitwise ORed:
> 1 = Normal background swap prefetching when load is light
> 2 = Aggressively swap prefetch as much as possible
>
> And once the "aggressive" bit is set it will prefetch as much as it can and
> then disable the aggressive bit. Thus if you set this value to 3 it will
> prefetch aggressively and then drop back to the default of 1. This makes it
> easy to simply set the aggressive flag once and forget about it. I've booted
> and tested this feature and it's working nicely. Where exactly you'd set this
> in your resume scripts I'm not sure. A rolled up patch against 2.6.16-rc6-mm1
> is here for simplicity:
> http://ck.kolivas.org/patches/swap-prefetch/2.6.16-rc6-mm1-swap_prefetch_suspend_test.patch
>
> and the incremental on top of the 4 patches pending for the next -mm is below.
>
> Comments and testers most welcome.

Looks okay, but... what happens if I set /proc/sys/vm/swap_prefetch to
"2"? Do nothing but do it agresively?

Maybe having 0 = off, 1 = normal, 2 = aggressive would be less error
prone for the users.

Pavel

--
Thanks, Sharp!

2006-03-16 10:48:41

by Con Kolivas

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you?

On Thursday 16 March 2006 21:46, Pavel Machek wrote:
> > On Thursday 16 March 2006 04:59, Pavel Machek wrote:
> > The tunable in /proc/sys/vm/swap_prefetch is now bitwise ORed:
> > 1 = Normal background swap prefetching when load is light
> > 2 = Aggressively swap prefetch as much as possible
> >
> > And once the "aggressive" bit is set it will prefetch as much as it can
> > and then disable the aggressive bit. Thus if you set this value to 3 it
> > will prefetch aggressively and then drop back to the default of 1. This
> > makes it easy to simply set the aggressive flag once and forget about it.
> > I've booted and tested this feature and it's working nicely. Where
> > exactly you'd set this in your resume scripts I'm not sure. A rolled up
> > patch against 2.6.16-rc6-mm1 is here for simplicity:
> > http://ck.kolivas.org/patches/swap-prefetch/2.6.16-rc6-mm1-swap_prefetch_
> >suspend_test.patch
> >
> > and the incremental on top of the 4 patches pending for the next -mm is
> > below.
> >
> > Comments and testers most welcome.
>
> Looks okay, but... what happens if I set /proc/sys/vm/swap_prefetch to
> "2"? Do nothing but do it agresively?
>
> Maybe having 0 = off, 1 = normal, 2 = aggressive would be less error
> prone for the users.

2 means aggressively prefetch as much as possible and then disable swap
prefetching from that point on. Too confusing?

Cheers,
Con

2006-03-16 10:50:55

by Pavel Machek

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you?


> On Thursday 16 March 2006 21:46, Pavel Machek wrote:
> > > On Thursday 16 March 2006 04:59, Pavel Machek wrote:
> > > The tunable in /proc/sys/vm/swap_prefetch is now bitwise ORed:
> > > 1 = Normal background swap prefetching when load is light
> > > 2 = Aggressively swap prefetch as much as possible
> > >
> > > And once the "aggressive" bit is set it will prefetch as much as it can
> > > and then disable the aggressive bit. Thus if you set this value to 3 it
> > > will prefetch aggressively and then drop back to the default of 1. This
> > > makes it easy to simply set the aggressive flag once and forget about it.
> > > I've booted and tested this feature and it's working nicely. Where
> > > exactly you'd set this in your resume scripts I'm not sure. A rolled up
> > > patch against 2.6.16-rc6-mm1 is here for simplicity:
> > > http://ck.kolivas.org/patches/swap-prefetch/2.6.16-rc6-mm1-swap_prefetch_
> > >suspend_test.patch
> > >
> > > and the incremental on top of the 4 patches pending for the next -mm is
> > > below.
> > >
> > > Comments and testers most welcome.
> >
> > Looks okay, but... what happens if I set /proc/sys/vm/swap_prefetch to
> > "2"? Do nothing but do it agresively?
> >
> > Maybe having 0 = off, 1 = normal, 2 = aggressive would be less error
> > prone for the users.
>
> 2 means aggressively prefetch as much as possible and then disable swap
> prefetching from that point on. Too confusing?

Ahha... oops, yes, clever; no, I guess keep it.
Pavel
--
Thanks, Sharp!

2006-03-16 10:55:18

by Andreas Mohr

[permalink] [raw]
Subject: Re: [ck] Re: does swsusp suck after resume for you?

Hi,

On Thu, Mar 16, 2006 at 11:46:30AM +0100, Pavel Machek wrote:
> Looks okay, but... what happens if I set /proc/sys/vm/swap_prefetch to
> "2"? Do nothing but do it agresively?
>
> Maybe having 0 = off, 1 = normal, 2 = aggressive would be less error
> prone for the users.

Hmm, that way you'd prevent further extension of the bitmask (in a
bitmask-only-tunable manner, that is).
BTW: do we want (to avoid) more tunables or more bitmask tunables with
thus more options? Good general question, methinks...
And I don't think that having value 2 set exclusively would hurt.

Andreas

2006-03-16 11:31:56

by Con Kolivas

[permalink] [raw]
Subject: Re: [ck] Re: does swsusp suck after resume for you?

On Thursday 16 March 2006 21:33, Con Kolivas wrote:
> On Thursday 16 March 2006 04:59, Pavel Machek wrote:
> > (It will probably suck. In such case, testing Con's patch would be
> > nice -- after trivial fix rafael pointed out).
>
> Ok here's a patch I've booted and tested with a modification to swap
> prefetch that others might find useful, not just swsusp.
>
> The tunable in /proc/sys/vm/swap_prefetch is now bitwise ORed:
> 1 = Normal background swap prefetching when load is light
> 2 = Aggressively swap prefetch as much as possible
>
> And once the "aggressive" bit is set it will prefetch as much as it can and
> then disable the aggressive bit. Thus if you set this value to 3 it will
> prefetch aggressively and then drop back to the default of 1. This makes
> it easy to simply set the aggressive flag once and forget about it. I've
> booted and tested this feature and it's working nicely. Where exactly you'd
> set this in your resume scripts I'm not sure. A rolled up patch against
> 2.6.16-rc6-mm1 is here for simplicity:
> http://ck.kolivas.org/patches/swap-prefetch/2.6.16-rc6-mm1-swap_prefetch_su
>spend_test.patch

Wrong rollup sorry! That was the old one.

This is the correct rollup:
http://ck.kolivas.org/patches/swap-prefetch/2.6.16-rc6-mm1-swap_prefetch_test.patch

Cheers,
Con

2006-03-16 16:14:04

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: swsusp_suspend continues?

On Thursday 16 March 2006 10:19, Pavel Machek wrote:
> On Čt 16-03-06 13:20:35, Con Kolivas wrote:
> > Hi Pavel
> >
> > I've been playing with hooking in the post resume swap prefetch code into
> > swsusp_suspend and just started noting this on 2.6.16-rc6-mm1:
> > During the _suspend_ to disk cycle on this machine the swsusp_suspend function
> > appears to continue beyond swsusp_arch_suspend as I get the same messages
> > that I would normally get during a resume cycle such as this:
> >
> > Suspending device platform
> > swsusp: Need to copy 14852 pages
> > Intel machine check architecture supported.
> > Intel machine check reporting enabled on CPU#0
> > and...
> > eth1: Coming out of suspend...
> > and so on
> >
> > but then it manages to write to disk and power down anyway. Is this correct?
>
> Yes. We need our hardware enabled for image write (disk would be
> enough), so we resume it (and we resume all of it, because that was
> easier to code).
>
> > If I put post_resume_swap_prefetch at the end of swsusp_suspend it hits that
> > function on both resume _and_ suspend cycles. Am I missing something?
>
> No. That's just the way it is.

But there is the in_suspend variable that you can use to avoid doing
unnecessary things during suspend (during resume in_suspend is 0).

> See
>
> /* Restore control flow magically appears here */
>
> and
>
> /* Code below is only ever reached in case of failure. Otherwise
> * execution continues at place where swsusp_arch_suspend was called
> */
> BUG_ON(!error);
>
> Yes, I agree it is confusing, and feel free to suggest comment cleanups.
>
> I'd suggest you hook at disk.c:pm_suspend_disk.

Or use in_suspend.

> Or just include that /sys interface, and trigger it from userspace
> just after resume. Actually I like that best. It is optional, it can
> be triggered from userspace, and you will not have to deal with
> suspend internals.
>
> (And it will be useful to uswsusp, too, that avoids big chunks of
> in-kernel suspend code).

Agreed.

Rafael
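
The `in_suspend` guard Rafael suggests can be sketched with a toy model (hypothetical function names — the real hook point was still being discussed; in the kernel, `in_suspend` is nonzero while the image is being written and 0 after a successful resume):

```python
# Toy model: swsusp_suspend() runs through the same late code on both the
# suspend and resume paths, so a post-resume hook placed there must check
# in_suspend to avoid firing twice.
def maybe_prefetch(in_suspend, trigger_prefetch):
    # only fire the post-resume hook on the resume path, where in_suspend == 0
    if not in_suspend:
        trigger_prefetch()

calls = []
maybe_prefetch(1, lambda: calls.append("prefetch"))  # suspend cycle: skipped
maybe_prefetch(0, lambda: calls.append("prefetch"))  # resume cycle: runs
print(calls)  # ['prefetch']
```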

2006-03-16 21:33:31

by Con Kolivas

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you?

> > > > The tunable in /proc/sys/vm/swap_prefetch is now bitwise ORed:
> > > > Thus if you set this value
> > > > to 3 it will prefetch aggressively and then drop back to the default
> > > > of 1. This makes it easy to simply set the aggressive flag once and
> > > > forget about it. I've booted and tested this feature and it's working
> > > > nicely. Where exactly you'd set this in your resume scripts I'm not
> > > > sure. A rolled up patch against 2.6.16-rc6-mm1 is here for
> > > > simplicity:

correct url:
http://ck.kolivas.org/patches/swap-prefetch/2.6.16-rc6-mm1-swap_prefetch_test.patch

> > 2 means aggressively prefetch as much as possible and then disable swap
> > prefetching from that point on. Too confusing?
>
> Ahha... oops, yes, clever; no, I guess keep it.

Ok the patch works fine for me and the feature is worthwhile in absolute terms
as well as for improving resume.

Pavel, while we're talking about improving behaviour after resume I had a look
at the mechanism used to free up ram before suspending and I can see scope
for some changes in the vm code that would improve the behaviour after
resuming. Is the mechanism used to free up ram going to continue being used
with uswsusp? If so, I'd like to have a go at improving the free up ram vm
code to make it behave nicer after resume. I have some ideas about how best
to free up ram differently from normal reclaim which would improve behaviour
post resume.

Cheers,
Con

2006-03-16 22:16:18

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you?

On Thursday 16 March 2006 22:33, Con Kolivas wrote:
> > > > > The tunable in /proc/sys/vm/swap_prefetch is now bitwise ORed:
> > > > > Thus if you set this value
> > > > > to 3 it will prefetch aggressively and then drop back to the default
> > > > > of 1. This makes it easy to simply set the aggressive flag once and
> > > > > forget about it. I've booted and tested this feature and it's working
> > > > > nicely. Where exactly you'd set this in your resume scripts I'm not
> > > > > sure. A rolled up patch against 2.6.16-rc6-mm1 is here for
> > > > > simplicity:
>
> correct url:
> http://ck.kolivas.org/patches/swap-prefetch/2.6.16-rc6-mm1-swap_prefetch_test.patch
>
> > > 2 means aggressively prefetch as much as possible and then disable swap
> > > prefetching from that point on. Too confusing?
> >
> > Ahha... oops, yes, clever; no, I guess keep it.
>
> Ok the patch works fine for me and the feature is worthwhile in absolute terms
> as well as for improving resume.
>
> Pavel, while we're talking about improving behaviour after resume I had a look
> at the mechanism used to free up ram before suspending and I can see scope
> for some changes in the vm code that would improve the behaviour after
> resuming. Is the mechanism used to free up ram going to continue being used
> with uswsusp?

Yes.

> If so, I'd like to have a go at improving the free up ram vm
> code to make it behave nicer after resume. I have some ideas about how best
> to free up ram differently from normal reclaim which would improve behaviour
> post resume.

That sounds really good to me. :-)

Greetings,
Rafael

2006-03-17 04:28:13

by Con Kolivas

[permalink] [raw]
Subject: [PATCH] swsusp reclaim tweaks was: Re: does swsusp suck after resume for you?

On Fri, 17 Mar 2006 09:15 am, Rafael J. Wysocki wrote:
> On Thursday 16 March 2006 22:33, Con Kolivas wrote:
> > If so, I'd like to have a go at improving the free up ram vm
> > code to make it behave nicer after resume. I have some ideas about how
> > best to free up ram differently from normal reclaim which would improve
> > behaviour post resume.
>
> That sounds really good to me. :-)

Ok here is a kind of directed memory reclaim for swsusp which is different to
ordinary memory reclaim. It reclaims memory in up to 4 passes with just
shrink_zone, without hooking into balance_pgdat thereby simplifying that
function as well.

The passes are as follows:
Reclaim from inactive_list only
Reclaim from active list but don't reclaim mapped
2nd pass of type 2
Reclaim mapped

and it replaces the current shrink_all_memory() function in situ, being passed
exactly the number of pages desired to be freed rather than doing it in
little chunks. This allows the memory reclaiming code to decide how
aggressively to delete pages on a per zone basis. This should leave slightly
more ram intact and should bias the pages stored in ram better based on
current active referenced->active not referenced->inactive lru. This should
improve the state of the vm immediately after resume.
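
The pass ordering described above can be sketched as a toy loop (illustrative only: `ToyZone` and its `shrink` method are hypothetical stand-ins for zones and `shrink_zone()`, and all LRU bookkeeping is elided):

```python
# Toy model of the reworked shrink_all_memory(): four passes, gentlest first,
# each walking priorities DEF_PRIORITY..0 over every zone, and stopping as
# soon as the requested number of pages has been freed.
DEF_PRIORITY = 12

class ToyZone:
    def __init__(self, inactive, active_unmapped, mapped):
        # pages each pass is allowed to touch, keyed by pass number
        self.pools = {3: inactive, 2: active_unmapped, 1: 0, 0: mapped}

    def shrink(self, priority, suspend_pass):
        # free a priority-scaled chunk, the way shrink_zone() scales its scan
        pool = self.pools[suspend_pass]
        take = min(pool, max(1, pool >> priority))
        self.pools[suspend_pass] -= take
        return take

def shrink_all_memory(nr_pages, zones):
    freed = 0
    # pass 3: inactive list only; passes 2 and 1: active but unmapped;
    # pass 0: reclaim mapped pages as a last resort
    for suspend_pass in (3, 2, 1, 0):
        for priority in range(DEF_PRIORITY, -1, -1):
            for zone in zones:
                freed += zone.shrink(priority, suspend_pass)
                if freed >= nr_pages:
                    return freed
    return freed

print(shrink_all_memory(50, [ToyZone(30, 30, 30)]) >= 50)  # True
```

The point of ordering the passes this way is that the cheap-to-restore pages (inactive, unmapped) are sacrificed first, so the pages most expensive to fault back in after resume are only reclaimed if the earlier passes could not free enough.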

Works for me. Please feel free to test and comment. Patch for 2.6.16-rc6-mm1.

Cheers,
Con
---
kernel/power/swsusp.c | 10 +--
mm/vmscan.c | 140 ++++++++++++++++++++++++++++----------------------
2 files changed, 83 insertions(+), 67 deletions(-)

Index: linux-2.6.16-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.16-rc6-mm1.orig/mm/vmscan.c 2006-03-14 13:53:19.000000000 +1100
+++ linux-2.6.16-rc6-mm1/mm/vmscan.c 2006-03-17 15:08:37.000000000 +1100
@@ -74,6 +74,15 @@ struct scan_control {
* In this context, it doesn't matter that we scan the
* whole list at once. */
int swap_cluster_max;
+
+ /*
+ * If we're doing suspend to disk, what pass is this.
+ * 3 = Reclaim from inactive_list only
+ * 2 = Reclaim from active list but don't reclaim mapped
+ * 1 = 2nd pass of type 2
+ * 0 = Reclaim mapped
+ */
+ int suspend;
};

#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -1345,7 +1354,7 @@ static unsigned long shrink_zone(int pri
*/
zone->nr_scan_active += (zone->nr_active >> priority) + 1;
nr_active = zone->nr_scan_active;
- if (nr_active >= sc->swap_cluster_max)
+ if (nr_active >= sc->swap_cluster_max && sc->suspend < 3)
zone->nr_scan_active = 0;
else
nr_active = 0;
@@ -1402,6 +1411,7 @@ static unsigned long shrink_zones(int pr
unsigned long nr_reclaimed = 0;
int i;

+ sc->suspend = 0;
for (i = 0; zones[i] != NULL; i++) {
struct zone *zone = zones[i];

@@ -1422,7 +1432,12 @@ static unsigned long shrink_zones(int pr
}
return nr_reclaimed;
}
-
+
+#define for_each_priority(priority) \
+ for (priority = DEF_PRIORITY; \
+ priority >= 0; \
+ priority--)
+
/*
* This is the main entry point to direct page reclaim.
*
@@ -1466,7 +1481,7 @@ unsigned long try_to_free_pages(struct z
lru_pages += zone->nr_active + zone->nr_inactive;
}

- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for_each_priority(priority) {
sc.nr_mapped = read_page_state(nr_mapped);
sc.nr_scanned = 0;
if (!priority)
@@ -1516,10 +1531,6 @@ out:
* For kswapd, balance_pgdat() will work across all this node's zones until
* they are all at pages_high.
*
- * If `nr_pages' is non-zero then it is the number of pages which are to be
- * reclaimed, regardless of the zone occupancies. This is a software suspend
- * special.
- *
* Returns the number of pages which were actually freed.
*
* There is special handling here for zones which are full of pinned pages.
@@ -1537,10 +1548,8 @@ out:
* the page allocator fallback scheme to ensure that aging of pages is balanced
* across the zones.
*/
-static unsigned long balance_pgdat(pg_data_t *pgdat, unsigned long nr_pages,
- int order)
+static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
{
- unsigned long to_free = nr_pages;
int all_zones_ok;
int priority;
int i;
@@ -1550,7 +1559,8 @@ static unsigned long balance_pgdat(pg_da
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.may_swap = 1,
- .swap_cluster_max = nr_pages ? nr_pages : SWAP_CLUSTER_MAX,
+ .swap_cluster_max = SWAP_CLUSTER_MAX,
+ .suspend = 0,
};

loop_again:
@@ -1567,7 +1577,7 @@ loop_again:
zone->temp_priority = DEF_PRIORITY;
}

- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for_each_priority(priority) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;

@@ -1577,31 +1587,27 @@ loop_again:

all_zones_ok = 1;

- if (nr_pages == 0) {
- /*
- * Scan in the highmem->dma direction for the highest
- * zone which needs scanning
- */
- for (i = pgdat->nr_zones - 1; i >= 0; i--) {
- struct zone *zone = pgdat->node_zones + i;
+ /*
+ * Scan in the highmem->dma direction for the highest
+ * zone which needs scanning
+ */
+ for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+ struct zone *zone = pgdat->node_zones + i;

- if (!populated_zone(zone))
- continue;
+ if (!populated_zone(zone))
+ continue;

- if (zone->all_unreclaimable &&
- priority != DEF_PRIORITY)
- continue;
+ if (zone->all_unreclaimable &&
+ priority != DEF_PRIORITY)
+ continue;

- if (!zone_watermark_ok(zone, order,
- zone->pages_high, 0, 0)) {
- end_zone = i;
- goto scan;
- }
+ if (!zone_watermark_ok(zone, order,
+ zone->pages_high, 0, 0)) {
+ end_zone = i;
+ goto scan;
}
- goto out;
- } else {
- end_zone = pgdat->nr_zones - 1;
}
+ goto out;
scan:
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
@@ -1628,11 +1634,9 @@ scan:
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;

- if (nr_pages == 0) { /* Not software suspend */
- if (!zone_watermark_ok(zone, order,
- zone->pages_high, end_zone, 0))
- all_zones_ok = 0;
- }
+ if (!zone_watermark_ok(zone, order,
+ zone->pages_high, end_zone, 0))
+ all_zones_ok = 0;
zone->temp_priority = priority;
if (zone->prev_priority > priority)
zone->prev_priority = priority;
@@ -1657,8 +1661,6 @@ scan:
total_scanned > nr_reclaimed + nr_reclaimed / 2)
sc.may_writepage = 1;
}
- if (nr_pages && to_free > nr_reclaimed)
- continue; /* swsusp: need to do more work */
if (all_zones_ok)
break; /* kswapd: all done */
/*
@@ -1674,7 +1676,7 @@ scan:
* matches the direct reclaim path behaviour in terms of impact
* on zone->*_priority.
*/
- if ((nr_reclaimed >= SWAP_CLUSTER_MAX) && !nr_pages)
+ if ((nr_reclaimed >= SWAP_CLUSTER_MAX))
break;
}
out:
@@ -1756,7 +1758,7 @@ static int kswapd(void *p)
}
finish_wait(&pgdat->kswapd_wait, &wait);

- balance_pgdat(pgdat, 0, order);
+ balance_pgdat(pgdat, order);
}
return 0;
}
@@ -1790,32 +1792,49 @@ void wakeup_kswapd(struct zone *zone, in
*/
unsigned long shrink_all_memory(unsigned long nr_pages)
{
- pg_data_t *pgdat;
unsigned long nr_to_free = nr_pages;
unsigned long ret = 0;
- unsigned retry = 2;
- struct reclaim_state reclaim_state = {
- .reclaimed_slab = 0,
+ struct scan_control sc = {
+ .gfp_mask = GFP_KERNEL,
+ .may_swap = 1,
+ .swap_cluster_max = nr_pages,
+ .suspend = 3,
};

delay_swap_prefetch();

- current->reclaim_state = &reclaim_state;
-repeat:
- for_each_online_pgdat(pgdat) {
- unsigned long freed;
+ do {
+ int priority;

- freed = balance_pgdat(pgdat, nr_to_free, 0);
- ret += freed;
- nr_to_free -= freed;
- if ((long)nr_to_free <= 0)
- break;
- }
- if (retry-- && ret < nr_pages) {
- blk_congestion_wait(WRITE, HZ/5);
- goto repeat;
- }
- current->reclaim_state = NULL;
+ for_each_priority(priority) {
+ struct zone *zone;
+
+ for_each_zone(zone) {
+ unsigned long freed;
+
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone->all_unreclaimable &&
+ priority != DEF_PRIORITY)
+ continue;
+
+ /*
+ * shrink_active_list needs this to reclaim
+ * mapped pages
+ */
+ if (!sc.suspend)
+ zone->prev_priority = priority;
+ freed = shrink_zone(priority, zone, &sc);
+ ret += freed;
+ nr_to_free -= freed;
+ if ((long)nr_to_free <= 0)
+ break;
+ }
+ }
+ if (ret < nr_pages)
+ blk_congestion_wait(WRITE, HZ/5);
+ } while (--sc.suspend >= 0);
return ret;
}
#endif
@@ -1913,6 +1932,7 @@ static int __zone_reclaim(struct zone *z
.swap_cluster_max = max_t(unsigned long, nr_pages,
SWAP_CLUSTER_MAX),
.gfp_mask = gfp_mask,
+ .suspend = 0,
};

disable_swap_token();
Index: linux-2.6.16-rc6-mm1/kernel/power/swsusp.c
===================================================================
--- linux-2.6.16-rc6-mm1.orig/kernel/power/swsusp.c 2006-03-17 12:38:13.000000000 +1100
+++ linux-2.6.16-rc6-mm1/kernel/power/swsusp.c 2006-03-17 15:11:06.000000000 +1100
@@ -173,9 +173,6 @@ void free_all_swap_pages(int swap, struc
* Notice: all userland should be stopped before it is called, or
* livelock is possible.
*/
-
-#define SHRINK_BITE 10000
-
int swsusp_shrink_memory(void)
{
long size, tmp;
@@ -195,13 +192,12 @@ int swsusp_shrink_memory(void)
if (!is_highmem(zone))
tmp -= zone->free_pages;
if (tmp > 0) {
- tmp = shrink_all_memory(SHRINK_BITE);
+ tmp = shrink_all_memory(tmp);
if (!tmp)
return -ENOMEM;
pages += tmp;
- } else if (size > image_size / PAGE_SIZE) {
- tmp = shrink_all_memory(SHRINK_BITE);
- pages += tmp;
+ if (pages > size)
+ break;
}
printk("\b%c", p[i++%4]);
} while (tmp > 0);

2006-03-17 04:46:00

by Con Kolivas

[permalink] [raw]
Subject: Re: [ck] [PATCH] swsusp reclaim tweaks was: Re: does swsusp suck after resume for you?

On Fri, 17 Mar 2006 03:28 pm, Con Kolivas wrote:
> Ok here is a kind of directed memory reclaim for swsusp which is different
> to ordinary memory reclaim. It reclaims memory in up to 4 passes with just
> shrink_zone, without hooking into balance_pgdat thereby simplifying that
> function as well.
>
> The passes are as follows:
> Reclaim from inactive_list only
> Reclaim from active list but don't reclaim mapped
> 2nd pass of type 2
> Reclaim mapped

It may need to be made more aggressive with another reclaim mapped pass to
ensure it frees enough memory. That would be trivial to add.

Cheers,
Con

2006-03-17 05:23:44

by Mark Lord

[permalink] [raw]
Subject: 2.6.16-rc6: swsusp cannot find swap partition

Pavel,

I have two nearly identical Kubuntu-5.10 notebooks here,
both of which work perfectly with suspend-to-RAM and
just about everything else.

Both of them also did swsusp until today.
Now one of them fails, but the other still works.
The one that failed was just upgraded from a 2.6.12-based kernel
to the stock 2.6.16-rc6-git7, same kernel as the one that works.

I instrumented the swsusp code to try and see why it fails,
and here (attached) is the result. It's skipping over the swap
partition for some reason.

Why?

Cheers


Attachments:
swsusp.log (4.61 kB)

2006-03-17 05:34:49

by Mark Lord

[permalink] [raw]
Subject: Re: 2.6.16-rc6: swsusp cannot find swap partition

Mark Lord wrote:
> Pavel,
>
> I have two nearly identical Kubuntu-5.10 notebooks here,
> both of which work perfectly with suspend-to-RAM and
> just about everything else.
>
> Both of them also did swsusp until today.
> Now one of them fails, but the other still works.
> The one that failed was just upgraded from a 2.6.12-based kernel
> to the stock 2.6.16-rc6-git7, same kernel as the one that works.
>
> I instrumented the swsusp code to try and see why it fails,
> and here (attached) is the result. It's skipping over the swap
> partition for some reason.
>
> Why?

Ahh.. found it. Nevermind.

The swap partitions differ between the two machines,
but I had used (ages ago..) CONFIG_PM_STD_PARTITION="/dev/sda6"
in the kernel config on the good machine, and that's not quite
right for the other machine.

Cheers

2006-03-17 06:17:20

by Con Kolivas

[permalink] [raw]
Subject: [PATCH] swsusp reclaim tweaks 2

On Fri, 17 Mar 2006 03:46 pm, Con Kolivas wrote:
> On Fri, 17 Mar 2006 03:28 pm, Con Kolivas wrote:
> > Ok here is a kind of directed memory reclaim for swsusp which is
> > different to ordinary memory reclaim. It reclaims memory in up to 4
> > passes with just shrink_zone, without hooking into balance_pgdat thereby
> > simplifying that function as well.

> It may need to be made more aggressive with another reclaim mapped pass to
> ensure it frees enough memory. That would be trivial to add.

Indeed this was true after a few more suspend resume cycles. Here is a rework
which survived many suspend resume cycles. This worked nicely in combination
with the aggressive swap prefetch tunable being set on resume.

Ok, now please test :) Patch for 2.6.16-rc6-mm1

Cheers,
Con
---
kernel/power/swsusp.c | 10 +--
mm/vmscan.c | 145 ++++++++++++++++++++++++++++----------------------
2 files changed, 87 insertions(+), 68 deletions(-)

Index: linux-2.6.16-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.16-rc6-mm1.orig/mm/vmscan.c 2006-03-17 16:44:47.000000000 +1100
+++ linux-2.6.16-rc6-mm1/mm/vmscan.c 2006-03-17 16:57:33.000000000 +1100
@@ -74,6 +74,15 @@ struct scan_control {
* In this context, it doesn't matter that we scan the
* whole list at once. */
int swap_cluster_max;
+
+ /*
+ * If we're doing suspend to disk, what pass is this.
+ * 3 = Reclaim from inactive_list only
+ * 2 = Reclaim from active list but don't reclaim mapped
+ * 1 = 2nd pass of type 2
+ * 0 = Reclaim mapped
+ */
+ int suspend;
};

#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -1345,7 +1354,7 @@ static unsigned long shrink_zone(int pri
*/
zone->nr_scan_active += (zone->nr_active >> priority) + 1;
nr_active = zone->nr_scan_active;
- if (nr_active >= sc->swap_cluster_max)
+ if (nr_active >= sc->swap_cluster_max && sc->suspend < 3)
zone->nr_scan_active = 0;
else
nr_active = 0;
@@ -1402,6 +1411,7 @@ static unsigned long shrink_zones(int pr
unsigned long nr_reclaimed = 0;
int i;

+ sc->suspend = 0;
for (i = 0; zones[i] != NULL; i++) {
struct zone *zone = zones[i];

@@ -1422,7 +1432,12 @@ static unsigned long shrink_zones(int pr
}
return nr_reclaimed;
}
-
+
+#define for_each_priority(priority) \
+ for (priority = DEF_PRIORITY; \
+ priority >= 0; \
+ priority--)
+
/*
* This is the main entry point to direct page reclaim.
*
@@ -1466,7 +1481,7 @@ unsigned long try_to_free_pages(struct z
lru_pages += zone->nr_active + zone->nr_inactive;
}

- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for_each_priority(priority) {
sc.nr_mapped = read_page_state(nr_mapped);
sc.nr_scanned = 0;
if (!priority)
@@ -1516,10 +1531,6 @@ out:
* For kswapd, balance_pgdat() will work across all this node's zones until
* they are all at pages_high.
*
- * If `nr_pages' is non-zero then it is the number of pages which are to be
- * reclaimed, regardless of the zone occupancies. This is a software suspend
- * special.
- *
* Returns the number of pages which were actually freed.
*
* There is special handling here for zones which are full of pinned pages.
@@ -1537,10 +1548,8 @@ out:
* the page allocator fallback scheme to ensure that aging of pages is balanced
* across the zones.
*/
-static unsigned long balance_pgdat(pg_data_t *pgdat, unsigned long nr_pages,
- int order)
+static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
{
- unsigned long to_free = nr_pages;
int all_zones_ok;
int priority;
int i;
@@ -1550,7 +1559,8 @@ static unsigned long balance_pgdat(pg_da
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.may_swap = 1,
- .swap_cluster_max = nr_pages ? nr_pages : SWAP_CLUSTER_MAX,
+ .swap_cluster_max = SWAP_CLUSTER_MAX,
+ .suspend = 0,
};

loop_again:
@@ -1567,7 +1577,7 @@ loop_again:
zone->temp_priority = DEF_PRIORITY;
}

- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for_each_priority(priority) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;

@@ -1577,31 +1587,27 @@ loop_again:

all_zones_ok = 1;

- if (nr_pages == 0) {
- /*
- * Scan in the highmem->dma direction for the highest
- * zone which needs scanning
- */
- for (i = pgdat->nr_zones - 1; i >= 0; i--) {
- struct zone *zone = pgdat->node_zones + i;
+ /*
+ * Scan in the highmem->dma direction for the highest
+ * zone which needs scanning
+ */
+ for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+ struct zone *zone = pgdat->node_zones + i;

- if (!populated_zone(zone))
- continue;
+ if (!populated_zone(zone))
+ continue;

- if (zone->all_unreclaimable &&
- priority != DEF_PRIORITY)
- continue;
+ if (zone->all_unreclaimable &&
+ priority != DEF_PRIORITY)
+ continue;

- if (!zone_watermark_ok(zone, order,
- zone->pages_high, 0, 0)) {
- end_zone = i;
- goto scan;
- }
+ if (!zone_watermark_ok(zone, order,
+ zone->pages_high, 0, 0)) {
+ end_zone = i;
+ goto scan;
}
- goto out;
- } else {
- end_zone = pgdat->nr_zones - 1;
}
+ goto out;
scan:
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
@@ -1628,11 +1634,9 @@ scan:
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;

- if (nr_pages == 0) { /* Not software suspend */
- if (!zone_watermark_ok(zone, order,
- zone->pages_high, end_zone, 0))
- all_zones_ok = 0;
- }
+ if (!zone_watermark_ok(zone, order,
+ zone->pages_high, end_zone, 0))
+ all_zones_ok = 0;
zone->temp_priority = priority;
if (zone->prev_priority > priority)
zone->prev_priority = priority;
@@ -1657,8 +1661,6 @@ scan:
total_scanned > nr_reclaimed + nr_reclaimed / 2)
sc.may_writepage = 1;
}
- if (nr_pages && to_free > nr_reclaimed)
- continue; /* swsusp: need to do more work */
if (all_zones_ok)
break; /* kswapd: all done */
/*
@@ -1674,7 +1676,7 @@ scan:
* matches the direct reclaim path behaviour in terms of impact
* on zone->*_priority.
*/
- if ((nr_reclaimed >= SWAP_CLUSTER_MAX) && !nr_pages)
+ if ((nr_reclaimed >= SWAP_CLUSTER_MAX))
break;
}
out:
@@ -1756,7 +1758,7 @@ static int kswapd(void *p)
}
finish_wait(&pgdat->kswapd_wait, &wait);

- balance_pgdat(pgdat, 0, order);
+ balance_pgdat(pgdat, order);
}
return 0;
}
@@ -1790,32 +1792,52 @@ void wakeup_kswapd(struct zone *zone, in
*/
unsigned long shrink_all_memory(unsigned long nr_pages)
{
- pg_data_t *pgdat;
- unsigned long nr_to_free = nr_pages;
unsigned long ret = 0;
- unsigned retry = 2;
- struct reclaim_state reclaim_state = {
- .reclaimed_slab = 0,
+ struct scan_control sc = {
+ .gfp_mask = GFP_KERNEL,
+ .may_swap = 1,
+ .swap_cluster_max = nr_pages,
+ .suspend = 3,
+ .may_writepage = 1,
};

delay_swap_prefetch();

- current->reclaim_state = &reclaim_state;
-repeat:
- for_each_online_pgdat(pgdat) {
- unsigned long freed;
+ do {
+ int priority;

- freed = balance_pgdat(pgdat, nr_to_free, 0);
- ret += freed;
- nr_to_free -= freed;
- if ((long)nr_to_free <= 0)
- break;
- }
- if (retry-- && ret < nr_pages) {
- blk_congestion_wait(WRITE, HZ/5);
- goto repeat;
- }
- current->reclaim_state = NULL;
+ for_each_priority(priority) {
+ struct zone *zone;
+ unsigned long lru_pages = 0;
+
+ for_each_zone(zone) {
+ unsigned long freed;
+
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone->all_unreclaimable &&
+ priority != DEF_PRIORITY)
+ continue;
+
+ lru_pages += zone->nr_active +
+ zone->nr_inactive;
+ /*
+ * shrink_active_list needs this to reclaim
+ * mapped pages
+ */
+ if (!sc.suspend)
+ zone->prev_priority = 0;
+ freed = shrink_zone(priority, zone, &sc);
+ ret += freed;
+ if (ret > nr_pages)
+ goto out;
+ }
+ shrink_slab(0, sc.gfp_mask, lru_pages);
+ blk_congestion_wait(WRITE, HZ / 5);
+ }
+ } while (--sc.suspend >= 0);
+out:
return ret;
}
#endif
@@ -1913,6 +1935,7 @@ static int __zone_reclaim(struct zone *z
.swap_cluster_max = max_t(unsigned long, nr_pages,
SWAP_CLUSTER_MAX),
.gfp_mask = gfp_mask,
+ .suspend = 0,
};

disable_swap_token();
Index: linux-2.6.16-rc6-mm1/kernel/power/swsusp.c
===================================================================
--- linux-2.6.16-rc6-mm1.orig/kernel/power/swsusp.c 2006-03-17 16:44:47.000000000 +1100
+++ linux-2.6.16-rc6-mm1/kernel/power/swsusp.c 2006-03-17 16:45:11.000000000 +1100
@@ -173,9 +173,6 @@ void free_all_swap_pages(int swap, struc
* Notice: all userland should be stopped before it is called, or
* livelock is possible.
*/
-
-#define SHRINK_BITE 10000
-
int swsusp_shrink_memory(void)
{
long size, tmp;
@@ -195,13 +192,12 @@ int swsusp_shrink_memory(void)
if (!is_highmem(zone))
tmp -= zone->free_pages;
if (tmp > 0) {
- tmp = shrink_all_memory(SHRINK_BITE);
+ tmp = shrink_all_memory(tmp);
if (!tmp)
return -ENOMEM;
pages += tmp;
- } else if (size > image_size / PAGE_SIZE) {
- tmp = shrink_all_memory(SHRINK_BITE);
- pages += tmp;
+ if (pages > size)
+ break;
}
printk("\b%c", p[i++%4]);
} while (tmp > 0);

2006-03-17 17:32:57

by Rafael J. Wysocki

Subject: Re: [PATCH] swsusp reclaim tweaks 2

On Friday 17 March 2006 07:17, Con Kolivas wrote:
> On Fri, 17 Mar 2006 03:46 pm, Con Kolivas wrote:
> > On Fri, 17 Mar 2006 03:28 pm, Con Kolivas wrote:
> > > Ok here is a kind of directed memory reclaim for swsusp which is
> > > different to ordinary memory reclaim. It reclaims memory in up to 4
> > > passes with just shrink_zone, without hooking into balance_pgdat thereby
> > > simplifying that function as well.
>
> > It may need to be made more aggressive with another reclaim mapped pass to
> > ensure it frees enough memory. That would be trivial to add.
>
> Indeed this was true after a few more suspend resume cycles. Here is a rework
> which survived many suspend resume cycles. This worked nicely in combination
> with the aggressive swap prefetch tunable being set on resume.
>
> Ok, now please test :) Patch for 2.6.16-rc6-mm1
>
> Cheers,
> Con
> ---
}-- snip --{

I'm not an mm expert, but I like that.

> Index: linux-2.6.16-rc6-mm1/kernel/power/swsusp.c
> ===================================================================
> --- linux-2.6.16-rc6-mm1.orig/kernel/power/swsusp.c 2006-03-17 16:44:47.000000000 +1100
> +++ linux-2.6.16-rc6-mm1/kernel/power/swsusp.c 2006-03-17 16:45:11.000000000 +1100
> @@ -173,9 +173,6 @@ void free_all_swap_pages(int swap, struc
> * Notice: all userland should be stopped before it is called, or
> * livelock is possible.
> */
> -
> -#define SHRINK_BITE 10000
> -
> int swsusp_shrink_memory(void)
> {
> long size, tmp;
> @@ -195,13 +192,12 @@ int swsusp_shrink_memory(void)
> if (!is_highmem(zone))
> tmp -= zone->free_pages;
> if (tmp > 0) {
> - tmp = shrink_all_memory(SHRINK_BITE);
> + tmp = shrink_all_memory(tmp);
> if (!tmp)
> return -ENOMEM;
> pages += tmp;
> - } else if (size > image_size / PAGE_SIZE) {

If you drop this, swsusp can free less memory than you want (image_size is
ignored). Generally we want it to free memory until
size <= image_size / PAGE_SIZE.

Appended is a fix on top of your patch (untested).

> - tmp = shrink_all_memory(SHRINK_BITE);
> - pages += tmp;
> + if (pages > size)
> + break;
> }
> printk("\b%c", p[i++%4]);
> } while (tmp > 0);

kernel/power/swsusp.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.16-rc6-mm1/kernel/power/swsusp.c
===================================================================
--- linux-2.6.16-rc6-mm1.orig/kernel/power/swsusp.c
+++ linux-2.6.16-rc6-mm1/kernel/power/swsusp.c
@@ -191,13 +191,13 @@ int swsusp_shrink_memory(void)
for_each_zone (zone)
if (!is_highmem(zone))
tmp -= zone->free_pages;
+ if (tmp <= 0)
+ tmp = size - image_size / PAGE_SIZE;
if (tmp > 0) {
tmp = shrink_all_memory(tmp);
if (!tmp)
return -ENOMEM;
pages += tmp;
- if (pages > size)
- break;
}
printk("\b%c", p[i++%4]);
} while (tmp > 0);

2006-03-18 04:15:26

by Con Kolivas

Subject: [PATCH][RFC] mm: swsusp shrink_all_memory tweaks

This patch is a rewrite of the shrink_all_memory function used by swsusp
prior to suspending to disk.

The special hooks into balance_pgdat for shrink_all_memory have been removed
thus simplifying that function significantly.

Some code will now be compiled out in the !CONFIG_PM case.

shrink_all_memory now uses shrink_zone and shrink_slab directly with an extra
entry in the struct scan_control suspend_pass. This is used to alter what
lists will be shrunk by shrink_zone on successive passes. The aim of this is
to alter the reclaim logic to choose the best pages to keep on resume, to
free the minimum amount of memory required to suspend and free the memory
faster.

Signed-off-by: Con Kolivas <[email protected]>

include/linux/swap.h | 45 +++++++++++++
kernel/power/swsusp.c | 10 ---
mm/vmscan.c | 161 +++++++++++++++++++++++---------------------------
3 files changed, 125 insertions(+), 91 deletions(-)

Index: linux-2.6.16-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.16-rc6-mm1.orig/mm/vmscan.c 2006-03-18 13:29:38.000000000 +1100
+++ linux-2.6.16-rc6-mm1/mm/vmscan.c 2006-03-18 13:58:51.000000000 +1100
@@ -55,27 +55,6 @@ typedef enum {
PAGE_CLEAN,
} pageout_t;

-struct scan_control {
- /* Incremented by the number of inactive pages that were scanned */
- unsigned long nr_scanned;
-
- unsigned long nr_mapped; /* From page_state */
-
- /* This context's GFP mask */
- gfp_t gfp_mask;
-
- int may_writepage;
-
- /* Can pages be swapped as part of reclaim? */
- int may_swap;
-
- /* This context's SWAP_CLUSTER_MAX. If freeing memory for
- * suspend, we effectively ignore SWAP_CLUSTER_MAX.
- * In this context, it doesn't matter that we scan the
- * whole list at once. */
- int swap_cluster_max;
-};
-
#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))

#ifdef ARCH_HAS_PREFETCH
@@ -1327,7 +1306,8 @@ static void shrink_active_list(unsigned
}

/*
- * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
+ * This is a basic per-zone page freer. Used by kswapd, direct reclaim and
+ * the swsusp specific shrink_all_memory functions.
*/
static unsigned long shrink_zone(int priority, struct zone *zone,
struct scan_control *sc)
@@ -1345,7 +1325,7 @@ static unsigned long shrink_zone(int pri
*/
zone->nr_scan_active += (zone->nr_active >> priority) + 1;
nr_active = zone->nr_scan_active;
- if (nr_active >= sc->swap_cluster_max)
+ if (nr_active >= sc->swap_cluster_max && suspend_scan_active(sc))
zone->nr_scan_active = 0;
else
nr_active = 0;
@@ -1422,7 +1402,12 @@ static unsigned long shrink_zones(int pr
}
return nr_reclaimed;
}
-
+
+#define for_each_priority_reverse(priority) \
+ for (priority = DEF_PRIORITY; \
+ priority >= 0; \
+ priority--)
+
/*
* This is the main entry point to direct page reclaim.
*
@@ -1466,7 +1451,7 @@ unsigned long try_to_free_pages(struct z
lru_pages += zone->nr_active + zone->nr_inactive;
}

- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for_each_priority_reverse(priority) {
sc.nr_mapped = read_page_state(nr_mapped);
sc.nr_scanned = 0;
if (!priority)
@@ -1516,10 +1501,6 @@ out:
* For kswapd, balance_pgdat() will work across all this node's zones until
* they are all at pages_high.
*
- * If `nr_pages' is non-zero then it is the number of pages which are to be
- * reclaimed, regardless of the zone occupancies. This is a software suspend
- * special.
- *
* Returns the number of pages which were actually freed.
*
* There is special handling here for zones which are full of pinned pages.
@@ -1537,10 +1518,8 @@ out:
* the page allocator fallback scheme to ensure that aging of pages is balanced
* across the zones.
*/
-static unsigned long balance_pgdat(pg_data_t *pgdat, unsigned long nr_pages,
- int order)
+static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
{
- unsigned long to_free = nr_pages;
int all_zones_ok;
int priority;
int i;
@@ -1550,7 +1529,7 @@ static unsigned long balance_pgdat(pg_da
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.may_swap = 1,
- .swap_cluster_max = nr_pages ? nr_pages : SWAP_CLUSTER_MAX,
+ .swap_cluster_max = SWAP_CLUSTER_MAX,
};

loop_again:
@@ -1567,7 +1546,7 @@ loop_again:
zone->temp_priority = DEF_PRIORITY;
}

- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for_each_priority_reverse(priority) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;

@@ -1577,31 +1556,27 @@ loop_again:

all_zones_ok = 1;

- if (nr_pages == 0) {
- /*
- * Scan in the highmem->dma direction for the highest
- * zone which needs scanning
- */
- for (i = pgdat->nr_zones - 1; i >= 0; i--) {
- struct zone *zone = pgdat->node_zones + i;
+ /*
+ * Scan in the highmem->dma direction for the highest
+ * zone which needs scanning
+ */
+ for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+ struct zone *zone = pgdat->node_zones + i;

- if (!populated_zone(zone))
- continue;
+ if (!populated_zone(zone))
+ continue;

- if (zone->all_unreclaimable &&
- priority != DEF_PRIORITY)
- continue;
+ if (zone->all_unreclaimable &&
+ priority != DEF_PRIORITY)
+ continue;

- if (!zone_watermark_ok(zone, order,
- zone->pages_high, 0, 0)) {
- end_zone = i;
- goto scan;
- }
+ if (!zone_watermark_ok(zone, order, zone->pages_high,
+ 0, 0)) {
+ end_zone = i;
+ goto scan;
}
- goto out;
- } else {
- end_zone = pgdat->nr_zones - 1;
}
+ goto out;
scan:
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
@@ -1628,11 +1603,9 @@ scan:
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;

- if (nr_pages == 0) { /* Not software suspend */
- if (!zone_watermark_ok(zone, order,
- zone->pages_high, end_zone, 0))
- all_zones_ok = 0;
- }
+ if (!zone_watermark_ok(zone, order, zone->pages_high,
+ end_zone, 0))
+ all_zones_ok = 0;
zone->temp_priority = priority;
if (zone->prev_priority > priority)
zone->prev_priority = priority;
@@ -1657,8 +1630,6 @@ scan:
total_scanned > nr_reclaimed + nr_reclaimed / 2)
sc.may_writepage = 1;
}
- if (nr_pages && to_free > nr_reclaimed)
- continue; /* swsusp: need to do more work */
if (all_zones_ok)
break; /* kswapd: all done */
/*
@@ -1674,7 +1645,7 @@ scan:
* matches the direct reclaim path behaviour in terms of impact
* on zone->*_priority.
*/
- if ((nr_reclaimed >= SWAP_CLUSTER_MAX) && !nr_pages)
+ if (nr_reclaimed >= SWAP_CLUSTER_MAX)
break;
}
out:
@@ -1756,7 +1727,7 @@ static int kswapd(void *p)
}
finish_wait(&pgdat->kswapd_wait, &wait);

- balance_pgdat(pgdat, 0, order);
+ balance_pgdat(pgdat, order);
}
return 0;
}
@@ -1786,36 +1757,58 @@ void wakeup_kswapd(struct zone *zone, in
#ifdef CONFIG_PM
/*
* Try to free `nr_pages' of memory, system-wide. Returns the number of freed
- * pages.
+ * pages. It does this via shrink_zone in passes. Rather than trying to age
+ * LRUs the aim is to preserve the overall LRU order by reclaiming
+ * preferentially inactive > active > active referenced > active mapped
*/
unsigned long shrink_all_memory(unsigned long nr_pages)
{
- pg_data_t *pgdat;
- unsigned long nr_to_free = nr_pages;
unsigned long ret = 0;
- unsigned retry = 2;
- struct reclaim_state reclaim_state = {
- .reclaimed_slab = 0,
+ struct scan_control sc = {
+ .gfp_mask = GFP_KERNEL,
+ .may_swap = 1,
+ .swap_cluster_max = nr_pages,
+ .suspend_pass = 3,
+ .may_writepage = 1,
};

delay_swap_prefetch();

- current->reclaim_state = &reclaim_state;
-repeat:
- for_each_online_pgdat(pgdat) {
- unsigned long freed;
+ do {
+ int priority;

- freed = balance_pgdat(pgdat, nr_to_free, 0);
- ret += freed;
- nr_to_free -= freed;
- if ((long)nr_to_free <= 0)
- break;
- }
- if (retry-- && ret < nr_pages) {
- blk_congestion_wait(WRITE, HZ/5);
- goto repeat;
- }
- current->reclaim_state = NULL;
+ for_each_priority_reverse(priority) {
+ struct zone *zone;
+ unsigned long lru_pages = 0;
+
+ for_each_zone(zone) {
+ unsigned long freed;
+
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone->all_unreclaimable &&
+ priority != DEF_PRIORITY)
+ continue;
+
+ lru_pages += zone->nr_active +
+ zone->nr_inactive;
+ /*
+ * shrink_active_list needs this to reclaim
+ * mapped pages
+ */
+ if (!sc.suspend_pass)
+ zone->prev_priority = 0;
+ freed = shrink_zone(priority, zone, &sc);
+ ret += freed;
+ if (ret > nr_pages)
+ goto out;
+ }
+ shrink_slab(0, sc.gfp_mask, lru_pages);
+ }
+ blk_congestion_wait(WRITE, HZ / 5);
+ } while (--sc.suspend_pass >= 0);
+out:
return ret;
}
#endif
Index: linux-2.6.16-rc6-mm1/kernel/power/swsusp.c
===================================================================
--- linux-2.6.16-rc6-mm1.orig/kernel/power/swsusp.c 2006-03-18 13:29:38.000000000 +1100
+++ linux-2.6.16-rc6-mm1/kernel/power/swsusp.c 2006-03-18 13:30:52.000000000 +1100
@@ -173,9 +173,6 @@ void free_all_swap_pages(int swap, struc
* Notice: all userland should be stopped before it is called, or
* livelock is possible.
*/
-
-#define SHRINK_BITE 10000
-
int swsusp_shrink_memory(void)
{
long size, tmp;
@@ -194,14 +191,13 @@ int swsusp_shrink_memory(void)
for_each_zone (zone)
if (!is_highmem(zone))
tmp -= zone->free_pages;
+ if (tmp <= 0)
+ tmp = size - image_size / PAGE_SIZE;
if (tmp > 0) {
- tmp = shrink_all_memory(SHRINK_BITE);
+ tmp = shrink_all_memory(tmp);
if (!tmp)
return -ENOMEM;
pages += tmp;
- } else if (size > image_size / PAGE_SIZE) {
- tmp = shrink_all_memory(SHRINK_BITE);
- pages += tmp;
}
printk("\b%c", p[i++%4]);
} while (tmp > 0);
Index: linux-2.6.16-rc6-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.16-rc6-mm1.orig/include/linux/swap.h 2006-03-18 13:29:38.000000000 +1100
+++ linux-2.6.16-rc6-mm1/include/linux/swap.h 2006-03-18 14:50:11.000000000 +1100
@@ -66,6 +66,51 @@ typedef struct {
unsigned long val;
} swp_entry_t;

+struct scan_control {
+ /* Incremented by the number of inactive pages that were scanned */
+ unsigned long nr_scanned;
+
+ unsigned long nr_mapped; /* From page_state */
+
+ /* This context's GFP mask */
+ gfp_t gfp_mask;
+
+ int may_writepage;
+
+ /* Can pages be swapped as part of reclaim? */
+ int may_swap;
+
+ /* This context's SWAP_CLUSTER_MAX. If freeing memory for
+ * suspend, we effectively ignore SWAP_CLUSTER_MAX.
+ * In this context, it doesn't matter that we scan the
+ * whole list at once. */
+ int swap_cluster_max;
+
+#ifdef CONFIG_PM
+ /*
+ * If we're doing suspend to disk, what pass is this.
+ * We decrement to allow code to transparently do normal reclaim
+ * without explicitly setting it to 0.
+ *
+ * 3 = Reclaim from inactive_list only
+ * 2 = Reclaim from active list but don't reclaim mapped
+ * 1 = 2nd pass of type 2
+ * 0 = Reclaim mapped (normal reclaim)
+ */
+ int suspend_pass;
+#endif
+};
+
+/*
+ * When scanning for the swsusp function shrink_all_memory we only shrink
+ * active lists on the 2nd pass.
+ */
+#ifdef CONFIG_PM
+#define suspend_scan_active(sc) ((sc)->suspend_pass < 3)
+#else
+#define suspend_scan_active(sc) 1
+#endif
+
/*
* current->reclaim_state points to one of these when a task is running
* memory reclaim

2006-03-18 04:41:24

by Nick Piggin

Subject: Re: [PATCH][RFC] mm: swsusp shrink_all_memory tweaks

Con Kolivas wrote:

> @@ -1567,7 +1546,7 @@ loop_again:
> zone->temp_priority = DEF_PRIORITY;
> }
>
> - for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> + for_each_priority_reverse(priority) {

What's this for? The for loop is simple and easy to read, after
the change, you have to look somewhere else to see what it does.

> Index: linux-2.6.16-rc6-mm1/include/linux/swap.h
> ===================================================================
> --- linux-2.6.16-rc6-mm1.orig/include/linux/swap.h 2006-03-18 13:29:38.000000000 +1100
> +++ linux-2.6.16-rc6-mm1/include/linux/swap.h 2006-03-18 14:50:11.000000000 +1100
> @@ -66,6 +66,51 @@ typedef struct {
> unsigned long val;
> } swp_entry_t;
>
> +struct scan_control {

Why did you put this here? scan_control really can't go outside vmscan.c,
it is meant only to ease the passing of lots of parameters, and not as a
consistent interface.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-18 04:47:11

by Con Kolivas

Subject: Re: [PATCH][RFC] mm: swsusp shrink_all_memory tweaks

On Saturday 18 March 2006 15:41, Nick Piggin wrote:
> Con Kolivas wrote:
> > @@ -1567,7 +1546,7 @@ loop_again:
> > zone->temp_priority = DEF_PRIORITY;
> > }
> >
> > - for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> > + for_each_priority_reverse(priority) {
>
> What's this for? The for loop is simple and easy to read, after
> the change, you have to look somewhere else to see what it does.

Saw the same for loop 3 times and couldn't resist.

> > Index: linux-2.6.16-rc6-mm1/include/linux/swap.h
> > ===================================================================
> > --- linux-2.6.16-rc6-mm1.orig/include/linux/swap.h 2006-03-18
> > 13:29:38.000000000 +1100 +++
> > linux-2.6.16-rc6-mm1/include/linux/swap.h 2006-03-18 14:50:11.000000000
> > +1100 @@ -66,6 +66,51 @@ typedef struct {
> > unsigned long val;
> > } swp_entry_t;
> >
> > +struct scan_control {
>
> Why did you put this here? scan_control really can't go outside vmscan.c,
> it is meant only to ease the passing of lots of parameters, and not as a
> consistent interface.

#ifdeffery

Cheers,
Con

2006-03-18 04:52:27

by Nick Piggin

Subject: Re: [PATCH][RFC] mm: swsusp shrink_all_memory tweaks

Con Kolivas wrote:
> On Saturday 18 March 2006 15:41, Nick Piggin wrote:

>>>Index: linux-2.6.16-rc6-mm1/include/linux/swap.h
>>>===================================================================
>>>--- linux-2.6.16-rc6-mm1.orig/include/linux/swap.h 2006-03-18
>>>13:29:38.000000000 +1100 +++
>>>linux-2.6.16-rc6-mm1/include/linux/swap.h 2006-03-18 14:50:11.000000000
>>>+1100 @@ -66,6 +66,51 @@ typedef struct {
>>> unsigned long val;
>>> } swp_entry_t;
>>>
>>>+struct scan_control {
>>
>>Why did you put this here? scan_control really can't go outside vmscan.c,
>>it is meant only to ease the passing of lots of parameters, and not as a
>>consistent interface.
>
>
> #ifdeffery
>

Sorry I don't understand...

--
SUSE Labs, Novell Inc.

2006-03-18 04:57:13

by Con Kolivas

Subject: Re: [PATCH][RFC] mm: swsusp shrink_all_memory tweaks

On Saturday 18 March 2006 15:52, Nick Piggin wrote:
> Con Kolivas wrote:
> > On Saturday 18 March 2006 15:41, Nick Piggin wrote:
> >>>Index: linux-2.6.16-rc6-mm1/include/linux/swap.h
> >>>===================================================================
> >>>--- linux-2.6.16-rc6-mm1.orig/include/linux/swap.h 2006-03-18
> >>>13:29:38.000000000 +1100 +++
> >>>linux-2.6.16-rc6-mm1/include/linux/swap.h 2006-03-18 14:50:11.000000000
> >>>+1100 @@ -66,6 +66,51 @@ typedef struct {
> >>> unsigned long val;
> >>> } swp_entry_t;
> >>>
> >>>+struct scan_control {
> >>
> >>Why did you put this here? scan_control really can't go outside vmscan.c,
> >>it is meant only to ease the passing of lots of parameters, and not as a
> >>consistent interface.
> >
> > #ifdeffery
>
> Sorry I don't understand...

My bad.

I added the suspend_pass member to struct scan_control within an #ifdef
CONFIG_PM to allow it to not be unnecessarily compiled in in the !CONFIG_PM
case and wanted to avoid having the #ifdefs in vmscan.c so moved it to a
header file.

Cheers,
Con

2006-03-18 05:45:05

by Nick Piggin

Subject: Re: [PATCH][RFC] mm: swsusp shrink_all_memory tweaks

Con Kolivas wrote:
> On Saturday 18 March 2006 15:52, Nick Piggin wrote:
>
>>Con Kolivas wrote:
>>
>>>
>>>#ifdeffery
>>
>>Sorry I don't understand...
>
>
> My bad.
>
> I added the suspend_pass member to struct scan_control within an #ifdef
> CONFIG_PM to allow it to not be unnecessarily compiled in in the !CONFIG_PM
> case and wanted to avoid having the #ifdefs in vmscan.c so moved it to a
> header file.
>

Oh no, that rule of thumb isn't actually "don't put ifdefs in .c files", but
people commonly say it that way anyway. The rule is actually that you should
put ifdefs in declarations rather than call/usage sites.

You did the right thing there by introducing the accessor, which moves the
ifdef out of code that wants to query the member right? But you can still
leave it in the .c file if it is local (which it is).

--
SUSE Labs, Novell Inc.

2006-03-18 06:15:21

by Con Kolivas

Subject: Re: [PATCH][RFC] mm: swsusp shrink_all_memory tweaks

cc'ed GregKH for comment hopefully.

On Saturday 18 March 2006 16:44, Nick Piggin wrote:
> Con Kolivas wrote:
> > I added the suspend_pass member to struct scan_control within an #ifdef
> > CONFIG_PM to allow it to not be unnecessarily compiled in in the
> > !CONFIG_PM case and wanted to avoid having the #ifdefs in vmscan.c so
> > moved it to a header file.
>
> Oh no, that rule of thumb isn't actually "don't put ifdefs in .c files", but
> people commonly say it that way anyway. The rule is actually that you
> should put ifdefs in declarations rather than call/usage sites.

There isn't a formal reference to this in the CodingStyle documentation, but
Greg's 2002 OLS presentation simply says no ifdefs in .c files.

http://www.kroah.com/linux/talks/ols_2002_kernel_codingstyle_talk/html/mgp00031.html

I'm confused now because I've been working very hard to do this with all code.

> You did the right thing there by introducing the accessor, which moves the
> ifdef out of code that wants to query the member right? But you can still
> leave it in the .c file if it is local (which it is).

Once again I'm happy to do the right thing; I'm just not sure what that is.

Cheers,
Con

2006-03-18 08:30:44

by Nick Piggin

Subject: Re: [PATCH][RFC] mm: swsusp shrink_all_memory tweaks

Con Kolivas wrote:
> cc'ed GregKH for comment hopefully.

>>You did the right thing there by introducing the accessor, which moves the
>>ifdef out of code that wants to query the member right? But you can still
>>leave it in the .c file if it is local (which it is).
>
>
> Once again I'm happy to do the right thing; I'm just not sure what that is.
>

Well, struct scan_control escaping from vmscan.c is not the right thing
(try to get that past Andrew!). Obviously in this case, having the ifdef
in the .c file is OK.

I guess Greg's presentation is a first order approximation to get people
thinking in the right way. I mean we do it all the time, and in core kernel
code too (our favourite sched.c is a prime example).

--
SUSE Labs, Novell Inc.

2006-03-18 09:40:46

by Con Kolivas

Subject: Re: [PATCH][RFC] mm: swsusp shrink_all_memory tweaks

On Saturday 18 March 2006 19:30, Nick Piggin wrote:
> Con Kolivas wrote:
> > cc'ed GregKH for comment hopefully.
> >
> >>You did the right thing there by introducing the accessor, which moves
> >> the ifdef out of code that wants to query the member right? But you can
> >> still leave it in the .c file if it is local (which it is).
> >
> > Once again I'm happy to do the right thing; I'm just not sure what that
> > is.
>
> Well, struct scan_control escaping from vmscan.c is not the right thing
> (try to get that past Andrew!). Obviously in this case, having the ifdef
> in the .c file is OK.
>
> I guess Greg's presentation is a first order approximation to get people
> thinking in the right way. I mean we do it all the time, and in core kernel
> code too (our favourite sched.c is a prime example).

Ok here's a respin without touching swap.h and leaving the code otherwise the
same.

Cheers,
Con
---
This patch is a rewrite of the shrink_all_memory function used by swsusp
prior to suspending to disk.

The special hooks into balance_pgdat for shrink_all_memory have been removed
thus simplifying that function significantly.

Some code will now be compiled out in the !CONFIG_PM case.

shrink_all_memory now uses shrink_zone and shrink_slab directly with an extra
entry in the struct scan_control suspend_pass. This is used to alter what
lists will be shrunk by shrink_zone on successive passes. The aim of this is
to alter the reclaim logic to choose the best pages to keep on resume, to
free the minimum amount of memory required to suspend and free the memory
faster.

Signed-off-by: Con Kolivas <[email protected]>

kernel/power/swsusp.c | 10 ---
mm/vmscan.c | 164 ++++++++++++++++++++++++++++++--------------------
2 files changed, 104 insertions(+), 70 deletions(-)

Index: linux-2.6.16-rc6-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.16-rc6-mm1.orig/mm/vmscan.c 2006-03-18 13:29:38.000000000 +1100
+++ linux-2.6.16-rc6-mm1/mm/vmscan.c 2006-03-18 19:47:38.000000000 +1100
@@ -74,8 +74,32 @@ struct scan_control {
* In this context, it doesn't matter that we scan the
* whole list at once. */
int swap_cluster_max;
+
+#ifdef CONFIG_PM
+ /*
+ * If we're doing suspend to disk, what pass is this.
+ * We decrement to allow code to transparently do normal reclaim
+ * without explicitly setting it to 0.
+ *
+ * 3 = Reclaim from inactive_list only
+ * 2 = Reclaim from active list but don't reclaim mapped
+ * 1 = 2nd pass of type 2
+ * 0 = Reclaim mapped (normal reclaim)
+ */
+ int suspend_pass;
+#endif
};

+/*
+ * When scanning for the swsusp function shrink_all_memory we only shrink
+ * active lists on the 2nd pass.
+ */
+#ifdef CONFIG_PM
+#define suspend_scan_active(sc) ((sc)->suspend_pass < 3)
+#else
+#define suspend_scan_active(sc) 1
+#endif
+
#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))

#ifdef ARCH_HAS_PREFETCH
@@ -1327,7 +1351,8 @@ static void shrink_active_list(unsigned
}

/*
- * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
+ * This is a basic per-zone page freer. Used by kswapd, direct reclaim and
+ * the swsusp specific shrink_all_memory functions.
*/
static unsigned long shrink_zone(int priority, struct zone *zone,
struct scan_control *sc)
@@ -1345,7 +1370,7 @@ static unsigned long shrink_zone(int pri
*/
zone->nr_scan_active += (zone->nr_active >> priority) + 1;
nr_active = zone->nr_scan_active;
- if (nr_active >= sc->swap_cluster_max)
+ if (nr_active >= sc->swap_cluster_max && suspend_scan_active(sc))
zone->nr_scan_active = 0;
else
nr_active = 0;
@@ -1422,7 +1447,12 @@ static unsigned long shrink_zones(int pr
}
return nr_reclaimed;
}
-
+
+#define for_each_priority_reverse(priority) \
+ for (priority = DEF_PRIORITY; \
+ priority >= 0; \
+ priority--)
+
/*
* This is the main entry point to direct page reclaim.
*
@@ -1466,7 +1496,7 @@ unsigned long try_to_free_pages(struct z
lru_pages += zone->nr_active + zone->nr_inactive;
}

- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for_each_priority_reverse(priority) {
sc.nr_mapped = read_page_state(nr_mapped);
sc.nr_scanned = 0;
if (!priority)
@@ -1516,10 +1546,6 @@ out:
* For kswapd, balance_pgdat() will work across all this node's zones until
* they are all at pages_high.
*
- * If `nr_pages' is non-zero then it is the number of pages which are to be
- * reclaimed, regardless of the zone occupancies. This is a software suspend
- * special.
- *
* Returns the number of pages which were actually freed.
*
* There is special handling here for zones which are full of pinned pages.
@@ -1537,10 +1563,8 @@ out:
* the page allocator fallback scheme to ensure that aging of pages is balanced
* across the zones.
*/
-static unsigned long balance_pgdat(pg_data_t *pgdat, unsigned long nr_pages,
- int order)
+static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
{
- unsigned long to_free = nr_pages;
int all_zones_ok;
int priority;
int i;
@@ -1550,7 +1574,7 @@ static unsigned long balance_pgdat(pg_da
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.may_swap = 1,
- .swap_cluster_max = nr_pages ? nr_pages : SWAP_CLUSTER_MAX,
+ .swap_cluster_max = SWAP_CLUSTER_MAX,
};

loop_again:
@@ -1567,7 +1591,7 @@ loop_again:
zone->temp_priority = DEF_PRIORITY;
}

- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for_each_priority_reverse(priority) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;

@@ -1577,31 +1601,27 @@ loop_again:

all_zones_ok = 1;

- if (nr_pages == 0) {
- /*
- * Scan in the highmem->dma direction for the highest
- * zone which needs scanning
- */
- for (i = pgdat->nr_zones - 1; i >= 0; i--) {
- struct zone *zone = pgdat->node_zones + i;
+ /*
+ * Scan in the highmem->dma direction for the highest
+ * zone which needs scanning
+ */
+ for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+ struct zone *zone = pgdat->node_zones + i;

- if (!populated_zone(zone))
- continue;
+ if (!populated_zone(zone))
+ continue;

- if (zone->all_unreclaimable &&
- priority != DEF_PRIORITY)
- continue;
+ if (zone->all_unreclaimable &&
+ priority != DEF_PRIORITY)
+ continue;

- if (!zone_watermark_ok(zone, order,
- zone->pages_high, 0, 0)) {
- end_zone = i;
- goto scan;
- }
+ if (!zone_watermark_ok(zone, order, zone->pages_high,
+ 0, 0)) {
+ end_zone = i;
+ goto scan;
}
- goto out;
- } else {
- end_zone = pgdat->nr_zones - 1;
}
+ goto out;
scan:
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
@@ -1628,11 +1648,9 @@ scan:
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;

- if (nr_pages == 0) { /* Not software suspend */
- if (!zone_watermark_ok(zone, order,
- zone->pages_high, end_zone, 0))
- all_zones_ok = 0;
- }
+ if (!zone_watermark_ok(zone, order, zone->pages_high,
+ end_zone, 0))
+ all_zones_ok = 0;
zone->temp_priority = priority;
if (zone->prev_priority > priority)
zone->prev_priority = priority;
@@ -1657,8 +1675,6 @@ scan:
total_scanned > nr_reclaimed + nr_reclaimed / 2)
sc.may_writepage = 1;
}
- if (nr_pages && to_free > nr_reclaimed)
- continue; /* swsusp: need to do more work */
if (all_zones_ok)
break; /* kswapd: all done */
/*
@@ -1674,7 +1690,7 @@ scan:
* matches the direct reclaim path behaviour in terms of impact
* on zone->*_priority.
*/
- if ((nr_reclaimed >= SWAP_CLUSTER_MAX) && !nr_pages)
+ if (nr_reclaimed >= SWAP_CLUSTER_MAX)
break;
}
out:
@@ -1756,7 +1772,7 @@ static int kswapd(void *p)
}
finish_wait(&pgdat->kswapd_wait, &wait);

- balance_pgdat(pgdat, 0, order);
+ balance_pgdat(pgdat, order);
}
return 0;
}
@@ -1786,36 +1802,58 @@ void wakeup_kswapd(struct zone *zone, in
#ifdef CONFIG_PM
/*
* Try to free `nr_pages' of memory, system-wide. Returns the number of freed
- * pages.
+ * pages. It does this via shrink_zone in passes. Rather than trying to age
+ * LRUs the aim is to preserve the overall LRU order by reclaiming
+ * preferentially inactive > active > active referenced > active mapped
*/
unsigned long shrink_all_memory(unsigned long nr_pages)
{
- pg_data_t *pgdat;
- unsigned long nr_to_free = nr_pages;
unsigned long ret = 0;
- unsigned retry = 2;
- struct reclaim_state reclaim_state = {
- .reclaimed_slab = 0,
+ struct scan_control sc = {
+ .gfp_mask = GFP_KERNEL,
+ .may_swap = 1,
+ .swap_cluster_max = nr_pages,
+ .suspend_pass = 3,
+ .may_writepage = 1,
};

delay_swap_prefetch();

- current->reclaim_state = &reclaim_state;
-repeat:
- for_each_online_pgdat(pgdat) {
- unsigned long freed;
+ do {
+ int priority;

- freed = balance_pgdat(pgdat, nr_to_free, 0);
- ret += freed;
- nr_to_free -= freed;
- if ((long)nr_to_free <= 0)
- break;
- }
- if (retry-- && ret < nr_pages) {
- blk_congestion_wait(WRITE, HZ/5);
- goto repeat;
- }
- current->reclaim_state = NULL;
+ for_each_priority_reverse(priority) {
+ struct zone *zone;
+ unsigned long lru_pages = 0;
+
+ for_each_zone(zone) {
+ unsigned long freed;
+
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone->all_unreclaimable &&
+ priority != DEF_PRIORITY)
+ continue;
+
+ lru_pages += zone->nr_active +
+ zone->nr_inactive;
+ /*
+ * shrink_active_list needs this to reclaim
+ * mapped pages
+ */
+ if (!sc.suspend_pass)
+ zone->prev_priority = 0;
+ freed = shrink_zone(priority, zone, &sc);
+ ret += freed;
+ if (ret > nr_pages)
+ goto out;
+ }
+ shrink_slab(0, sc.gfp_mask, lru_pages);
+ }
+ blk_congestion_wait(WRITE, HZ / 5);
+ } while (--sc.suspend_pass >= 0);
+out:
return ret;
}
#endif
Index: linux-2.6.16-rc6-mm1/kernel/power/swsusp.c
===================================================================
--- linux-2.6.16-rc6-mm1.orig/kernel/power/swsusp.c 2006-03-18 13:29:38.000000000 +1100
+++ linux-2.6.16-rc6-mm1/kernel/power/swsusp.c 2006-03-18 13:30:52.000000000 +1100
@@ -173,9 +173,6 @@ void free_all_swap_pages(int swap, struc
* Notice: all userland should be stopped before it is called, or
* livelock is possible.
*/
-
-#define SHRINK_BITE 10000
-
int swsusp_shrink_memory(void)
{
long size, tmp;
@@ -194,14 +191,13 @@ int swsusp_shrink_memory(void)
for_each_zone (zone)
if (!is_highmem(zone))
tmp -= zone->free_pages;
+ if (tmp <= 0)
+ tmp = size - image_size / PAGE_SIZE;
if (tmp > 0) {
- tmp = shrink_all_memory(SHRINK_BITE);
+ tmp = shrink_all_memory(tmp);
if (!tmp)
return -ENOMEM;
pages += tmp;
- } else if (size > image_size / PAGE_SIZE) {
- tmp = shrink_all_memory(SHRINK_BITE);
- pages += tmp;
}
printk("\b%c", p[i++%4]);
} while (tmp > 0);

2006-03-20 08:56:53


by Pavel Machek

[permalink] [raw]
Subject: Re: does swsusp suck after resume for you?

On Pá 17-03-06 08:33:26, Con Kolivas wrote:
> > > > > The tunable in /proc/sys/vm/swap_prefetch is now bitwise ORed:
> > > > > Thus if you set this value
> > > > > to 3 it will prefetch aggressively and then drop back to the default
> > > > > of 1. This makes it easy to simply set the aggressive flag once and
> > > > > forget about it. I've booted and tested this feature and it's working
> > > > > nicely. Where exactly you'd set this in your resume scripts I'm not
> > > > > sure. A rolled up patch against 2.6.16-rc6-mm1 is here for
> > > > > simplicity:
>
> correct url:
> http://ck.kolivas.org/patches/swap-prefetch/2.6.16-rc6-mm1-swap_prefetch_test.patch

I'm sorry, I'm leaving for the mountains tomorrow, so it will take me a
while to test it.

> > > 2 means aggressively prefetch as much as possible and then disable swap
> > > prefetching from that point on. Too confusing?
> >
> > Ahha... oops, yes, clever; no, I guess keep it.
>
> Ok the patch works fine for me and the feature is worthwhile in absolute terms
> as well as for improving resume.

Good.

> Pavel, while we're talking about improving behaviour after resume I had a look
> at the mechanism used to free up ram before suspending and I can see scope
> for some changes in the vm code that would improve the behaviour after
> resuming. Is the mechanism used to free up ram going to continue being used
> with uswsusp? If so, I'd like to have a go at improving the free up
> ram vm

Yes, it is.

> code to make it behave nicer after resume. I have some ideas about how best
> to free up ram differently from normal reclaim which would improve behaviour
> post resume.

One possible improvement would be to never ever return 0 if there can
still be more memory freed. Rafael did some ugly workaround, but we
do not even understand the problem.
Pavel
--
82: return SampleTable;

2006-03-20 12:46:22

by Jun OKAJIMA

[permalink] [raw]
Subject: Re: Faster resuming of suspend technology.

>>
>> But you have to read all of the pages at some point so the hard disk is
>> going to be the bottleneck no matter what you do. And since Suspend2
>> currently saves the cache as a contiguous stream, possibly compressed, it
>> should be a good bit faster than seeking around the disk loading the files
>> from the filesystem.
>
>Agreed.
>

First, sorry for the delay in replying; I needed time to consider.
My conclusion is: "I also agree".
BTW, this discussion also continues on the CK list.
If folks are interested in faster resuming with background swap reading,
check the CK list. Of course, I agree with the CK list discussion as well.
Background reading of the suspend image is almost the same as what I imagined
when I first posted, although I am not sure whether the implementation, such
as extending the swap prefetch feature, is the best way.



>> > >> Especially, your way has problem if you boot( resume ) not from HDD
>> > >> but for example, from NFS server or CD-R or even from Internet.
>> > >
>> > >Resuming from the internet? Scary. Anyway, I hope I'll understand better
>> > > what you're getting at after your next reply.
>> >
>> > In Japan, it is not so scary.
>> > We have 100Mbps symmetric FTTH ( optical Fiber To The Home), and
>> > more than 1M homes have it, and price is about 30USD/month.
>> > With this, theoretically you can download 600MB ISO image in one min,
>> > and actually you can download 100MBytes suspend image within 30sec.
>> > So, not click to run (e.g. Java applet) but "click to resume" is not
>> > dreaming but rather feasible. You still think it is scary on this
>> > situation?
>>
>> I don't think the scary part is speed, but security. I for one wouldn't
>> want to resume from an image hosted on a remote machine unless I had some
>> way to be sure it wasn't tampered with, like gpg signing or something.
>
>Another issues is that at the moment, hotplugging is work in progress. In
>order to resume, you currently need the same kernel build you're booting
>with, and the same hardware configuration in the resumed system. As hotplug
>matures, this restriction might relax, and we could probably come up with a
>way around the former restriction, but at the moment, it really only makes
>sense to try to resume an image you created using the same machine.
>

Wait, wait. Let me make clear what we are discussing.

For me, the theme is "faster resuming with suspend technology", not swsusp2.
From this point of view, the most practical candidate for now would be
Xen suspend, not swsusp2. Of course, once hotplugging arrives, swsusp2 will
also be a good candidate, and hopefully what I call a "generic suspend image"
will become possible.

I admit that Jim Crilly's concern is right, but with Xen suspend
it can be solved very easily. What you do is just like this:
[Xen DOM0]# wget http://www.geocity.com/1235089/suspend_image/debian.image
[Xen DOM0]# gpg --verify debian.image
[Xen DOM0]# xen --resume debian.image


--- Okajima, Jun. Tokyo, Japan.

2006-03-21 11:34:12

by Jun OKAJIMA

[permalink] [raw]
Subject: Fwd: Faster resuming of suspend technology.


This is forwarded from Cunningham's mail, which failed to reach LKML
because of a false positive from LKML's spam filter.
I am also posting this to Xen-devel and swsusp2-devel.

For the purpose of faster booting (= resuming, in this case),
which suspend technology do you think is good, and in what respect?

--- Okajima.

-----------
Hi.

On Saturday 18 March 2006 03:46, Jun OKAJIMA wrote:
> >> > >> Especially, your way has problem if you boot( resume ) not from HDD
> >> > >> but for example, from NFS server or CD-R or even from Internet.
> >> > >
> >> > >Resuming from the internet? Scary. Anyway, I hope I'll understand
> >> > > better what you're getting at after your next reply.
> >> >
> >> > In Japan, it is not so scary.
> >> > We have 100Mbps symmetric FTTH ( optical Fiber To The Home), and
> >> > more than 1M homes have it, and price is about 30USD/month.
> >> > With this, theoretically you can download 600MB ISO image in one min,
> >> > and actually you can download 100MBytes suspend image within 30sec.
> >> > So, not click to run (e.g. Java applet) but "click to resume" is not
> >> > dreaming but rather feasible. You still think it is scary on this
> >> > situation?
> >>
> >> I don't think the scary part is speed, but security. I for one wouldn't
> >> want to resume from an image hosted on a remote machine unless I had
> >> some way to be sure it wasn't tampered with, like gpg signing or
> >> something.
> >
> >Another issues is that at the moment, hotplugging is work in progress. In
> >order to resume, you currently need the same kernel build you're booting
> >with, and the same hardware configuration in the resumed system. As
> > hotplug matures, this restriction might relax, and we could probably come
> > up with a way around the former restriction, but at the moment, it really
> > only makes sense to try to resume an image you created using the same
> > machine.
>
> Wait, wait. Let make it clear that what we are discussing.
>
> For me, the theme is "faster resuming with suspend technology", not
> swsusp2. I mean, in this point of view, the most practical candidate for
> now would be Xen suspend, not swsusp2. Of course, hotplugging once comes,
> swsusp2 will be a good candidate also, and hopefully what I call "generic
> suspend image" would be possible.

I wasn't thinking suspend2 was the topic, but I'll freely admit my bias and
say I think it's the best tool for the job, for a number of reasons:

First, speed is not the only criterion that should be considered. There's also
memory overhead, the difference in speed post-resume, reliability,
flexibility and the list goes on.

Second, Xen would not be the most practical candidate now. It would be slower
than suspend2 because suspend2 is reading the image as fast as the hardware
will allow it (Ok. Perhaps algorithm changes could make small improvements
here and there). In contrast, what is Xen doing? I'm not claiming knowledge
of its internals, but I'm sure it will have at least some emphasis on
keeping other vms (or whatever it calls them) running and interactive while
the resume is occurring. It will therefore surely be resuming at something less
than the fastest possible rate.

Additionally, Xen cannot solve the problems raised by the kernel lacking
complete hotplug support. Only further development in the kernel itself can
address those issues.

> I admit that Jim Crilly's concern is right, but with using Xen suspend,
> it can be solved very easily. What you do is just like this:
> [Xen DOM0]# wget
> http://www.geocity.com/1235089/suspend_image/debian.image [Xen DOM0]# gpg
> --verify debian.image
> [Xen DOM0]# xen --resume debian.image

Given this example, I guess you're talking about Xen (or vmware for that
matter) providing an abstraction of the hardware that's really available.
Doesn't this still have the problems I mentioned above, namely that your Xen
image can't possibly have support for any possible hardware the user might
have, allowing that hardware to be used with full functionality and full
speed? Surely any such solution must be viewed as second best, at best?

Regards,

Nigel

2006-03-27 23:54:10

by Jun OKAJIMA

[permalink] [raw]
Subject: Re: Fwd: Faster resuming of suspend technology.

>
>I wasn't thinking suspend2 was the topic, but I'll freely admit my bias and
>say I think it's the best tool for the job, for a number of reasons:
>
>First, speed is not the only criteria that should be considered. There's also
>memory overhead, the difference in speed post-resume, reliability,
>flexibility and the list goes on.
>
>Second, Xen would not be the most practical candidate now. It would be slower
>than suspend2 because suspend2 is reading the image as fast as the hardware
>will allow it (Ok. Perhaps algorithm changes could make small improvements
>here and there). In contrast, what is Xen doing? I'm not claiming knowledge
>of its internals, but I'm sure it will have at least some emphasis on
>keeping other vms (or whatever it calls them) running and interactive while
>the resume is occuring. It will therefore surely be resuming at something less
>than the fastest possible rate.
>
>Additionally, Xen cannot solve the problems raised by the kernel lacking
>complete hotplug support. Only further development in the kernel itself can
>address those issues.
>

I did some very simple testing.

H/W
CPU:Sempron64 2600+
MEM:1G for Xen3.0 (I put 768MB for dom0, and 256MB for domU)
256MB for swsusp2
WAN:100Mbps FTTH ( up to about 8MBytes/sec , from ISP's web server).
HDD:250G 7200rpm ATA
DVD:x16 DVD-R ATA
S/W
SuSE10 with Xen3.0
Using KDE3 desktop, with Firefox and OOo 2.0 Writer launched.

Performance:
swsusp2 -> about 10sec after "uncompressing Linux kernel".
(from HDD, of course.)
Xen resume -> almost same! But needs to boot dom0 first.

In the Xen experiment, I booted dom0 from HDD but loaded the suspend image
from x16 DVD-R. It resumed in about 10 sec, including the decompression
time of the suspend image. This means Xen can resume from DVD-R at almost
the same speed as swsusp2, with the H/W abstraction that current swsusp2 lacks.
(Note: I did the VNC reconnection workaround manually, so the time is just
an estimate.)
And, for example, if you boot dom0 within 10 sec (and this
is quite possible; check my site http://www.machboot.com/), you can get
KDE3+FF1.5+OOo2 working within 20 sec measured from ISOLINUX load,
with x16 DVD-R. Yes, DVD is not slow any more!

I also tried to do a Xen resume from the Internet.
What I did was very easy. Just like this
(sorry, no gpg yet):
# wget $URL -O - | gzip -d > /tmp/$TMP.chk && xm restore /tmp/$TMP.chk

The result is that I succeeded in "booting" (actually resuming) KDE3+FF1.5+OOo2
in about 15 sec from the Internet! I believe this is the fastest record of
Internet booting ever.

What I want to say is that using Xen suspend is one way to "boot" your desktop
faster, especially if you use big apps and a big window manager.

Note: this experiment was a very rough one, with no guarantee of correctness
or reproducibility. It must contain many mistakes, misunderstandings,
misconceptions, and so on; I am afraid even I might not be able to reproduce it.
So don't accept these figures on faith; treat them as just one suggestion.
But I believe the suggestion is a meaningful one.
Don't you want to boot your desktop within 20 sec from an x48 CD-R?
I suggest this is not just a dream, but may well be feasible.

>> I admit that Jim Crilly's concern is right, but with using Xen suspend,
>> it can be solved very easily. What you do is just like this:
>> [Xen DOM0]# wget
>> http://www.geocity.com/1235089/suspend_image/debian.image [Xen DOM0]# gpg
>> --verify debian.image
>> [Xen DOM0]# xen --resume debian.image
>
>Given this example, I guess you're talking about Xen (or vmware for that
>matter) providing an abstraction of the hardware that's really available.
>Doesn't this still have the problems I mentioned above, namely that your Xen
>image can't possibly have support for any possible hardware the user might
>have, allowing that hardware to be used with full functionality and full
>speed. Surely such any such solution must be viewed as second best, at best?
>
>

I have not checked this feature yet.
I only have one Xen-installed PC, and to make matters worse,
its condition is very unstable, so it is a bit tough
to check this by myself.

Does somebody know about this?
I mean, does Xen really not have a H/W abstraction layer?
I think it must, and that you can use the same suspend image on all Xen PCs.
--- Okajima, Jun. Tokyo, Japan.

2006-03-28 00:29:52

by Nigel Cunningham

[permalink] [raw]
Subject: Re: Fwd: Faster resuming of suspend technology.

Hi.

On Tuesday 28 March 2006 09:57, Jun OKAJIMA wrote:
> >I wasn't thinking suspend2 was the topic, but I'll freely admit my bias
> > and say I think it's the best tool for the job, for a number of reasons:
> >
> >First, speed is not the only criteria that should be considered. There's
> > also memory overhead, the difference in speed post-resume, reliability,
> > flexibility and the list goes on.
> >
> >Second, Xen would not be the most practical candidate now. It would be
> > slower than suspend2 because suspend2 is reading the image as fast as the
> > hardware will allow it (Ok. Perhaps algorithm changes could make small
> > improvements here and there). In contrast, what is Xen doing? I'm not
> > claiming knowledge of its internals, but I'm sure it will have at least
> > some emphasis on keeping other vms (or whatever it calls them) running
> > and interactive while the resume is occuring. It will therefore surely be
> > resuming at something less than the fastest possible rate.
> >
> >Additionally, Xen cannot solve the problems raised by the kernel lacking
> >complete hotplug support. Only further development in the kernel itself
> > can address those issues.
>
> I made very easy testing.
>
> H/W
> CPU:Sempron64 2600+
> MEM:1G for Xen3.0 (I put 768MB for dom0, and 256MB for domU)
> 256MB for swsusp2
> WAN:100Mbps FTTH ( up to about 8MBytes/sec , from ISP's web server).
> HDD:250G 7200rpm ATA
> DVD:x16 DVD-R ATA
> S/W
> SuSE10 with Xen3.0
> Using KDE3 desktop, with Firefox and OOo 2.0 Writer launched.
>
> Performance:
> swsusp2 -> about 10sec after "uncompressing Linux kernel".
> (from HDD, of course.)

How was suspend2 configured? On a 7200rpm ATA drive, I'd expect 36MB/s
throughput. That alone would give you your 10s. But if you add LZF
compression to the mix, you should be able to resume in half the time
(literally - LZF usually achieves ~50% compression on an image).

> Xen resume -> almost same! But needs to boot dom0 first.

Impressive. I was afraid it might take much longer. Is that getting all the
image in, or is more of the image pulled in as necessary?

> On Xen experiment, I booted dom0 from HDD, but loaded the suspend image
> from x16 DVD-R. And, it resumed about in 10sec including decompressing
> time of suspend image. This means, Xen can resume almost same speed
> as swsusp2 from DVD-R, with H/W abstraction which current swsusp2 lacks.
> (Note: I did vnc reconnection workaround manually, so the time is just
> an estimation.)
> And, for example, if you boot dom0 up within 10 sec, ( and this
> is quite possible, check my site http://www.machboot.com/), you can get
> KDE3+FF1.5+OOo2 workinig within 20sec measured from ISOLINUX loaded,
> with x16 DVD-R. Yes, DVD is not slow any more!.
>
> And, I also tried to do Xen resume from Internet.
> What I did was very easy. Just did like this.
> (Sorry, no gpg yet.)
> # wget $URL -O - | gzip -d > /tmp/$TMP.chk && xm restore /tmp/$TMP.chk
>
> The result is, I succeeded to "boot" (actually resume) KDE3+FF1.5+OOo2
> in about 15 sec from Internet!. I believe this is the fastest record of
> Internet booting ever.

Impressive!

> What I want to say is, using Xen suspend is one way to "boot" your desktop
> faster, especially if you use big apps and big window manager.
>
> Note: This experiment is very easy one, and no guarantee of correctness or
> reproducitivity. Must have many mistakes and misunderstanding and
> misconception, so on. I am afraid that even me could not reproduce it.
> So, dont accept this figure on faith, but treat as just one suggestion.
> But, I believe my suggestion must be meaningful one.
> Dont you want to boot your desktop within 20 sec from x48 CD-R?
> I suggest that this is not just a dream, but maybe feasible.

For live CDs it might be attractive, but for your average HDD-based
installation, I wouldn't think that using the CD would be that interesting.
Nevertheless, yes - booting more quickly from whatever media is desirable.

> >> I admit that Jim Crilly's concern is right, but with using Xen suspend,
> >> it can be solved very easily. What you do is just like this:
> >> [Xen DOM0]# wget

(pretend url removed so LKML servers don't think this is spam)

> >> gpg --verify debian.image
> >> [Xen DOM0]# xen --resume debian.image
> >
> >Given this example, I guess you're talking about Xen (or vmware for that
> >matter) providing an abstraction of the hardware that's really available.
> >Doesn't this still have the problems I mentioned above, namely that your
> > Xen image can't possibly have support for any possible hardware the user
> > might have, allowing that hardware to be used with full functionality and
> > full speed. Surely such any such solution must be viewed as second best,
> > at best?
>
> I have not checked this feature yet.
> I only have one Xen installed PC and to make matters worse,
> the condition of the PC is very unstable, so it is a bit tough
> to check this by myself.
>
> Do somebody know about this?
> I mean, Xen really does not have an abstraction layer of the H/W?

Yeah, I guess it would too. Sorry for my wonky thinking there.

> I think it must have and you can use the same suspend image on all Xen PCs.

Yeah.

Regards,

Nigel



2006-03-28 12:48:20

by Keir Fraser

[permalink] [raw]
Subject: Re: [Xen-devel] Re: Fwd: Faster resuming of suspend technology.


On 28 Mar 2006, at 01:28, Nigel Cunningham wrote:

>> I think it must have and you can use the same suspend image on all
>> Xen PCs.
>
> Yeah.

Certain CPU features can screw things up. So moving from a CPU that
supports SSE2 to one that doesn't is unlikely to work well, for
example. But as long as CPU capabilities are a reasonably close match
then yes, you should be able to use a suspend image anywhere.

-- Keir