LinuxLists.cc - Software suspend testing in 2.6.0-test1

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Hi!

> I have done some testing of the software suspend function in
> 2.6.0-test1. It works mostly very well, but I have found two problems.
>
> The first problem is that software suspend fails if a process is
> stopped before you invoke suspend. (For example, by starting cat from
> the shell and pressing ctrl-z.) When the processes are woken up again,
> the cat process is stuck in the schedule loop in refrigerator(),
> sucking up all available cpu time.

Thanks for a report. If it came with a patch it would be better
;-). I'll take a look.

> The second problem is that freeing memory seems to be much slower than
> it has to be. It appears to be caused by the call to
> blk_congestion_wait() in balance_pgdat(). The patch below makes page
> freeing much faster, although I'm quite sure the patch is not correct.
>
> How can we fix this properly? The disk is mostly idle during page
> freeing, but it looks like blk_congestion_wait still doesn't return
> until the timeout expires. I tried HZ/2 and that made the page freeing
> extremely slow.
>
> --- linux/mm/vmscan.c.old Thu Jul 17 21:30:09 2003
> +++ linux/mm/vmscan.c Thu Jul 17 21:29:58 2003
> @@ -930,7 +930,7 @@
> }
> if (all_zones_ok)
> break;
> - blk_congestion_wait(WRITE, HZ/10);
> + blk_congestion_wait(WRITE, HZ/50);
> }
> return nr_pages - to_free;
> }

This is certainly not okay. Andrew, you know more about vm
internals... What does this ugly constant mean? Would it be possible
to somehow make it variable and set it to zero during software suspend
memory freeing?
Pavel

--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

2003-07-17 20:02:07

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Pavel Machek <[email protected]> wrote:
>
> > --- linux/mm/vmscan.c.old Thu Jul 17 21:30:09 2003
> > +++ linux/mm/vmscan.c Thu Jul 17 21:29:58 2003
> > @@ -930,7 +930,7 @@
> > }
> > if (all_zones_ok)
> > break;
> > - blk_congestion_wait(WRITE, HZ/10);
> > + blk_congestion_wait(WRITE, HZ/50);
> > }
> > return nr_pages - to_free;
> > }
>
> This is certainly not okay. Andrew, you know more about vm
> internals... What does this ugly constant mean?

Most of the time the timeout is a "can't happen" - blk_congestion_wait() is
terminated by completion of writeout. The timeout is mainly there to
prevent hangs if weird and rare races happen. Otherwise we'd need lots
more locking.

I don't think we want to be calling it at all if reclaim is working well.
Something like this.

diff -puN mm/vmscan.c~a mm/vmscan.c
--- 25/mm/vmscan.c~a Thu Jul 17 13:05:36 2003
+++ 25-akpm/mm/vmscan.c Thu Jul 17 13:05:58 2003
@@ -921,7 +921,7 @@ static int balance_pgdat(pg_data_t *pgda
if (i < ZONE_HIGHMEM) {
reclaim_state->reclaimed_slab = 0;
shrink_slab(max_scan + nr_mapped, GFP_KERNEL);
- to_free += reclaim_state->reclaimed_slab;
+ to_free -= reclaim_state->reclaimed_slab;
}
if (zone->all_unreclaimable)
continue;
@@ -930,7 +930,8 @@ static int balance_pgdat(pg_data_t *pgda
}
if (all_zones_ok)
break;
- blk_congestion_wait(WRITE, HZ/10);
+ if (to_free)
+ blk_congestion_wait(WRITE, HZ/10);
}
return nr_pages - to_free;
}

_

2003-07-18 10:11:40

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Peter Osterlund <[email protected]> wrote:
>
> If this patch is an acceptable approach to fix the problem,

Seems reasonable.

> the balance_pgdat function should probably be cleaned up.

Well it was rather bolted on the side of the kswapd code. But from an API
perspective, being able to tell it how many page to free is a bit more
flexible. Minor point.

However I'm trying to remember why the code exists at all. Why doesn't
swsusp just allocate lots of pages then free them again?

Something like:

LIST_HEAD(list);
int sleep_count = 0;

while (sleep_count < 10) {
page = __alloc_pages(0, GFP_ATOMIC);
if (page) {
list_add(&page->list, &list);
} else {
blk_congestion_wait(WRITE, HZ/20);
sleep_count++;
}
}
<free all the pages on `list'>

2003-07-18 10:07:53

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Andrew Morton <[email protected]> writes:

> Pavel Machek <[email protected]> wrote:
> >
> > > --- linux/mm/vmscan.c.old Thu Jul 17 21:30:09 2003
> > > +++ linux/mm/vmscan.c Thu Jul 17 21:29:58 2003
> > > @@ -930,7 +930,7 @@
> > > }
> > > if (all_zones_ok)
> > > break;
> > > - blk_congestion_wait(WRITE, HZ/10);
> > > + blk_congestion_wait(WRITE, HZ/50);
> > > }
> > > return nr_pages - to_free;
> > > }
> >
> > This is certainly not okay. Andrew, you know more about vm
> > internals... What does this ugly constant mean?
>
> Most of the time the timeout is a "can't happen" - blk_congestion_wait() is
> terminated by completion of writeout. The timeout is mainly there to
> prevent hangs if weird and rare races happen. Otherwise we'd need lots
> more locking.
>
> I don't think we want to be calling it at all if reclaim is working well.
> Something like this.
...
> - blk_congestion_wait(WRITE, HZ/10);
> + if (to_free)
> + blk_congestion_wait(WRITE, HZ/10);

It doesn't make any difference in my tests. to_free is very large when
called from the suspend code, so it almost never becomes zero.

With the patch below, page freeing is very fast on my computer. The
idea is to not call blk_congestion_wait at all until we can't make any
more progress. Then make the call, try again, and if we still can't
make any more progress, give up.

I've tested this patch in two scenarios. The first is to boot rh7.3 on
a 256MB computer to runlevel 3, log in and start emacs, then
immediately suspend. The old code could easily take 15s while the new
code runs in ~1s.

The second test case was to also start a program that allocates 200MB
and goes to sleep. The new code swaps out all data and the disk light
is turned on constantly. The old code swapped out some data (the disk
light was not turned on constantly), then gave up and failed to
suspend. (I only tested the old code once. Maybe this was just "bad
luck".)

If this patch is an acceptable approach to fix the problem, the
balance_pgdat function should probably be cleaned up. We either call
it with nr_pages=0 (from kswapd) or nr_pages="big number" (from
shrink_all_memory). Maybe we should replace the nr_pages parameter
with a boolean parameter, or just split the function in two. It should
be possible to make some simplifications for the "free as much as
possible" case.

What do you think?

diff -u -r orig/linux-p1/kernel/suspend.c linux-p1/kernel/suspend.c
--- orig/linux-p1/kernel/suspend.c Sat Jul 12 00:52:16 2003
+++ linux-p1/kernel/suspend.c Fri Jul 18 11:30:29 2003
@@ -621,9 +621,20 @@
*/
static void free_some_memory(void)
{
+ int state = 0;
+
printk("Freeing memory: ");
- while (shrink_all_memory(10000))
+ for (;;) {
+ if (shrink_all_memory(10000)) {
+ state = 0;
+ } else {
+ state++;
+ if (state > 1)
+ break;
+ blk_congestion_wait(WRITE, HZ/5);
+ }
printk(".");
+ }
printk("|\n");
}

diff -u -r orig/linux-p1/mm/vmscan.c linux-p1/mm/vmscan.c
--- orig/linux-p1/mm/vmscan.c Mon Jul 14 09:07:43 2003
+++ linux-p1/mm/vmscan.c Fri Jul 18 11:30:38 2003
@@ -921,7 +921,7 @@
if (i < ZONE_HIGHMEM) {
reclaim_state->reclaimed_slab = 0;
shrink_slab(max_scan + nr_mapped, GFP_KERNEL);
- to_free += reclaim_state->reclaimed_slab;
+ to_free -= reclaim_state->reclaimed_slab;
}
if (zone->all_unreclaimable)
continue;
@@ -930,7 +930,8 @@
}
if (all_zones_ok)
break;
- blk_congestion_wait(WRITE, HZ/10);
+ if (nr_pages == 0)
+ blk_congestion_wait(WRITE, HZ/10);
}
return nr_pages - to_free;
}

--
Peter Osterlund - [email protected]
http://w1.894.telia.com/~u89404340

2003-07-18 15:12:37

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Hi!

> > If this patch is an acceptable approach to fix the problem,
>
> Seems reasonable.
>
> > the balance_pgdat function should probably be cleaned up.
>
> Well it was rather bolted on the side of the kswapd code. But from an API
> perspective, being able to tell it how many page to free is a bit more
> flexible. Minor point.
>
> However I'm trying to remember why the code exists at all. Why doesn't
> swsusp just allocate lots of pages then free them again?

Because that either

a) does not free enough pages or

b) triggers OOM killer.

It was actually your idea, IIRC ;-).

Ahha, you seem to be addressing that in your code. Peter, perhaps you
want to test that one?
Pavel

> Something like:
>
> LIST_HEAD(list);
> int sleep_count = 0;
>
> while (sleep_count < 10) {
> page = __alloc_pages(0, GFP_ATOMIC);
> if (page) {
> list_add(&page->list, &list);
> } else {
> blk_congestion_wait(WRITE, HZ/20);
> sleep_count++;
> }
> }
> <free all the pages on `list'>

--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

2003-07-18 15:43:08

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Pavel Machek <[email protected]> writes:

> > However I'm trying to remember why the code exists at all. Why doesn't
> > swsusp just allocate lots of pages then free them again?
>
> Because that either
>
> a) does not free enough pages or
>
> b) triggers OOM killer.
>
> It was actually your idea, IIRC ;-).
>
> Ahha, you seem to be addressing that in your code. Peter, perhaps you
> want to test that one?

I tried the patch below, but it didn't work. Nothing (or very little)
was swapped out to disk. I also tried using GFP_KERNEL, but that
seemed to cause a deadlock. (Maybe it would have gone OOM if I had
waited long enough). I think the problem is that pdflush and friends
are already frozen when this code runs.

--- linux/kernel/suspend.c.old Fri Jul 18 15:46:48 2003
+++ linux/kernel/suspend.c Fri Jul 18 15:45:51 2003
@@ -621,10 +621,32 @@
*/
static void free_some_memory(void)
{
- printk("Freeing memory: ");
- while (shrink_all_memory(10000))
- printk(".");
+ LIST_HEAD(list);
+ struct page *page, *tmp;
+ int sleep_count = 0;
+ int i = 0;
+
+ while (sleep_count < 10) {
+ page = alloc_page(GFP_ATOMIC);
+ if (page) {
+ list_add(&page->list, &list);
+ sleep_count = 0;
+ } else {
+ blk_congestion_wait(WRITE, HZ/20);
+ sleep_count++;
+ }
+ i++;
+ if (!(i%1000))
+ printk(".");
+ }
printk("|\n");
+
+ i = 0;
+ list_for_each_entry_safe(page, tmp, &list, list) {
+ __free_page(page);
+ i++;
+ }
+ printk("%d pages freed\n", i);
}

/* Make disk drivers accept operations, again */

--
Peter Osterlund - [email protected]
http://w1.894.telia.com/~u89404340

2003-07-18 16:41:30

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Peter Osterlund <[email protected]> wrote:
>
> I tried the patch below, but it didn't work. Nothing (or very little)
> was swapped out to disk. I also tried using GFP_KERNEL, but that
> seemed to cause a deadlock. (Maybe it would have gone OOM if I had
> waited long enough). I think the problem is that pdflush and friends
> are already frozen when this code runs.

Oh, we shouldn't be doing this sort of thing when the kernel threads are
refrigerated. We do need kswapd services for the trick you tried.

And all flavours of ext3_writepage() can block on kjournald activity, so if
kjournald is refrigerated during the memory shrink the machine can deadlock.

It would be much better to freeze kernel threads _after_ doing the big
memory shrink.

2003-07-18 17:36:15

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Hi!

> > I tried the patch below, but it didn't work. Nothing (or very little)
> > was swapped out to disk. I also tried using GFP_KERNEL, but that
> > seemed to cause a deadlock. (Maybe it would have gone OOM if I had
> > waited long enough). I think the problem is that pdflush and friends
> > are already frozen when this code runs.
>
> Oh, we shouldn't be doing this sort of thing when the kernel threads are
> refrigerated. We do need kswapd services for the trick you tried.
>
> And all flavours of ext3_writepage() can block on kjournald activity, so if
> kjournald is refrigerated during the memory shrink the machine can deadlock.
>
> It would be much better to freeze kernel threads _after_ doing the big
> memory shrink.

I wanted to avoid that: we do want user threads refrigerated at that
point so that we know noone is allocating memory as we are trying to
do memory shrink. I'd like to avoid having refrigerator run in two
phases....

Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

2003-07-18 17:44:13

by Patrick Mochel

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

> I wanted to avoid that: we do want user threads refrigerated at that
> point so that we know noone is allocating memory as we are trying to
> do memory shrink. I'd like to avoid having refrigerator run in two
> phases....

But we should be the only process running, and we can guarantee that by
not sleeping and doing preempt_disable() when we begin. Especially
if we start the refrigeration sequence after we shrink memory. Right?

-pat

2003-07-18 17:50:14

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Hi!

> > I wanted to avoid that: we do want user threads refrigerated at that
> > point so that we know noone is allocating memory as we are trying to
> > do memory shrink. I'd like to avoid having refrigerator run in two
> > phases....
>
> But we should be the only process running, and we can guarantee that by
> not sleeping and doing preempt_disable() when we begin. Especially
> if we start the refrigeration sequence after we shrink
> memory. Right?

If we refrigerate after we shrink, userspace can allocate everything
just after shrink.
Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

2003-07-18 19:30:06

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Pavel Machek <[email protected]> wrote:
>
> > It would be much better to freeze kernel threads _after_ doing the big
> > memory shrink.
>
> I wanted to avoid that: we do want user threads refrigerated at that
> point so that we know noone is allocating memory as we are trying to
> do memory shrink.

freeze(threads which have task->mm)
shrink_memory();
freeze(threads which have !task->mm)

That's pretty simple.

> I'd like to avoid having refrigerator run in two phases....

The whole thing can deadlock without kernel thread services.

2003-07-18 19:44:28

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Andrew Morton <[email protected]> writes:

> Peter Osterlund <[email protected]> wrote:
> >
> > I tried the patch below, but it didn't work. Nothing (or very little)
> > was swapped out to disk. I also tried using GFP_KERNEL, but that
> > seemed to cause a deadlock. (Maybe it would have gone OOM if I had
> > waited long enough). I think the problem is that pdflush and friends
> > are already frozen when this code runs.
>
> Oh, we shouldn't be doing this sort of thing when the kernel threads are
> refrigerated. We do need kswapd services for the trick you tried.
>
> And all flavours of ext3_writepage() can block on kjournald activity, so if
> kjournald is refrigerated during the memory shrink the machine can deadlock.
>
> It would be much better to freeze kernel threads _after_ doing the big
> memory shrink.

The patch below works, but doesn't really solve anything, because it
is just as slow as the original code. This is not surprising because
it is still balance_pgdat() that does the page freeing, the only
difference being that it is now called from kswapd which is woken up
by the alloc_page() call.

Note that I had to use HZ/5 in free_some_memory() or else the loop
would terminate too early. The main problem seems to be that
balance_pgdat() is too slow when freeing memory. The function can fail
to free memory in the inner loop either because the disk is congested
or because too few pages are scanned, but in both cases the function
calls blk_congestion_wait(), and in the second case the disk may be
idle and then blk_congestion_wait() doesn't return until the timeout
expires.

diff -u -r -N linux-p0/fs/jbd/journal.c linux-p1/fs/jbd/journal.c
--- linux-p0/fs/jbd/journal.c Fri Jul 18 21:07:12 2003
+++ linux-p1/fs/jbd/journal.c Fri Jul 18 21:13:15 2003
@@ -132,6 +132,8 @@

daemonize("kjournald");

+ current->flags |= PF_IOTHREAD;
+
/* Set up an interval timer which can be used to trigger a
commit wakeup after the commit interval expires */
init_timer(&timer);
diff -u -r -N linux-p0/kernel/suspend.c linux-p1/kernel/suspend.c
--- linux-p0/kernel/suspend.c Fri Jul 18 21:08:45 2003
+++ linux-p1/kernel/suspend.c Fri Jul 18 21:13:18 2003
@@ -621,10 +621,32 @@
*/
static void free_some_memory(void)
{
- printk("Freeing memory: ");
- while (shrink_all_memory(10000))
- printk(".");
+ LIST_HEAD(list);
+ struct page *page, *tmp;
+ int sleep_count = 0;
+ int i = 0;
+
+ while (sleep_count < 10) {
+ page = alloc_page(GFP_ATOMIC);
+ if (page) {
+ list_add(&page->list, &list);
+ sleep_count = 0;
+ } else {
+ blk_congestion_wait(WRITE, HZ/5);
+ sleep_count++;
+ }
+ i++;
+ if (!(i%1000))
+ printk(".");
+ }
printk("|\n");
+
+ i = 0;
+ list_for_each_entry_safe(page, tmp, &list, list) {
+ __free_page(page);
+ i++;
+ }
+ printk("%d pages freed\n", i);
}

/* Make disk drivers accept operations, again */
diff -u -r -N linux-p0/mm/pdflush.c linux-p1/mm/pdflush.c
--- linux-p0/mm/pdflush.c Fri Jul 18 21:08:47 2003
+++ linux-p1/mm/pdflush.c Fri Jul 18 21:13:20 2003
@@ -88,7 +88,7 @@
{
daemonize("pdflush");

- current->flags |= PF_FLUSHER;
+ current->flags |= PF_FLUSHER | PF_IOTHREAD;
my_work->fn = NULL;
my_work->who = current;
INIT_LIST_HEAD(&my_work->list);
diff -u -r -N linux-p0/mm/vmscan.c linux-p1/mm/vmscan.c
--- linux-p0/mm/vmscan.c Fri Jul 18 21:08:47 2003
+++ linux-p1/mm/vmscan.c Fri Jul 18 21:13:22 2003
@@ -921,7 +921,7 @@
if (i < ZONE_HIGHMEM) {
reclaim_state->reclaimed_slab = 0;
shrink_slab(max_scan + nr_mapped, GFP_KERNEL);
- to_free += reclaim_state->reclaimed_slab;
+ to_free -= reclaim_state->reclaimed_slab;
}
if (zone->all_unreclaimable)
continue;
@@ -976,10 +976,11 @@
* us from recursively trying to free more memory as we're
* trying to free the first piece of memory in the first place).
*/
- tsk->flags |= PF_MEMALLOC|PF_KSWAPD;
+ tsk->flags |= PF_MEMALLOC|PF_KSWAPD|PF_IOTHREAD;

for ( ; ; ) {
struct page_state ps;
+ int freed;

if (current->flags & PF_FREEZE)
refrigerator(PF_IOTHREAD);
@@ -987,7 +988,8 @@
schedule();
finish_wait(&pgdat->kswapd_wait, &wait);
get_page_state(&ps);
- balance_pgdat(pgdat, 0, &ps);
+ freed = balance_pgdat(pgdat, 0, &ps);
+ printk("kswapd: freed %d pages\n", freed);
}
}

--
Peter Osterlund - [email protected]
http://w1.894.telia.com/~u89404340

2003-07-18 20:08:05

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Peter Osterlund <[email protected]> wrote:
>
> The patch below works, but doesn't really solve anything, because it
> is just as slow as the original code.

That's OK - it's a step in the right direction.

> This is not surprising because
> it is still balance_pgdat() that does the page freeing, the only
> difference being that it is now called from kswapd which is woken up
> by the alloc_page() call.

Yup. Probaby we should not be throttling kswapd if it is making good
progress, but this needs broad testing and thought. Something like this,
untested:

--- 25/mm/vmscan.c~less-kswapd-throttling Thu Jul 17 15:35:09 2003
+++ 25-akpm/mm/vmscan.c Thu Jul 17 15:35:12 2003
@@ -930,7 +930,8 @@ static int balance_pgdat(pg_data_t *pgda
}
if (all_zones_ok)
break;
- blk_congestion_wait(WRITE, HZ/10);
+ if (to_free)
+ blk_congestion_wait(WRITE, HZ/10);
}
return nr_pages - to_free;
}

> Note that I had to use HZ/5 in free_some_memory() or else the loop
> would terminate too early. The main problem seems to be that
> balance_pgdat() is too slow when freeing memory. The function can fail
> to free memory in the inner loop either because the disk is congested
> or because too few pages are scanned, but in both cases the function
> calls blk_congestion_wait(), and in the second case the disk may be
> idle and then blk_congestion_wait() doesn't return until the timeout
> expires.

hm, perhaps because kswapd is throttled rather than doing work.

Try the above change, see if you still need the extended timeout in
free_some_memory().

Also, play around with using __GFP_WAIT|__GFP_HIGH|__GFP_NORETRY rather
than GFP_ATOMIC.

2003-07-18 21:01:59

by Nigel Cunningham

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

On Sat, 2003-07-19 at 04:45, Andrew Morton wrote:
> Oh, we shouldn't be doing this sort of thing when the kernel threads are
> refrigerated. We do need kswapd services for the trick you tried.
>
> And all flavours of ext3_writepage() can block on kjournald activity, so if
> kjournald is refrigerated during the memory shrink the machine can deadlock.
>
> It would be much better to freeze kernel threads _after_ doing the big
> memory shrink.

Yes, that's what the 2.4 version does. And it freezes the kernel threads
in a particular order to avoid deadlocking.

Regards,

Nigel

--
Nigel Cunningham
495 St Georges Road South, Hastings 4201, New Zealand

You see, at just the right time, when we were still powerless,
Christ died for the ungodly.
-- Romans 5:6, NIV.

2003-07-18 21:07:02

by Nigel Cunningham

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Hi.

In 2.4, I ended up fusing the freezer and shrinking memory together. The
essence is:

freeze all other processes
while need to free memory
unfreeze processes
shrink memory until think we have enough
freeze processes

Regards,

Nigel

On Sat, 2003-07-19 at 06:04, Pavel Machek wrote:
> Hi!
>
> > > I wanted to avoid that: we do want user threads refrigerated at that
> > > point so that we know noone is allocating memory as we are trying to
> > > do memory shrink. I'd like to avoid having refrigerator run in two
> > > phases....
> >
> > But we should be the only process running, and we can guarantee that by
> > not sleeping and doing preempt_disable() when we begin. Especially
> > if we start the refrigeration sequence after we shrink
> > memory. Right?
>
> If we refrigerate after we shrink, userspace can allocate everything
> just after shrink.
> Pavel
--
Nigel Cunningham
495 St Georges Road South, Hastings 4201, New Zealand

You see, at just the right time, when we were still powerless,
Christ died for the ungodly.
-- Romans 5:6, NIV.

2003-07-18 21:59:59

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Hi!

> > The patch below works, but doesn't really solve anything, because it
> > is just as slow as the original code.
>
> That's OK - it's a step in the right direction.

Just beware: It is *not* okay (for production) to set kflushd as a
PF_IOTHREAD. Its okay for testing through.
Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

2003-07-20 00:11:44

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Andrew Morton <[email protected]> writes:

> Yup. Probaby we should not be throttling kswapd if it is making good
> progress, but this needs broad testing and thought. Something like this,
> untested:
>
> --- 25/mm/vmscan.c~less-kswapd-throttling Thu Jul 17 15:35:09 2003
> +++ 25-akpm/mm/vmscan.c Thu Jul 17 15:35:12 2003
> @@ -930,7 +930,8 @@ static int balance_pgdat(pg_data_t *pgda
> }
> if (all_zones_ok)
> break;
> - blk_congestion_wait(WRITE, HZ/10);
> + if (to_free)
> + blk_congestion_wait(WRITE, HZ/10);
> }
> return nr_pages - to_free;
> }
>
>
> > Note that I had to use HZ/5 in free_some_memory() or else the loop
> > would terminate too early. The main problem seems to be that
> > balance_pgdat() is too slow when freeing memory. The function can fail
> > to free memory in the inner loop either because the disk is congested
> > or because too few pages are scanned, but in both cases the function
> > calls blk_congestion_wait(), and in the second case the disk may be
> > idle and then blk_congestion_wait() doesn't return until the timeout
> > expires.
>
> hm, perhaps because kswapd is throttled rather than doing work.
>
> Try the above change, see if you still need the extended timeout in
> free_some_memory().

I have tried the change, but the writeout is still very slow. (Maybe
somewhat faster than the original code, but far from being limited by
disk bandwidth.)

It turns out the extended timeout in free_some_memory() is needed to
handle things like disk spin up time and thermal recalibration.
Ideally there should not be a timeout at all in that function. It
should wait until the congestion goes away or until all I/O is
finished. But that's probably much harder to implement. Is there some
global counter somewhere that can be used to figure out how many pages
are currently being on their way to the disk?

> Also, play around with using __GFP_WAIT|__GFP_HIGH|__GFP_NORETRY rather
> than GFP_ATOMIC.

I did, without much success.

One thing that does seem to work though, is to only throttle in
balance_pgdat() if we have called writepage() since the last
throttling, like in the patch below. (ugly, writepage_called should
not be a static variable.) With that patch I get full disk bandwidth
speed during page freeing.

Another disturbing thing I've seen is that on one of my machines, the
suspend process sometimes fails to write anything to disk. When this
happens, the suspend will fail if not enough memory is free, and the
machine hangs shortly afterwards. I've seen this with plain
2.6.0-test1 and with all patches I have tested, with and without
preempt. On another laptop, I have never seen this problem. (I'll try
to debug this further, but it is complicated by the fact that the
problematic laptop doesn't have a serial port.)

diff -u -r orig/linux-page/fs/jbd/journal.c linux-page/fs/jbd/journal.c
--- orig/linux-page/fs/jbd/journal.c Fri Jul 18 21:07:12 2003
+++ linux-page/fs/jbd/journal.c Sun Jul 20 01:08:56 2003
@@ -132,6 +132,8 @@

daemonize("kjournald");

+ current->flags |= PF_IOTHREAD;
+
/* Set up an interval timer which can be used to trigger a
commit wakeup after the commit interval expires */
init_timer(&timer);
diff -u -r orig/linux-page/kernel/suspend.c linux-page/kernel/suspend.c
--- orig/linux-page/kernel/suspend.c Sat Jul 19 21:07:14 2003
+++ linux-page/kernel/suspend.c Sun Jul 20 01:17:33 2003
@@ -623,10 +623,32 @@
*/
static void free_some_memory(void)
{
- printk("Freeing memory: ");
- while (shrink_all_memory(10000))
- printk(".");
+ LIST_HEAD(list);
+ struct page *page, *tmp;
+ int sleep_count = 0;
+ int i = 0;
+
+ while (sleep_count < 10) {
+ page = alloc_page(__GFP_HIGH | __GFP_NOWARN);
+ if (page) {
+ list_add(&page->list, &list);
+ sleep_count = 0;
+ } else {
+ blk_congestion_wait(WRITE, HZ/5);
+ sleep_count++;
+ }
+ i++;
+ if (!(i%1000))
+ printk(".");
+ }
printk("|\n");
+
+ i = 0;
+ list_for_each_entry_safe(page, tmp, &list, list) {
+ __free_page(page);
+ i++;
+ }
+ printk("%d pages freed\n", i);
}

/* Make disk drivers accept operations, again */
diff -u -r orig/linux-page/mm/pdflush.c linux-page/mm/pdflush.c
--- orig/linux-page/mm/pdflush.c Fri Jul 18 21:08:47 2003
+++ linux-page/mm/pdflush.c Sun Jul 20 01:08:56 2003
@@ -88,7 +88,7 @@
{
daemonize("pdflush");

- current->flags |= PF_FLUSHER;
+ current->flags |= PF_FLUSHER | PF_IOTHREAD;
my_work->fn = NULL;
my_work->who = current;
INIT_LIST_HEAD(&my_work->list);
diff -u -r orig/linux-page/mm/vmscan.c linux-page/mm/vmscan.c
--- orig/linux-page/mm/vmscan.c Fri Jul 18 21:08:47 2003
+++ linux-page/mm/vmscan.c Sun Jul 20 01:40:24 2003
@@ -49,6 +49,8 @@
int vm_swappiness = 60;
static long total_memory;

+static int writepage_called;
+
#ifdef ARCH_HAS_PREFETCH
#define prefetch_prev_lru_page(_page, _base, _field) \
do { \
@@ -339,6 +341,7 @@

SetPageReclaim(page);
res = mapping->a_ops->writepage(page, &wbc);
+ writepage_called = 1;

if (res == WRITEPAGE_ACTIVATE) {
ClearPageReclaim(page);
@@ -894,6 +897,8 @@
for (priority = DEF_PRIORITY; priority; priority--) {
int all_zones_ok = 1;

+ writepage_called = 0;
+
for (i = 0; i < pgdat->nr_zones; i++) {
struct zone *zone = pgdat->node_zones + i;
int nr_mapped = 0;
@@ -921,7 +926,7 @@
if (i < ZONE_HIGHMEM) {
reclaim_state->reclaimed_slab = 0;
shrink_slab(max_scan + nr_mapped, GFP_KERNEL);
- to_free += reclaim_state->reclaimed_slab;
+ to_free -= reclaim_state->reclaimed_slab;
}
if (zone->all_unreclaimable)
continue;
@@ -930,7 +935,9 @@
}
if (all_zones_ok)
break;
- blk_congestion_wait(WRITE, HZ/10);
+// if (to_free)
+ if (writepage_called)
+ blk_congestion_wait(WRITE, HZ/10);
}
return nr_pages - to_free;
}
@@ -976,10 +983,11 @@
* us from recursively trying to free more memory as we're
* trying to free the first piece of memory in the first place).
*/
- tsk->flags |= PF_MEMALLOC|PF_KSWAPD;
+ tsk->flags |= PF_MEMALLOC|PF_KSWAPD|PF_IOTHREAD;

for ( ; ; ) {
struct page_state ps;
+ int freed;

if (current->flags & PF_FREEZE)
refrigerator(PF_IOTHREAD);
@@ -987,7 +995,8 @@
schedule();
finish_wait(&pgdat->kswapd_wait, &wait);
get_page_state(&ps);
- balance_pgdat(pgdat, 0, &ps);
+ freed = balance_pgdat(pgdat, 0, &ps);
+// printk("kswapd: freed %d pages\n", freed);
}
}

--
Peter Osterlund - [email protected]
http://w1.894.telia.com/~u89404340

2003-07-20 00:45:14

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Peter Osterlund <[email protected]> wrote:
>
> I have tried the change, but the writeout is still very slow. (Maybe
> somewhat faster than the original code, but far from being limited by
> disk bandwidth.)

Did you fix swsusp to leave kswapd unrefrigerated during the shrink? If
not, the change wouldn't make any difference.

2003-07-20 07:31:26

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Andrew Morton <[email protected]> writes:

> Peter Osterlund <[email protected]> wrote:
> >
> > I have tried the change, but the writeout is still very slow. (Maybe
> > somewhat faster than the original code, but far from being limited by
> > disk bandwidth.)
>
> Did you fix swsusp to leave kswapd unrefrigerated during the shrink? If
> not, the change wouldn't make any difference.

Yes, handled by this part of the patch:

@@ -976,10 +983,11 @@
* us from recursively trying to free more memory as we're
* trying to free the first piece of memory in the first place).
*/
- tsk->flags |= PF_MEMALLOC|PF_KSWAPD;
+ tsk->flags |= PF_MEMALLOC|PF_KSWAPD|PF_IOTHREAD;

Without that change, nothing got swapped to disk. It looks like
__alloc_pages(GFP_ATOMIC,...) only wakes up the kswapd threads. Is the
pdflush threads needed during memory freeing? My patch leaves them
unrefrigerated too, but Pavel said that wasn't safe for some reason.

--
Peter Osterlund - [email protected]
http://w1.894.telia.com/~u89404340

2003-07-21 09:47:12

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Pavel Machek <[email protected]> writes:

> > I have done some testing of the software suspend function in
> > 2.6.0-test1. It works mostly very well, but I have found two problems.
> >
> > The first problem is that software suspend fails if a process is
> > stopped before you invoke suspend. (For example, by starting cat from
> > the shell and pressing ctrl-z.) When the processes are woken up again,
> > the cat process is stuck in the schedule loop in refrigerator(),
> > sucking up all available cpu time.
>
> Thanks for a report. If it came with a patch it would be better
> ;-). I'll take a look.

Here is a patch to make software suspend correctly handle stopped
processes. It should apply cleanly on top of the patch you sent to
Linus to decouple SOFTWARE_SUSPEND and ACPI_SLEEP.

diff -u -r orig/linux-stopped/kernel/suspend.c linux-stopped/kernel/suspend.c
--- orig/linux-stopped/kernel/suspend.c Sun Jul 20 11:14:58 2003
+++ linux-stopped/kernel/suspend.c Mon Jul 21 11:51:51 2003
@@ -164,14 +164,18 @@
* Refrigerator and related stuff
*/

-#define INTERESTING(p) \
- /* We don't want to touch kernel_threads..*/ \
- if (p->flags & PF_IOTHREAD) \
- continue; \
- if (p == current) \
- continue; \
- if (p->state == TASK_ZOMBIE) \
- continue;
+/* 0 = Ignore this process when freezing/thawing, 1 = freeze/thaw this process */
+static inline int interesting_process(struct task_struct *p)
+{
+ if (p->flags & PF_IOTHREAD)
+ return 0;
+ if (p == current)
+ return 0;
+ if ((p->state == TASK_ZOMBIE) || (p->state == TASK_DEAD))
+ return 0;
+
+ return 1;
+}

#define TIMEOUT (6 * HZ) /* Timeout for stopping processes */

@@ -214,9 +218,12 @@
read_lock(&tasklist_lock);
do_each_thread(g, p) {
unsigned long flags;
- INTERESTING(p);
+ if (!interesting_process(p))
+ continue;
if (p->flags & PF_FROZEN)
continue;
+ if (p->state == TASK_STOPPED)
+ continue;

/* FIXME: smp problem here: we may not access other process' flags
without locking */
@@ -247,12 +254,15 @@
printk( "Restarting tasks..." );
read_lock(&tasklist_lock);
do_each_thread(g, p) {
- INTERESTING(p);
-
- if (p->flags & PF_FROZEN) p->flags &= ~PF_FROZEN;
- else
- printk(KERN_INFO " Strange, %s not stopped\n", p->comm );
- wake_up_process(p);
+ if (!interesting_process(p))
+ continue;
+
+ p->flags &= ~PF_FREEZE;
+ if (p->flags & PF_FROZEN) {
+ p->flags &= ~PF_FROZEN;
+ wake_up_process(p);
+ } else
+ PRINTK(KERN_INFO " Strange, %s not frozen\n", p->comm );
} while_each_thread(g, p);

read_unlock(&tasklist_lock);

--
Peter Osterlund - [email protected]
http://w1.894.telia.com/~u89404340

2003-07-21 12:47:12

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Hi!

> > Thanks for a report. If it came with a patch it would be better
> > ;-). I'll take a look.
>
> Here is a patch to make software suspend correctly handle stopped
> processes. It should apply cleanly on top of the patch you sent to
> Linus to decouple SOFTWARE_SUSPEND and ACPI_SLEEP.
>
> @@ -164,14 +164,18 @@
> * Refrigerator and related stuff
> */
>
> -#define INTERESTING(p) > - /* We don't want to touch kernel_threads..*/ > - if (p->flags & PF_IOTHREAD) > - continue; > - if (p == current) > - continue; > - if (p->state == TASK_ZOMBIE) > - continue;
> +/* 0 = Ignore this process when freezing/thawing, 1 = freeze/thaw this process */
> +static inline int interesting_process(struct task_struct *p)
> +{
> + if (p->flags & PF_IOTHREAD)
> + return 0;
> + if (p == current)
> + return 0;
> + if ((p->state == TASK_ZOMBIE) || (p->state == TASK_DEAD))
> + return 0;
> +
> + return 1;
> +}
>

Simple cleanup, okay.

> @@ -214,9 +218,12 @@
> read_lock(&tasklist_lock);
> do_each_thread(g, p) {
> unsigned long flags;
> - INTERESTING(p);
> + if (!interesting_process(p))
> + continue;
> if (p->flags & PF_FROZEN)
> continue;
> + if (p->state == TASK_STOPPED)
> + continue;
>
> /* FIXME: smp problem here: we may not access other process' flags
> without locking */

No need to handle TASK_STOPPED tasks, as they are "frozen" already. Ok.

> read_lock(&tasklist_lock);
> do_each_thread(g, p) {
> - INTERESTING(p);
> -
> - if (p->flags & PF_FROZEN) p->flags &= ~PF_FROZEN;
> - else
> - printk(KERN_INFO " Strange, %s not stopped\n", p->comm );
> - wake_up_process(p);
> + if (!interesting_process(p))
> + continue;
> +
> + p->flags &= ~PF_FREEZE;
> + if (p->flags & PF_FROZEN) {
> + p->flags &= ~PF_FROZEN;
> + wake_up_process(p);
> + } else
> +
But why do you touch PF_FROZEN here? Refrigerator should do that.
And wake_up_process should not be needed...
If it is in refrigerator, it polls PF_FREEZE...

--
Pavel
Written on sharp zaurus, because my Velo1 broke. If you have Velo you don't need...

2003-07-21 14:23:49

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Pavel Machek <[email protected]> writes:

> > @@ -214,9 +218,12 @@
> > read_lock(&tasklist_lock);
> > do_each_thread(g, p) {
> > unsigned long flags;
> > - INTERESTING(p);
> > + if (!interesting_process(p))
> > + continue;
> > if (p->flags & PF_FROZEN)
> > continue;
> > + if (p->state == TASK_STOPPED)
> > + continue;
> >
> > /* FIXME: smp problem here: we may not access other process' flags
> > without locking */
>
> No need to handle TASK_STOPPED tasks, as they are "frozen" already. Ok.

If the process was stopped before swsusp was invoked, the process will
never reach refrigerator() and therefore never set the PF_FROZEN
flag. Without the test for TASK_STOPPED, swsusp gives up and claims
that it can't stop all processes.

> > read_lock(&tasklist_lock);
> > do_each_thread(g, p) {
> > - INTERESTING(p);
> > -
> > - if (p->flags & PF_FROZEN) p->flags &= ~PF_FROZEN;
> > - else
> > - printk(KERN_INFO " Strange, %s not stopped\n", p->comm );
> > - wake_up_process(p);
> > + if (!interesting_process(p))
> > + continue;
> > +
> > + p->flags &= ~PF_FREEZE;
> > + if (p->flags & PF_FROZEN) {
> > + p->flags &= ~PF_FROZEN;
> > + wake_up_process(p);
> > + } else
> > +
> But why do you touch PF_FROZEN here? Refrigerator should do that.
> And wake_up_process should not be needed...
> If it is in refrigerator, it polls PF_FREEZE...

Note that the old code always called wake_up_process(), which is
necessary to make the process run one more iteration in refrigerator()
and relize that it is time to unfreeze.

The patch changes things so that wake_up_process() is NOT called if
the process is stopped at some other place than in refrigerator().
This ensures that processes that were stopped before we invoked swsusp
are still stopped after resume.

I manually clear PF_FREEZE here in an attempt to handle a race
condition, but I realize I need to understand more of the scheduler
and signal code before I know for sure if this is necessary and/or
sufficient.

--
Peter Osterlund - [email protected]
http://w1.894.telia.com/~u89404340

2003-07-21 21:14:02

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Hi!

> > But why do you touch PF_FROZEN here? Refrigerator should do that.
> > And wake_up_process should not be needed...
> > If it is in refrigerator, it polls PF_FREEZE...
>
> Note that the old code always called wake_up_process(), which is
> necessary to make the process run one more iteration in refrigerator()
> and relize that it is time to unfreeze.
>
> The patch changes things so that wake_up_process() is NOT called if
> the process is stopped at some other place than in refrigerator().
> This ensures that processes that were stopped before we invoked swsusp
> are still stopped after resume.

Yes, but you still print warning for them. I hopefully killed that.

> I manually clear PF_FREEZE here in an attempt to handle a race
> condition, but I realize I need to understand more of the scheduler
> and signal code before I know for sure if this is necessary and/or
> sufficient.

I do not see the race so I killed that... Here's what I applied.

Pavel

--- /usr/src/tmp/linux/kernel/suspend.c 2003-07-21 22:15:39.000000000 +0200
+++ /usr/src/linux/kernel/suspend.c 2003-07-21 22:11:28.000000000 +0200
@@ -5,7 +5,7 @@
* machine suspend feature using pretty near only high-level routines
*
* Copyright (C) 1998-2001 Gabor Kuti <[email protected]>
- * Copyright (C) 1998,2001,2002 Pavel Machek <[email protected]>
+ * Copyright (C) 1998,2001-2003 Pavel Machek <[email protected]>
*
* I'd like to thank the following people for their work:
*
@@ -84,7 +84,6 @@
#define ADDRESS2(x) __ADDRESS(__pa(x)) /* Needed for x86-64 where some pages are in memory twice */

/* References to section boundaries */
-extern char _text, _etext, _edata, __bss_start, _end;
extern char __nosave_begin, __nosave_end;

extern int is_head_of_free_region(struct page *);
@@ -164,14 +163,21 @@
* Refrigerator and related stuff
*/

-#define INTERESTING(p) \
- /* We don't want to touch kernel_threads..*/ \
- if (p->flags & PF_IOTHREAD) \
- continue; \
- if (p == current) \
- continue; \
- if (p->state == TASK_ZOMBIE) \
- continue;
+/* 0 = Ignore this process when freezing/thawing, 1 = freeze/thaw this process */
+static inline int interesting_process(struct task_struct *p)
+{
+ if (p->flags & PF_IOTHREAD)
+ return 0;
+ if (p == current)
+ return 0;
+ if ((p->state == TASK_ZOMBIE) || (p->state == TASK_DEAD))
+ return 0;
+ if (p->state == TASK_STOPPED)
+ return 0;
+
+
+ return 1;
+}

#define TIMEOUT (6 * HZ) /* Timeout for stopping processes */

@@ -214,7 +220,8 @@
read_lock(&tasklist_lock);
do_each_thread(g, p) {
unsigned long flags;
- INTERESTING(p);
+ if (!interesting_process(p))
+ continue;
if (p->flags & PF_FROZEN)
continue;

@@ -247,13 +254,14 @@
printk( "Restarting tasks..." );
read_lock(&tasklist_lock);
do_each_thread(g, p) {
- INTERESTING(p);
-
- if (p->flags & PF_FROZEN)
- p->flags &= ~PF_FROZEN;
- else
- printk(KERN_INFO " Strange, %s not stopped\n", p->comm );
- wake_up_process(p);
+ if (!interesting_process(p))
+ continue;
+
+ if (p->flags & PF_FROZEN) {
+ p->flags &= ~PF_FROZEN;
+ wake_up_process(p);
+ } else
+ PRINTK(KERN_ERR " Strange, %s not frozen\n", p->comm );
} while_each_thread(g, p);

read_unlock(&tasklist_lock);

--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

2003-07-21 23:36:16

[permalink] [raw]

Subject: Re: Software suspend testing in 2.6.0-test1

Hi!

Pavel Machek <[email protected]> writes:

> > > But why do you touch PF_FROZEN here? Refrigerator should do that.
> > > And wake_up_process should not be needed...
> > > If it is in refrigerator, it polls PF_FREEZE...
> >
> > Note that the old code always called wake_up_process(), which is
> > necessary to make the process run one more iteration in refrigerator()
> > and relize that it is time to unfreeze.
> >
> > The patch changes things so that wake_up_process() is NOT called if
> > the process is stopped at some other place than in refrigerator().
> > This ensures that processes that were stopped before we invoked swsusp
> > are still stopped after resume.
>
> Yes, but you still print warning for them. I hopefully killed that.
...
> +static inline int interesting_process(struct task_struct *p)
> +{
> + if (p->flags & PF_IOTHREAD)
> + return 0;
> + if (p == current)
> + return 0;
> + if ((p->state == TASK_ZOMBIE) || (p->state == TASK_DEAD))
> + return 0;
> + if (p->state == TASK_STOPPED)
> + return 0;
> +
> +
> + return 1;
> +}

But this doesn't work. We can't skip stopped tasks in the
thaw_processes() function, because frozen tasks are also stopped
tasks. Therefore nothing will be woken up during resume.

It's probably best to just delete the printk("strange,...") line.

--
Peter Osterlund - [email protected]
http://w1.894.telia.com/~u89404340

2003-07-22 10:49:48