2001-10-06 08:06:47

by Simon Kirby

[permalink] [raw]
Subject: 2.4.11pre4 swapping out all over the place

2.4.11pre4 is swapping out on me while burning a CD at 4x.
2.4.11pre2 (or maybe it was pre1) seemed to be a lot better.
2.4.10pre10 still seems to be the best for me so far...

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]


2001-10-06 08:59:47

by Tobias Ringstrom

[permalink] [raw]
Subject: Re: 2.4.11pre4 swapping out all over the place

On Sat, 6 Oct 2001, Simon Kirby wrote:

> 2.4.11pre4 is swapping out on me while burning a CD at 4x.
> 2.4.11pre2 (or maybe it was pre1) seemed to be a lot better.
> 2.4.10pre10 still seems to be the best for me so far...

I can confirm that in 2.4.11-pre4, the used-once pages are causing
page-out activity, as opposed to 2.4.11-pre2 which did not. Streaming i/o
performce is down, and so is the interactive responsiveness (a lot). To
reproduce, run

dd if=/dev/hde1 of=/dev/null bs=4k

This should not produce paging. The block size is not important.

/Tobias

2001-10-06 16:10:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.4.11pre4 swapping out all over the place


On Sat, 6 Oct 2001, Tobias Ringstrom wrote:
>
> I can confirm that in 2.4.11-pre4, the used-once pages are causing
> page-out activity, as opposed to 2.4.11-pre2 which did not.

Yeah, try_to_free_pages() gives up much too early: Marcelo removed the
loop from it. That also shows the problems with the out_of_memory() logic
much more easily.

Can you try just undoing the changes to try_to_free_pages() itself?

Linus

2001-10-06 17:50:08

by Tobias Ringstrom

[permalink] [raw]
Subject: Re: 2.4.11pre4 swapping out all over the place

On Sat, 6 Oct 2001, Linus Torvalds wrote:

> On Sat, 6 Oct 2001, Tobias Ringstrom wrote:
> >
> > I can confirm that in 2.4.11-pre4, the used-once pages are causing
> > page-out activity, as opposed to 2.4.11-pre2 which did not.
>
> Yeah, try_to_free_pages() gives up much too early: Marcelo removed the
> loop from it. That also shows the problems with the out_of_memory() logic
> much more easily.
>
> Can you try just undoing the changes to try_to_free_pages() itself?

Sure, replacing try_to_free_pages() in 2.4.11-pre4 with the one in
2.4.11-pre3 solves the problem.

/Tobias

2001-10-06 21:59:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.4.11pre4 swapping out all over the place


On Sat, 6 Oct 2001, Tobias Ringstrom wrote:
>
> Sure, replacing try_to_free_pages() in 2.4.11-pre4 with the one in
> 2.4.11-pre3 solves the problem.

Ok, can you try this slightly more involved patch instead? It basically
keeps the old try_to_free_pages() (it _looks_ different, but the logic is
the same), but also should honour the OOM-killer.

Linus

-----
diff -u --recursive --new-file pre4/linux/mm/oom_kill.c linux/mm/oom_kill.c
--- pre4/linux/mm/oom_kill.c Thu Oct 4 19:52:11 2001
+++ linux/mm/oom_kill.c Fri Oct 5 13:13:43 2001
@@ -241,13 +241,12 @@
return 0;

/*
- * If the buffer and page cache (excluding swap cache) are over
+ * If the buffer and page cache (including swap cache) are over
* their (/proc tunable) minimum, we're still not OOM. We test
* this to make sure we don't return OOM when the system simply
* has a hard time with the cache.
*/
cache_mem = atomic_read(&page_cache_size);
- cache_mem -= swapper_space.nrpages;
limit = 2;
limit *= num_physpages / 100;

diff -u --recursive --new-file pre4/linux/mm/page_alloc.c linux/mm/page_alloc.c
--- pre4/linux/mm/page_alloc.c Thu Oct 4 19:52:11 2001
+++ linux/mm/page_alloc.c Sat Oct 6 14:54:59 2001
@@ -357,6 +357,7 @@

/* here we're in the low on memory slow path */

+rebalance:
if (current->flags & PF_MEMALLOC) {
zone = zonelist->zones;
for (;;) {
@@ -371,48 +372,28 @@
return NULL;
}

- rebalance:
page = balance_classzone(classzone, gfp_mask, order, &freed);
if (page)
return page;

zone = zonelist->zones;
- if (likely(freed)) {
- for (;;) {
- zone_t *z = *(zone++);
- if (!z)
- break;
+ for (;;) {
+ zone_t *z = *(zone++);
+ if (!z)
+ break;

- if (zone_free_pages(z, order) > z->pages_min) {
- page = rmqueue(z, order);
- if (page)
- return page;
- }
- }
- goto rebalance;
- } else {
- /*
- * Check that no other task is been killed meanwhile,
- * in such a case we can succeed the allocation.
- */
- for (;;) {
- zone_t *z = *(zone++);
- if (!z)
- break;
-
- if (zone_free_pages(z, order) > z->pages_min) {
- page = rmqueue(z, order);
- if (page)
- return page;
- }
+ if (zone_free_pages(z, order) > z->pages_min) {
+ page = rmqueue(z, order);
+ if (page)
+ return page;
}
-
- goto rebalance;
}

- printk(KERN_NOTICE "__alloc_pages: %u-order allocation failed (gfp=0x%x/%i) from %p\n",
- order, gfp_mask, !!(current->flags & PF_MEMALLOC), __builtin_return_address(0));
- return NULL;
+ /* Yield for kswapd, and try again */
+ current->policy |= SCHED_YIELD;
+ __set_current_state(TASK_RUNNING);
+ schedule();
+ goto rebalance;
}

/*
diff -u --recursive --new-file pre4/linux/mm/vmscan.c linux/mm/vmscan.c
--- pre4/linux/mm/vmscan.c Thu Oct 4 19:52:11 2001
+++ linux/mm/vmscan.c Sat Oct 6 14:54:59 2001
@@ -553,14 +556,16 @@
int try_to_free_pages(zone_t * classzone, unsigned int gfp_mask, unsigned int order)
{
int ret = 0;
+ int priority = DEF_PRIORITY;
int nr_pages = SWAP_CLUSTER_MAX;

- nr_pages = shrink_caches(DEF_PRIORITY, classzone, gfp_mask, nr_pages);
+ do {
+ nr_pages = shrink_caches(priority, classzone, gfp_mask, nr_pages);
+ if (nr_pages <= 0)
+ return 1;

- if (nr_pages < SWAP_CLUSTER_MAX)
- ret |= 1;
-
- ret |= swap_out(DEF_PRIORITY, classzone, gfp_mask, SWAP_CLUSTER_MAX << 2);
+ ret |= swap_out(priority, classzone, gfp_mask, SWAP_CLUSTER_MAX << 2);
+ } while (--priority);

return ret;
}

2001-10-06 22:53:26

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.11pre4 swapping out all over the place

On Sat, Oct 06, 2001 at 01:06:56AM -0700, Simon Kirby wrote:
> 2.4.10pre10 still seems to be the best for me so far...

Can you test with -aa too? The VM in -aa isn't exactly the same as in
mainline.

Andrea

2001-10-07 04:59:13

by Simon Kirby

[permalink] [raw]
Subject: Re: 2.4.11pre4 swapping out all over the place

On Sat, Oct 06, 2001 at 02:59:04PM -0700, Linus Torvalds wrote:

> On Sat, 6 Oct 2001, Tobias Ringstrom wrote:
> >
> > Sure, replacing try_to_free_pages() in 2.4.11-pre4 with the one in
> > 2.4.11-pre3 solves the problem.
>
> Ok, can you try this slightly more involved patch instead? It basically
> keeps the old try_to_free_pages() (it _looks_ different, but the logic is
> the same), but also should honour the OOM-killer.

Hmm...Was this supposed to apply against pre4? The second hunk of
page_alloc.c fails for me, and I just tried again with a fresh tree.

Ah, it applied with patch -l. Whitespace problems?

Simon-

> diff -u --recursive --new-file pre4/linux/mm/oom_kill.c linux/mm/oom_kill.c
> --- pre4/linux/mm/oom_kill.c Thu Oct 4 19:52:11 2001
> +++ linux/mm/oom_kill.c Fri Oct 5 13:13:43 2001
> @@ -241,13 +241,12 @@
> return 0;
>
> /*
> - * If the buffer and page cache (excluding swap cache) are over
> + * If the buffer and page cache (including swap cache) are over
> * their (/proc tunable) minimum, we're still not OOM. We test
> * this to make sure we don't return OOM when the system simply
> * has a hard time with the cache.
> */
> cache_mem = atomic_read(&page_cache_size);
> - cache_mem -= swapper_space.nrpages;
> limit = 2;
> limit *= num_physpages / 100;
>
> diff -u --recursive --new-file pre4/linux/mm/page_alloc.c linux/mm/page_alloc.c
> --- pre4/linux/mm/page_alloc.c Thu Oct 4 19:52:11 2001
> +++ linux/mm/page_alloc.c Sat Oct 6 14:54:59 2001
> @@ -357,6 +357,7 @@
>
> /* here we're in the low on memory slow path */
>
> +rebalance:
> if (current->flags & PF_MEMALLOC) {
> zone = zonelist->zones;
> for (;;) {
> @@ -371,48 +372,28 @@
> return NULL;
> }
>
> - rebalance:
> page = balance_classzone(classzone, gfp_mask, order, &freed);
> if (page)
> return page;
>
> zone = zonelist->zones;
> - if (likely(freed)) {
> - for (;;) {
> - zone_t *z = *(zone++);
> - if (!z)
> - break;
> + for (;;) {
> + zone_t *z = *(zone++);
> + if (!z)
> + break;
>
> - if (zone_free_pages(z, order) > z->pages_min) {
> - page = rmqueue(z, order);
> - if (page)
> - return page;
> - }
> - }
> - goto rebalance;
> - } else {
> - /*
> - * Check that no other task is been killed meanwhile,
> - * in such a case we can succeed the allocation.
> - */
> - for (;;) {
> - zone_t *z = *(zone++);
> - if (!z)
> - break;
> -
> - if (zone_free_pages(z, order) > z->pages_min) {
> - page = rmqueue(z, order);
> - if (page)
> - return page;
> - }
> + if (zone_free_pages(z, order) > z->pages_min) {
> + page = rmqueue(z, order);
> + if (page)
> + return page;
> }
> -
> - goto rebalance;
> }
>
> - printk(KERN_NOTICE "__alloc_pages: %u-order allocation failed (gfp=0x%x/%i) from %p\n",
> - order, gfp_mask, !!(current->flags & PF_MEMALLOC), __builtin_return_address(0));
> - return NULL;
> + /* Yield for kswapd, and try again */
> + current->policy |= SCHED_YIELD;
> + __set_current_state(TASK_RUNNING);
> + schedule();
> + goto rebalance;
> }
>
> /*
> diff -u --recursive --new-file pre4/linux/mm/vmscan.c linux/mm/vmscan.c
> --- pre4/linux/mm/vmscan.c Thu Oct 4 19:52:11 2001
> +++ linux/mm/vmscan.c Sat Oct 6 14:54:59 2001
> @@ -553,14 +556,16 @@
> int try_to_free_pages(zone_t * classzone, unsigned int gfp_mask, unsigned int order)
> {
> int ret = 0;
> + int priority = DEF_PRIORITY;
> int nr_pages = SWAP_CLUSTER_MAX;
>
> - nr_pages = shrink_caches(DEF_PRIORITY, classzone, gfp_mask, nr_pages);
> + do {
> + nr_pages = shrink_caches(priority, classzone, gfp_mask, nr_pages);
> + if (nr_pages <= 0)
> + return 1;
>
> - if (nr_pages < SWAP_CLUSTER_MAX)
> - ret |= 1;
> -
> - ret |= swap_out(DEF_PRIORITY, classzone, gfp_mask, SWAP_CLUSTER_MAX << 2);
> + ret |= swap_out(priority, classzone, gfp_mask, SWAP_CLUSTER_MAX << 2);
> + } while (--priority);
>
> return ret;
> }
>

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]

2001-10-07 04:57:13

by David Miller

[permalink] [raw]
Subject: Re: 2.4.11pre4 swapping out all over the place

From: Linus Torvalds <[email protected]>
Date: Sat, 6 Oct 2001 14:59:04 -0700 (PDT)

Ok, can you try this slightly more involved patch instead?
...

--- pre4/linux/mm/oom_kill.c Thu Oct 4 19:52:11 2001
+++ linux/mm/oom_kill.c Fri Oct 5 13:13:43 2001
@@ -241,13 +241,12 @@

I think you hand edited this patch a little too much this
time :-) Or something else went wrong, as I had to apply
this by hand myself as patch complained about the first chunk
being malformed.

Franks a lot,
David S. Miller
[email protected]

2001-10-07 07:43:11

by Tobias Ringstrom

[permalink] [raw]
Subject: Re: 2.4.11pre4 swapping out all over the place

On Sat, 6 Oct 2001, Linus Torvalds wrote:
>
> Ok, can you try this slightly more involved patch instead? It basically
> keeps the old try_to_free_pages() (it _looks_ different, but the logic is
> the same), but also should honour the OOM-killer.

Yes, this patch also solves the problem.

I just noticed that when reading from an umounted block device, the pages
are not cached between runs, i.e. the cache is dropped on close(). If the
block device contains a mounted filesystem, the pages are not dropped.
Is this intentional?

I was also thinking about Simon's CD-burning case, and the fact that the
used-once logic really does not work very well for such cases. You
usually first run mkisofs to create the image, and then read the image
when writing the CD. This is similar to running

dd if=/dev/zero of=/tmp/cd bs=1M count=300
dd if=/tmp/cd of=/dev/null

In this case, the pages are activated. That is not too bad, since the
system now seems to be able to free even active cache pages before paging
out stuff. (BTW, does it always free all cache before paging out?
That would most likely be very bad for many scenarios.)

So, for the CD-burning case, a used-twice algorithm would probably perform
better. Or perhaps the pages should be activated after having been _read_
more than once, not counting the writes. I wish I had the time to try
this out... :-(

This should only matter if the file is smaller than the amount of
available RAM, which is not too common for CD images.

/Tobias

2001-10-07 17:40:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.4.11pre4 swapping out all over the place


On Sun, 7 Oct 2001, Tobias Ringstrom wrote:
>
> On Sat, 6 Oct 2001, Linus Torvalds wrote:
> >
> > Ok, can you try this slightly more involved patch instead? It basically
> > keeps the old try_to_free_pages() (it _looks_ different, but the logic is
> > the same), but also should honour the OOM-killer.
>
> Yes, this patch also solves the problem.

Good.

> I just noticed that when reading from an umounted block device, the pages
> are not cached between runs, i.e. the cache is dropped on close(). If the
> block device contains a mounted filesystem, the pages are not dropped.
> Is this intentional?

It's intentional, although something that can probably be discussed. The
reasons for it are:
- devices with broken or unreliable disk change mechanisms
- the current dynamic [de-]allocation of block device data structures.
- historical coherency reasons.

None of the reasons for it are very strong - the block device data
structure one is a _current_ implementation detail that has a lot of
reasons going for it, but it's not something that is inherent in any real
major design (we could reasonably easily delay the block device data
structure release for some time).

> I was also thinking about Simon's CD-burning case, and the fact that the
> used-once logic really does not work very well for such cases. You
> usually first run mkisofs to create the image, and then read the image
> when writing the CD. This is similar to running
>
> dd if=/dev/zero of=/tmp/cd bs=1M count=300
> dd if=/tmp/cd of=/dev/null

Well, I actually champion not considering writes accesses _at_all_, but I
was overruled by Marcelo Tosatti. However, I bet that a good example would
change his mind - and a benchmark.

This is easy to test: remove the "SetPageReferenced(page);" from the
generic_file_write() function (note how the current code is a mix between
my "writes aren't references" and Marcelo's "writes are accesses like
reads" - it only marks a page referenced, it never actually activates it).

Are there other valid points in this discussion? I'd love to have a strong
reason for doing what we're doing (which we don't have right now).

> In this case, the pages are activated. That is not too bad, since the
> system now seems to be able to free even active cache pages before paging
> out stuff. (BTW, does it always free all cache before paging out?
> That would most likely be very bad for many scenarios.)

No, activating the pages only makes them slightly harder to get rid of,
they don't become "pinned".

Linus

2001-10-07 17:56:21

by Alexander Viro

[permalink] [raw]
Subject: Re: 2.4.11pre4 swapping out all over the place



On Sun, 7 Oct 2001, Linus Torvalds wrote:

> > I just noticed that when reading from an umounted block device, the pages
> > are not cached between runs, i.e. the cache is dropped on close(). If the
> > block device contains a mounted filesystem, the pages are not dropped.
> > Is this intentional?

Actually, they are dropped as soon as there's no openers left. mounted
fs counts as opener, so does opened file.

> It's intentional, although something that can probably be discussed. The
> reasons for it are:
> - devices with broken or unreliable disk change mechanisms
> - the current dynamic [de-]allocation of block device data structures.
> - historical coherency reasons.

- current logics for driver use count. Block ones do MOD_INC_US_COUNT
on->open() and MOD_DEC_USE_COUNT on ->release().