2005-12-01 10:12:25

by Wu Fengguang

[permalink] [raw]
Subject: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

The zone aging rates are currently imbalanced; the gap can be as large as 3
times, which can severely damage read-ahead requests and shorten their
effective lifetime.

This patch adds three variables in struct zone
- aging_total
- aging_milestone
- page_age
to keep track of page aging rate, and keep it in sync on page reclaim time.

The aging_total is just a per-zone counterpart to the per-cpu
pgscan_{kswapd,direct}_{zone name}. But it is not directly comparable between
zones, so aging_milestone/page_age are maintained on top of aging_total.

The page_age is a normalized value that can be directly compared between zones
with the helper macro pages_more_aged(). The goal of the balancing logic is to
keep this normalized value in sync between zones.

One can check the balanced aging progress by running:
tar c / | cat > /dev/null &
watch -n1 'grep "age " /proc/zoneinfo'

Signed-off-by: Wu Fengguang <[email protected]>
---

include/linux/mmzone.h |   14 ++++++++++++++
mm/page_alloc.c        |   11 +++++++++++
mm/vmscan.c            |   39 +++++++++++++++++++++++++++++++++++++++
3 files changed, 64 insertions(+)

--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -149,6 +149,20 @@ struct zone {
unsigned long pages_scanned; /* since last reclaim */
int all_unreclaimable; /* All pages pinned */

+ /* Fields for balanced page aging:
+ * aging_total - The accumulated number of activities that may
+ * cause page aging, that is, make some pages closer
+ * to the tail of inactive_list.
+ * aging_milestone - A snapshot of aging_total taken every time a
+ * full inactive_list worth of pages becomes aged.
+ * page_age - A normalized value showing the percentage of
+ * pages that have been aged. It is compared between
+ * zones to balance the rate of page aging.
+ */
+ unsigned long aging_total;
+ unsigned long aging_milestone;
+ unsigned long page_age;
+
/*
* Does the allocator try to reclaim pages from the zone as soon
* as it fails a watermark_ok() in __alloc_pages?
--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -123,6 +123,44 @@ static long total_memory;
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);

+#ifdef CONFIG_HIGHMEM64G
+#define PAGE_AGE_SHIFT 8
+#elif BITS_PER_LONG == 32
+#define PAGE_AGE_SHIFT 12
+#elif BITS_PER_LONG == 64
+#define PAGE_AGE_SHIFT 20
+#else
+#error unknown BITS_PER_LONG
+#endif
+#define PAGE_AGE_MASK ((1 << PAGE_AGE_SHIFT) - 1)
+
+/*
+ * The simplified code is: (a->page_age > b->page_age)
+ * The complexity deals with the wrap-around problem.
+ * Two page ages not close enough should also be ignored:
+ * they are out of sync and the comparison may be nonsense.
+ */
+#define pages_more_aged(a, b) \
+ ((((b)->page_age - (a)->page_age) & PAGE_AGE_MASK) > \
+ PAGE_AGE_MASK - (1 << (PAGE_AGE_SHIFT - 3)))
+
+/*
+ * Keep track of how far the cold pages have been scanned / aged.
+ * It is not literally a percentage, but a high-resolution normalized value.
+ */
+static inline void update_zone_age(struct zone *z, int nr_scan)
+{
+ unsigned long len = z->nr_inactive | 1;
+
+ z->aging_total += nr_scan;
+
+ if (z->aging_total - z->aging_milestone > len)
+ z->aging_milestone += len;
+
+ z->page_age = ((z->aging_total - z->aging_milestone)
+ << PAGE_AGE_SHIFT) / len;
+}
+
/*
* Add a shrinker callback to be called from the vm
*/
@@ -888,6 +926,7 @@ static void shrink_cache(struct zone *zo
&page_list, &nr_scan);
zone->nr_inactive -= nr_taken;
zone->pages_scanned += nr_scan;
+ update_zone_age(zone, nr_scan);
spin_unlock_irq(&zone->lru_lock);

if (nr_taken == 0)
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1521,6 +1521,8 @@ void show_free_areas(void)
" active:%lukB"
" inactive:%lukB"
" present:%lukB"
+ " aging:%lukB"
+ " age:%lu"
" pages_scanned:%lu"
" all_unreclaimable? %s"
"\n",
@@ -1532,6 +1534,8 @@ void show_free_areas(void)
K(zone->nr_active),
K(zone->nr_inactive),
K(zone->present_pages),
+ K(zone->aging_total),
+ zone->page_age,
zone->pages_scanned,
(zone->all_unreclaimable ? "yes" : "no")
);
@@ -2145,6 +2149,9 @@ static void __init free_area_init_core(s
zone->nr_scan_inactive = 0;
zone->nr_active = 0;
zone->nr_inactive = 0;
+ zone->aging_total = 0;
+ zone->aging_milestone = 0;
+ zone->page_age = 0;
atomic_set(&zone->reclaim_in_progress, 0);
if (!size)
continue;
@@ -2293,6 +2300,8 @@ static int zoneinfo_show(struct seq_file
"\n high %lu"
"\n active %lu"
"\n inactive %lu"
+ "\n aging %lu"
+ "\n age %lu"
"\n scanned %lu (a: %lu i: %lu)"
"\n spanned %lu"
"\n present %lu",
@@ -2302,6 +2311,8 @@ static int zoneinfo_show(struct seq_file
zone->pages_high,
zone->nr_active,
zone->nr_inactive,
+ zone->aging_total,
+ zone->page_age,
zone->pages_scanned,
zone->nr_scan_active, zone->nr_scan_inactive,
zone->spanned_pages,

--


2005-12-01 10:37:52

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Wu Fengguang <[email protected]> wrote:
>
> The zone aging rates are currently imbalanced,

ZONE_DMA is out of whack. It shouldn't be, and I'm not aware of anyone
getting in and working out why. I certainly wouldn't want to go and add
all this stuff without having a good understanding of _why_ it's out of
whack. Perhaps it's just some silly bug, like the thing I pointed at in
the previous email.

> the gap can be as large as 3 times,

What's the testcase?

> which can severely damage read-ahead requests and shorten their
> effective life time.

Have you any performance numbers for this?

2005-12-01 12:03:57

by Wu Fengguang

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Thu, Dec 01, 2005 at 02:37:14AM -0800, Andrew Morton wrote:
> Wu Fengguang <[email protected]> wrote:
> >
> > The zone aging rates are currently imbalanced,
>
> ZONE_DMA is out of whack. It shouldn't be, and I'm not aware of anyone
> getting in and working out why. I certainly wouldn't want to go and add
> all this stuff without having a good understanding of _why_ it's out of
> whack. Perhaps it's just some silly bug, like the thing I pointed at in
> the previous email.

Yep, my rule is that if the DMA zone is ever reclaimed for watermark, it will
run wild ;) So I leave it out by setting classzone_idx=0, and let the
age balancing code catch it up. This scheme works fine: tested OK from
64M to 2G of memory.

> > the gap can be as large as 3 times,
>
> What's the testcase?
>
> > which can severely damage read-ahead requests and shorten their
> > effective life time.
>
> Have you any performance numbers for this?

That was months ago; if I remember right, the number of concurrent readers the
adaptive read-ahead code could handle without much thrashing was raised from ~100
to 800 by the balancing work.

This is my original announce back then:

The page aging problem showed up when I was testing many slow reads with
limited memory. Pages in the DMA zone were found to be aged about 3
times faster than those in the Normal zone on systems with 64-512M of memory.
That is a BIG threat to the read-ahead pages. So I added some code to
keep the aging rates synchronized. You can see the effect by running:
$ tar c / | cat > /dev/null &
$ watch -n1 'grep "age " /proc/zoneinfo'
There are still some extra DMA scans in the direct page reclaim path.
They tend to happen on large memory systems and are therefore not a big
problem.

And here is some numbers collected by Magnus Damm:
http://lkml.org/lkml/2005/10/25/50

Good night!
Wu

2005-12-01 22:40:50

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Hi Andrew,

On Thu, Dec 01, 2005 at 02:37:14AM -0800, Andrew Morton wrote:
> Wu Fengguang <[email protected]> wrote:
> >
> > The zone aging rates are currently imbalanced,
>
> ZONE_DMA is out of whack. It shouldn't be, and I'm not aware of anyone
> getting in and working out why. I certainly wouldn't want to go and add
> all this stuff without having a good understanding of _why_ it's out of
> whack. Perhaps it's just some silly bug, like the thing I pointed at in
> the previous email.

I think that the problem is caused by the interaction between
the way reclaiming is quantified and parallel allocators.

The zones have different sizes, and each zone reclaim iteration
scans the same number of pages. It is unfair.

On top of that, kswapd is likely to block while doing its job,
which means that allocators have a chance to run.

It seems that scaling the number of isolated pages to zone
size fixes the imbalance, making the Normal zone
_more_ scanned than DMA. Which is expected, since the
lower zone protection logic decreases allocation pressure
on DMA, sending it straight to the Normal zone (therefore
zeroing lower_zone_protection should make the scanning
proportionally equal).

Test used was FFSB using the following profile on a 128MB
P-3 machine:

num_filesystems=1
num_threadgroups=1
directio=0
time=300

[filesystem0]
location=/mnt/hda4/
num_files=20
num_dirs=10
max_filesize=91534338
min_filesize=65535
[end0]

[threadgroup0]
num_threads=10
write_size=2816
write_blocksize=4096
read_size=2816
read_blocksize=4096
create_weight=100
write_weight=30
read_weight=100
[end0]

And the zone sizes are:

Normal: 114688kB
DMA: 16384kB

Normal/DMA ratio = 114688 / 16384 = 7.000

******* 2.6.14 vanilla ********

* kswapd scanning rates
pgscan_kswapd_normal 450483
pgscan_kswapd_dma 84645
pgscan_kswapd Normal/DMA = (450483 / 84645) = 5.322

* direct scanning rates
pgscan_direct_normal 23826
pgscan_direct_dma 4224
pgscan_direct Normal/DMA = (23826 / 4224) = 5.640

* global (kswapd+direct) scanning rates
pgscan_normal = (450483 + 23826) = 474309
pgscan_dma = (84645 + 4224) = 88869
pgscan Normal/DMA = (474309 / 88869) = 5.337

pgalloc_normal = 794293
pgalloc_dma = 123805
pgalloc_normal_dma_ratio = (794293/123805) = 6.415

****** 2.6.14 isolate relative *****

* kswapd scanning rates
pgscan_kswapd_normal 664883
pgscan_kswapd_dma 82845
pgscan_kswapd Normal/DMA (664883/82845) = 8.025

* direct scanning rates
pgscan_direct_normal 13485
pgscan_direct_dma 1745
pgscan_direct Normal/DMA = (13485/1745) = 7.727

* global (kswapd+direct) scanning rates
pgscan_normal = (664883 + 13485) = 678368
pgscan_dma = (82845 + 1745) = 84590
pgscan Normal/DMA = (678368 / 84590) = 8.019

pgalloc_normal 699927
pgalloc_dma 66313
pgalloc_normal_dma_ratio = (699927/66313) = 10.554


I think it becomes pretty clear that this is really
the case. On vanilla, for each DMA allocation, there
are 6.415 NORMAL allocations, while the NORMAL zone
is 7.000 times the size of DMA.

With the patch, there are 10.5 NORMAL allocations for each
DMA one.

Reclaim batching is affected by the relative
isolation (batches are now smaller), though.

--- mm/vmscan.c.orig 2006-01-01 12:44:39.000000000 -0200
+++ mm/vmscan.c 2006-01-01 16:43:54.000000000 -0200
@@ -616,8 +616,12 @@
{
LIST_HEAD(page_list);
struct pagevec pvec;
+ int nr_to_isolate;
int max_scan = sc->nr_to_scan;

+ nr_to_isolate = (sc->swap_cluster_max * zone->present_pages)
+ / total_memory;
+
pagevec_init(&pvec, 1);

lru_add_drain();
@@ -628,7 +632,8 @@
int nr_scan;
int nr_freed;

- nr_taken = isolate_lru_pages(sc->swap_cluster_max,
+
+ nr_taken = isolate_lru_pages(nr_to_isolate,
&zone->inactive_list,
&page_list, &nr_scan);
zone->nr_inactive -= nr_taken;



2005-12-01 23:04:26

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Marcelo Tosatti <[email protected]> wrote:
>
> Hi Andrew,
>
> On Thu, Dec 01, 2005 at 02:37:14AM -0800, Andrew Morton wrote:
> > Wu Fengguang <[email protected]> wrote:
> > >
> > > The zone aging rates are currently imbalanced,
> >
> > ZONE_DMA is out of whack. It shouldn't be, and I'm not aware of anyone
> > getting in and working out why. I certainly wouldn't want to go and add
> > all this stuff without having a good understanding of _why_ it's out of
> > whack. Perhaps it's just some silly bug, like the thing I pointed at in
> > the previous email.
>
> I think that the problem is caused by the interaction between
> the way reclaiming is quantified and parallel allocators.

Could be. But what about the bug which I think is there? That'll cause
overscanning of the DMA zone.

> The zones have different sizes, and each zone reclaim iteration
> scans the same number of pages. It is unfair.

Nope. See how shrink_zone() bases nr_active and nr_inactive on
zone->nr_active and zone->nr_inactive. These calculations are intended to
cause the number of scanned pages in each zone to be

(zone->nr_active + zone->nr_inactive) >> sc->priority.

> On top of that, kswapd is likely to block while doing its job,
> which means that allocators have a chance to run.

kswapd should only block under rare circumstances - huge amounts of dirty
pages coming off the tail of the LRU.

> --- mm/vmscan.c.orig 2006-01-01 12:44:39.000000000 -0200
> +++ mm/vmscan.c 2006-01-01 16:43:54.000000000 -0200
> @@ -616,8 +616,12 @@
> {

Please use `diff -p'.

2005-12-02 01:19:30

by Wu Fengguang

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Thu, Dec 01, 2005 at 03:03:49PM -0800, Andrew Morton wrote:
> > > ZONE_DMA is out of whack. It shouldn't be, and I'm not aware of anyone
> > > getting in and working out why. I certainly wouldn't want to go and add
> > > all this stuff without having a good understanding of _why_ it's out of
> > > whack. Perhaps it's just some silly bug, like the thing I pointed at in
> > > the previous email.
> >
> > I think that the problem is caused by the interaction between
> > the way reclaiming is quantified and parallel allocators.
>
> Could be. But what about the bug which I think is there? That'll cause
> overscanning of the DMA zone.

Take for example these numbers:
--------------------------------------------------------------------------------
active/inactive sizes on 2.6.14-1-k7-smp:
dma 43/1000 = 116 / 2645
normal 819/1000 = 54023 / 65881

active/inactive scan rates:
dma 480/1000 = 31364 / (58377 + 6963)
normal 985/1000 = 719219 / (645051 + 84579)
high 0/1000 = 0 / (0 + 0)

total used free shared buffers cached
Mem: 503 497 6 0 0 328
-/+ buffers/cache: 168 335
Swap: 127 2 125
--------------------------------------------------------------------------------

cold-page-scan-rate = K * (direct-reclaim-count * direct-scan-prob +
kswapd-reclaim-count * kswapd-scan-prob) * shrink-zone-prob

(direct-reclaim-count : kswapd-reclaim-count) depends on memory pressure.
Here it is
DMA: 8.4 = 58377 / 6963
Normal: 7.6 = 645051 / 84579

(direct-scan-prob) is roughly equal for all zones.
(kswapd-scan-prob) is expected to be equal too.

So the equation can be simplified to:
cold-page-scan-rate ~= C * shrink-zone-prob

It depends largely on the shrink_zone() function:

843 zone->nr_scan_inactive += (zone->nr_inactive >> sc->priority) + 1;
844 nr_inactive = zone->nr_scan_inactive;
845 if (nr_inactive >= sc->swap_cluster_max)
846 zone->nr_scan_inactive = 0;
847 else
848 nr_inactive = 0;
849
850 sc->nr_to_reclaim = sc->swap_cluster_max;
851
852 while (nr_active || nr_inactive) {
//...
860 if (nr_inactive) {
861 sc->nr_to_scan = min(nr_inactive,
862 (unsigned long)sc->swap_cluster_max);
863 nr_inactive -= sc->nr_to_scan;
864 shrink_cache(zone, sc);
865 if (sc->nr_to_reclaim <= 0)
866 break;
867 }
868 }

Line 843 is the core of the scan balancing logic:

priority 12 11 10

On each call nr_scan_inactive is increased by:
DMA(2k pages) +1 +2 +3
Normal(64k pages) +17 +33 +65

Rounding up to SWAP_CLUSTER_MAX=32, we get (scan batches / accumulation rounds):
DMA 1/32 1/16 2/11
Normal 2/2 2/1 3/1
DMA:Normal ratio 1:32 1:32 2:33

This keeps the scan rate roughly balanced (i.e. 1:32) under low VM pressure.

But lines 865-866 together with line 846 make most shrink_zone() invocations
only run one batch of scan. The numbers become:

DMA 1/32 1/16 1/11
Normal 1/2 1/1 1/1
DMA:Normal ratio 1:16 1:16 1:11

Now, relative to the zone sizes, DMA pages end up scanned somewhere between 2 and 3 times faster than Normal pages!

Another problem is that the equation in line 843 is quite coarse: zones of
64k and 127k pages can produce the same per-call increment, leading to a
large variance.

Thanks,
Wu

2005-12-02 01:31:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Wu Fengguang <[email protected]> wrote:
>
> 850 sc->nr_to_reclaim = sc->swap_cluster_max;
> 851
> 852 while (nr_active || nr_inactive) {
> //...
> 860 if (nr_inactive) {
> 861 sc->nr_to_scan = min(nr_inactive,
> 862 (unsigned long)sc->swap_cluster_max);
> 863 nr_inactive -= sc->nr_to_scan;
> 864 shrink_cache(zone, sc);
> 865 if (sc->nr_to_reclaim <= 0)
> 866 break;
> 867 }
> 868 }
>
> Line 843 is the core of the scan balancing logic:
>
> priority 12 11 10
>
> On each call nr_scan_inactive is increased by:
> DMA(2k pages) +1 +2 +3
> Normal(64k pages) +17 +33 +65
>
> Round it up to SWAP_CLUSTER_MAX=32, we get (scan batches/accumulate rounds):
> DMA 1/32 1/16 2/11
> Normal 2/2 2/1 3/1
> DMA:Normal ratio 1:32 1:32 2:33
>
> This keeps the scan rate roughly balanced(i.e. 1:32) in low vm pressure.
>
> But lines 865-866 together with line 846 make most shrink_zone() invocations
> only run one batch of scan. The numbers become:

True. Need to go into a huddle with the changelogs, but I have a feeling
that lines 865 and 866 aren't very important. What happens if we remove
them?

2005-12-02 02:03:45

by Wu Fengguang

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Thu, Dec 01, 2005 at 05:30:15PM -0800, Andrew Morton wrote:
> > But lines 865-866 together with line 846 make most shrink_zone() invocations
> > only run one batch of scan. The numbers become:
>
> True. Need to go into a huddle with the changelogs, but I have a feeling
> that lines 865 and 866 aren't very important. What happens if we remove
> them?

Maybe the question is: can we accept freeing 15M of memory at one time for a 64G zone?
(Or can we simply increase DEF_PRIORITY?)

btw, maybe it's time to lower the low_mem_reserve.
There should be no need to keep ~50M free memory with the balancing patch.

Regards,
Wu

2005-12-02 02:18:14

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Fri, Dec 02, 2005 at 10:04:07AM +0800, Wu Fengguang wrote:
> btw, maybe it's time to lower the low_mem_reserve.
> There should be no need to keep ~50M free memory with the balancing patch.

low_mem_reserve is independent from shrink_cache, because shrink_cache can't
free unfreeable pinned memory.

If you want to remove low_mem_reserve you'd better start by adding
migration of memory across the zones with pte updates etc... That would
at least mitigate the effect of anonymous memory w/o swap. But
low_mem_reserve is still needed for all other kind of allocations like
kmalloc or pci_alloc_consistent (i.e. not relocatable) etc...

2005-12-02 02:27:12

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Wu Fengguang wrote:
> On Thu, Dec 01, 2005 at 05:30:15PM -0800, Andrew Morton wrote:
>
>>> But lines 865-866 together with line 846 make most shrink_zone() invocations
>>> only run one batch of scan. The numbers become:
>>
>>True. Need to go into a huddle with the changelogs, but I have a feeling
>>that lines 865 and 866 aren't very important. What happens if we remove
>>them?
>
>
> Maybe the question is: can we accept freeing 15M of memory at one time for a 64G zone?
> (Or can we simply increase DEF_PRIORITY?)
>

0.02% of the memory? Why not? I think you should be more worried
about what happens when the priority winds up.

I think your proposal to synch reclaim rates between zones is fine
when all pages have similar properties, but could behave strangely
when you do have different requirements on different zones.

> btw, maybe it's time to lower the low_mem_reserve.
> There should be no need to keep ~50M free memory with the balancing patch.
>

min_free_kbytes? This number really isn't anything to do with balancing
and more to do with the amount of reserve kept for things like GFP_ATOMIC
and recursive allocations. Let's not lower it ;)

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-12-02 02:36:47

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Fri, Dec 02, 2005 at 01:27:06PM +1100, Nick Piggin wrote:
> min_free_kbytes? This number really isn't anything to do with balancing
> and more to do with the amount of reserve kept for things like GFP_ATOMIC
> and recursive allocations. Let's not lower it ;)

Agreed. Or at the very least that should be discussed in a separate
thread, it has no relation with shrink_cache changes or anything else
related to zone aging IMHO.

2005-12-02 02:36:58

by Wu Fengguang

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Fri, Dec 02, 2005 at 03:18:11AM +0100, Andrea Arcangeli wrote:
> On Fri, Dec 02, 2005 at 10:04:07AM +0800, Wu Fengguang wrote:
> > btw, maybe it's time to lower the low_mem_reserve.
> > There should be no need to keep ~50M free memory with the balancing patch.
>
> low_mem_reserve is indipendent from shrink_cache, because shrink_cache can't
> free unfreeable pinned memory.
>
> If you want to remove low_mem_reserve you'd better start by adding
> migration of memory across the zones with pte updates etc... That would
> at least mitigate the effect of anonymous memory w/o swap. But
> low_mem_reserve is still needed for all other kind of allocations like
> kmalloc or pci_alloc_consistent (i.e. not relocatable) etc...

Thanks for the clarification, I was worrying too much ;)

Regards,
Wu

2005-12-02 02:43:36

by Wu Fengguang

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Fri, Dec 02, 2005 at 01:27:06PM +1100, Nick Piggin wrote:
> Wu Fengguang wrote:
> >On Thu, Dec 01, 2005 at 05:30:15PM -0800, Andrew Morton wrote:
> >
> >>>But lines 865-866 together with line 846 make most shrink_zone()
> >>>invocations
> >>>only run one batch of scan. The numbers become:
> >>
> >>True. Need to go into a huddle with the changelogs, but I have a feeling
> >>that lines 865 and 866 aren't very important. What happens if we remove
> >>them?
> >
> >
> >Maybe the question is: can we accept freeing 15M of memory at one time
> >for a 64G zone?
> >(Or can we simply increase DEF_PRIORITY?)
> >
>
> 0.02% of the memory? Why not? I think you should be more worried
> about what happens when the priority winds up.

Yes, sounds reasonable.

> I think your proposal to synch reclaim rates between zones is fine
> when all pages have similar properties, but could behave strangely
> when you do have different requirements on different zones.

Thanks.
That requirement might be addressed by disabling the feature on specific zones,
or giving them a shrinker.seeks-like ratio, or something else...

> >btw, maybe it's time to lower the low_mem_reserve.
> >There should be no need to keep ~50M free memory with the balancing patch.
> >
>
> min_free_kbytes? This number really isn't anything to do with balancing
> and more to do with the amount of reserve kept for things like GFP_ATOMIC
> and recursive allocations. Let's not lower it ;)

ok :)

Regards,
Wu

2005-12-02 02:52:43

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Fri, Dec 02, 2005 at 10:37:27AM +0800, Wu Fengguang wrote:
> Thanks for the clarification, I was worrying too much ;)

You're welcome. I'm also not concerned, because the cost is linear with
the amount of memory (and the cost has a high bound, namely the size
of the lower zones, so it's not like struct page, which costs a
percentage of RAM guaranteed to be lost), so it's generally not
noticeable at runtime, and it matters most on big systems (where
in turn the cost is higher).

2005-12-02 03:05:24

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging


On Thu, Dec 01, 2005 at 03:03:49PM -0800, Andrew Morton wrote:
> Marcelo Tosatti <[email protected]> wrote:
> >
> > Hi Andrew,
> >
> > On Thu, Dec 01, 2005 at 02:37:14AM -0800, Andrew Morton wrote:
> > > Wu Fengguang <[email protected]> wrote:
> > > >
> > > > The zone aging rates are currently imbalanced,
> > >
> > > ZONE_DMA is out of whack. It shouldn't be, and I'm not aware of anyone
> > > getting in and working out why. I certainly wouldn't want to go and add
> > > all this stuff without having a good understanding of _why_ it's out of
> > > whack. Perhaps it's just some silly bug, like the thing I pointed at in
> > > the previous email.
> >
> > I think that the problem is caused by the interaction between
> > the way reclaiming is quantified and parallel allocators.
>
> Could be. But what about the bug which I think is there? That'll cause
> overscanning of the DMA zone.

There were about 12MB of inactive pages in the DMA zone. Your hypothesis
was that there were no LRU pages left to scan in the DMA zone?

> > The zones have different sizes, and each zone reclaim iteration
> > scans the same number of pages. It is unfair.
>
> Nope. See how shrink_zone() bases nr_active and nr_inactive on
> zone->nr_active and zone_nr_inactive. These calculations are intended to
> cause the number of scanned pages in each zone to be
>
> (zone->nr-active + zone->nr_inactive) >> sc->priority.

True... Well, I don't know, then.

> > On top of that, kswapd is likely to block while doing its job,
> > which means that allocators have a chance to run.
>
> kswapd should only block under rare circumstances - huge amounts of dirty
> pages coming off the tail of the LRU.

Alright. I don't know - what could be the problem, then?

> > --- mm/vmscan.c.orig 2006-01-01 12:44:39.000000000 -0200
> > +++ mm/vmscan.c 2006-01-01 16:43:54.000000000 -0200
> > @@ -616,8 +616,12 @@
> > {
>
> Please use `diff -p'.

2005-12-02 03:41:23

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Marcelo Tosatti <[email protected]> wrote:
>
> > Could be. But what about the bug which I think is there? That'll cause
> > overscanning of the DMA zone.
>
> There were about 12MB of inactive pages in the DMA zone. Your hypothesis
> was that there were no LRU pages left to scan in the DMA zone?

No, my hypothesis was that balance_pgdat() had a bug. Looking at it again,
I don't see it any more..

2005-12-02 04:46:31

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Andrea Arcangeli <[email protected]> wrote:
>
> low_mem_reserve

I've a suspicion that the addition of the dma32 zone might have
broken this.

2005-12-02 05:50:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Wu Fengguang <[email protected]> wrote:
>
> 865 if (sc->nr_to_reclaim <= 0)
> 866 break;
> 867 }
> 868 }
>
> Line 843 is the core of the scan balancing logic:
>
> priority 12 11 10
>
> On each call nr_scan_inactive is increased by:
> DMA(2k pages) +1 +2 +3
> Normal(64k pages) +17 +33 +65
>
> Round it up to SWAP_CLUSTER_MAX=32, we get (scan batches/accumulate rounds):
> DMA 1/32 1/16 2/11
> Normal 2/2 2/1 3/1
> DMA:Normal ratio 1:32 1:32 2:33
>
> This keeps the scan rate roughly balanced(i.e. 1:32) in low vm pressure.
>
> But lines 865-866 together with line 846 make most shrink_zone() invocations
> only run one batch of scan.

Yes, this seems to be the problem. Sigh. By the time 2.6.8 came around I
just didn't have time to do the amount of testing which any page reclaim
tweak necessitates.



From: Andrew Morton <[email protected]>

Revert a patch which went into 2.6.8-rc1. The changelog for that patch was:

The shrink_zone() logic can, under some circumstances, cause far too many
pages to be reclaimed. Say, we're scanning at high priority and suddenly
hit a large number of reclaimable pages on the LRU.

Change things so we bale out when SWAP_CLUSTER_MAX pages have been
reclaimed.

Problem is, this change caused significant imbalance in inter-zone scan
balancing by truncating scans of larger zones.

Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL. The zone
balancing algorithm would require that if we're scanning 100 pages of
ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL. But this logic will
cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are
reclaimed. Thus effectively causing smaller zones to be scanned relatively
harder than large ones.

Now I need to remember what the workload was which caused me to write this
patch originally, then fix it up in a different way...

Signed-off-by: Andrew Morton <[email protected]>
---

mm/vmscan.c | 8 --------
1 files changed, 8 deletions(-)

diff -puN mm/vmscan.c~vmscan-balancing-fix mm/vmscan.c
--- devel/mm/vmscan.c~vmscan-balancing-fix 2005-12-01 21:20:44.000000000 -0800
+++ devel-akpm/mm/vmscan.c 2005-12-01 21:21:38.000000000 -0800
@@ -63,9 +63,6 @@ struct scan_control {

unsigned long nr_mapped; /* From page_state */

- /* How many pages shrink_cache() should reclaim */
- int nr_to_reclaim;
-
/* Ask shrink_caches, or shrink_zone to scan at this priority */
unsigned int priority;

@@ -901,7 +898,6 @@ static void shrink_cache(struct zone *zo
if (current_is_kswapd())
mod_page_state(kswapd_steal, nr_freed);
mod_page_state_zone(zone, pgsteal, nr_freed);
- sc->nr_to_reclaim -= nr_freed;

spin_lock_irq(&zone->lru_lock);
/*
@@ -1101,8 +1097,6 @@ shrink_zone(struct zone *zone, struct sc
else
nr_inactive = 0;

- sc->nr_to_reclaim = sc->swap_cluster_max;
-
while (nr_active || nr_inactive) {
if (nr_active) {
sc->nr_to_scan = min(nr_active,
@@ -1116,8 +1110,6 @@ shrink_zone(struct zone *zone, struct sc
(unsigned long)sc->swap_cluster_max);
nr_inactive -= sc->nr_to_scan;
shrink_cache(zone, sc);
- if (sc->nr_to_reclaim <= 0)
- break;
}
}

_

2005-12-02 06:37:15

by Wu Fengguang

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Thu, Dec 01, 2005 at 08:45:49PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > low_mem_reserve
>
> I've a suspicion that the addition of the dma32 zone might have
> broken this.

And there is a danger when the last zone is not the largest zone; that breaks my
assumption. Either we should remove these two lines in shrink_zone():

> 865 if (sc->nr_to_reclaim <= 0)
> 866 break;

Or explicitly add more weight to the balancing efforts with
mm-add-weight-to-reclaim-for-aging.patch below.

Thanks,
Wu

Subject: mm: add more weight to reclaim for aging
Cc: Marcelo Tosatti <[email protected]>, Magnus Damm <[email protected]>
Cc: Nick Piggin <[email protected]>, Andrea Arcangeli <[email protected]>

Let HighMem be the last zone; in normal cases we get:
- the HighMem zone is the largest zone
- the HighMem zone is mainly reclaimed for watermark; other zones are almost
always reclaimed for aging
- while HighMem is reclaimed N times for watermark, other zones have N+1
chances to reclaim for aging
- shrink_zone() only scans one chunk of SWAP_CLUSTER_MAX pages to get
SWAP_CLUSTER_MAX free pages

In the above situation, the force of balancing will win out over the force of
unbalancing. But if HighMem (or the last zone) is not the largest zone, the
other, larger zones can no longer catch up.

This patch multiplies the force of balancing by 8 times, which should be more
than enough. It just prevents shrink_zone() from returning prematurely, and
will not cause the DMA zone to be scanned by more than SWAP_CLUSTER_MAX pages
at one time in normal cases.

Signed-off-by: Wu Fengguang <[email protected]>
--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1453,12 +1453,14 @@ loop_again:
SC_RECLAIM_FROM_KSWAPD|
SC_RECLAIM_FOR_WATERMARK);
all_zones_ok = 0;
+ sc.nr_to_reclaim = SWAP_CLUSTER_MAX;
} else if (zone == youngest_zone &&
pages_more_aged(oldest_zone,
youngest_zone)) {
debug_reclaim(&sc,
SC_RECLAIM_FROM_KSWAPD|
SC_RECLAIM_FOR_AGING);
+ sc.nr_to_reclaim = SWAP_CLUSTER_MAX * 8;
} else
continue;
}

2005-12-02 07:17:00

by Wu Fengguang

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Thu, Dec 01, 2005 at 09:49:31PM -0800, Andrew Morton wrote:
> From: Andrew Morton <[email protected]>
>
> Revert a patch which went into 2.6.8-rc1. The changelog for that patch was:
>
> The shrink_zone() logic can, under some circumstances, cause far too many
> pages to be reclaimed. Say, we're scanning at high priority and suddenly
> hit a large number of reclaimable pages on the LRU.
>
> Change things so we bale out when SWAP_CLUSTER_MAX pages have been
> reclaimed.
>
> Problem is, this change caused significant imbalance in inter-zone scan
> balancing by truncating scans of larger zones.
>
> Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL. The zone
> balancing algorithm would require that if we're scanning 100 pages of
> ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL. But this logic will
> cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are
> reclaimed. Thus effectively causing smaller zones to be scanned relatively
> harder than large ones.
>
> Now I need to remember what the workload was which caused me to write this
> patch originally, then fix it up in a different way...

Maybe it's a situation like this:

__|____|________|________________|________________________________|________________________________________________________________|________________________________________________________________________________________________________________________________|________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
_: pinned chunk
-: reclaimable chunk
|: shrink_zone() invocation

First we run into a large range of pinned chunks, which lowers the scan
priority. And then there are plenty of reclaimable chunks, bomb...

Thanks,
Wu

2005-12-02 07:27:37

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Wu Fengguang <[email protected]> wrote:
>
> First we run into a large range of pinned chunks, which lowered the scan
> priority. And then there are plenty of reclaimable chunks, bomb...

It doesn't have to be that complex - the unreclaimable pages could be
referenced, or under writeback or even simply dirty.

2005-12-02 15:16:24

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Thu, Dec 01, 2005 at 09:49:31PM -0800, Andrew Morton wrote:
> Wu Fengguang <[email protected]> wrote:
> >
> > 865 if (sc->nr_to_reclaim <= 0)
> > 866 break;
> > 867 }
> > 868 }
> >
> > Line 843 is the core of the scan balancing logic:
> >
> > priority 12 11 10
> >
> > On each call nr_scan_inactive is increased by:
> > DMA(2k pages) +1 +2 +3
> > Normal(64k pages) +17 +33 +65
> >
> > Round it up to SWAP_CLUSTER_MAX=32, we get (scan batches/accumulate rounds):
> > DMA 1/32 1/16 2/11
> > Normal 2/2 2/1 3/1
> > DMA:Normal ratio 1:32 1:32 2:33
> >
> > This keeps the scan rate roughly balanced(i.e. 1:32) in low vm pressure.
> >
> > But lines 865-866 together with line 846 make most shrink_zone() invocations
> > only run one batch of scan.
>
> Yes, this seems to be the problem. Sigh. By the time 2.6.8 came around I
> just didn't have time to do the amount of testing which any page reclaim
> tweak necessitates.

Hi Andrew,

It all makes sense to me (Wu's description of the problem and your patch),
but it is still no good with respect to fair scanning. Moreover, the patch hurts
interactivity _badly_, and I'm not sure why (ssh login to the box with the FFSB
testcase running takes more than one minute, while vanilla takes a few dozen seconds).

Here follows an interesting part of "diff -u 2614-vanilla.vmstat 2614-akpm.vmstat"
(they were not retrieved at exactly the same point in the benchmark run, but
that should not matter much):

-slabs_scanned 37632
-kswapd_steal 731859
-kswapd_inodesteal 1363
-pageoutrun 26573
-allocstall 636
-pgrotated 1898
+slabs_scanned 2688
+kswapd_steal 502946
+kswapd_inodesteal 1
+pageoutrun 10612
+allocstall 90
+pgrotated 68

Note how direct reclaim (and slabs_scanned) are hugely affected.

Normal: 114688kB
DMA: 16384kB

Normal/DMA ratio = 114688 / 16384 = 7.000

******* 2.6.14 vanilla ********

* kswapd scanning rates
pgscan_kswapd_normal 450483
pgscan_kswapd_dma 84645
pgscan_kswapd Normal/DMA = (450483 / 84645) = 5.322

* direct scanning rates
pgscan_direct_normal 23826
pgscan_direct_dma 4224
pgscan_direct Normal/DMA = (23826 / 4224) = 5.640

* global (kswapd+direct) scanning rates
pgscan_normal = (450483 + 23826) = 474309
pgscan_dma = (84645 + 4224) = 88869
pgscan Normal/DMA = (474309 / 88869) = 5.337

pgalloc_normal = 794293
pgalloc_dma = 123805
pgalloc_normal_dma_ratio = (794293/123805) = 6.415

******* 2.6.14 akpm-no-nr_to_reclaim ********

* kswapd scanning rates
pgscan_kswapd_normal 441936
pgscan_kswapd_dma 80520
pgscan_kswapd Normal/DMA = (441936 / 80520) = 5.488

* direct scanning rates
pgscan_direct_normal 7392
pgscan_direct_dma 1188
pgscan_direct Normal/DMA = (7392/1188) = 6.222

* global (kswapd+direct) scanning rates
pgscan_normal = (441936 + 7392) = 449328
pgscan_dma = (80520 + 1188) = 81708
pgscan Normal/DMA = (449328 / 81708) = 5.499

pgalloc_normal = 559994
pgalloc_dma = 84883
pgalloc_normal_dma_ratio = (559994 / 84883) = 6.597

****** 2.6.14 isolate relative *****

* kswapd scanning rates
pgscan_kswapd_normal 664883
pgscan_kswapd_dma 82845
pgscan_kswapd Normal/DMA (664883/82845) = 8.025

* direct scanning rates
pgscan_direct_normal 13485
pgscan_direct_dma 1745
pgscan_direct Normal/DMA = (13485/1745) = 7.727

* global (kswapd+direct) scanning rates
pgscan_normal = (664883 + 13485) = 678368
pgscan_dma = (82845 + 1745) = 84590
pgscan Normal/DMA = (678368 / 84590) = 8.019

pgalloc_normal 699927
pgalloc_dma 66313
pgalloc_normal_dma_ratio = (699927/66313) = 10.554

2005-12-02 21:38:52

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

Marcelo Tosatti <[email protected]> wrote:
>
>
> It all makes sense to me (Wu's description of the problem and your patch),
> but still no good with reference to fair scanning.

Not so. On a 4G x86 box doing a simple 8GB write this patch took the
highmem/normal scanning ratio from 0.7 to 3.5. On that setup the highmem
zone has 3.6x as many pages as the normal zone, so it's bang-on-target.

There's not a lot of point in jumping straight into the complex stresstests
without having first tested the simple stuff.

> Moreover the patch hurts
> interactivity _badly_, not sure why (ssh into the box with FFSB testcase
> takes more than one minute to login, while vanilla takes few dozens of seconds).

Well, we know that the revert reintroduces an overscanning problem.

How are you invoking FFSB? Exactly? On what sort of machine, with how
much memory?

> Follows an interesting part of "diff -u 2614-vanilla.vmstat 2614-akpm.vmstat"
> (they were not retrieve at the exact same point in the benchmark run, but
> that should not matter much):
>
> -slabs_scanned 37632
> -kswapd_steal 731859
> -kswapd_inodesteal 1363
> -pageoutrun 26573
> -allocstall 636
> -pgrotated 1898
> +slabs_scanned 2688
> +kswapd_steal 502946
> +kswapd_inodesteal 1
> +pageoutrun 10612
> +allocstall 90
> +pgrotated 68
>
> Note how direct reclaim (and slabs_scanned) are hugely affected.

hm. allocstall is much lower, pgrotated has improved and direct reclaim
has improved, all of which would indicate that kswapd is doing more work.
Yet kswapd reclaimed fewer pages. It's hard to say what's going on, as these
numbers came from different stages of the test.

>
> Normal: 114688kB
> DMA: 16384kB
>
> Normal/DMA ratio = 114688 / 16384 = 7.000
>
> pgscan_kswapd Normal/DMA = (450483 / 84645) = 5.322
> pgscan_direct Normal/DMA = (23826 / 4224) = 5.640
> pgscan Normal/DMA = (474309 / 88869) = 5.337
> pgscan_kswapd Normal/DMA = (441936 / 80520) = 5.488
> pgscan_direct Normal/DMA = (7392/1188) = 6.222
> pgscan Normal/DMA = (449328 / 81708) = 5.499
> pgalloc_normal_dma_ratio = (559994 / 84883) = 6.597
> pgscan_kswapd Normal/DMA (664883/82845) = 8.025
> pgscan_direct Normal/DMA = (13485/1745) = 7.727
> pgscan Normal/DMA = (678368 / 84590) = 8.019
> pgalloc_normal_dma_ratio = (699927/66313) = 10.554

All of these look close enough to me. 10-20% over- or under-scanning of
the teeny DMA zone doesn't seem very important. Getting normal-vs-highmem
right is more important.

It's hard to say what effect the watermark thingies have on all of this.
I'd sugget that you start out with much less complex tests and see if `echo
10000 10000 10000 > /proc/sys/vm/lowmem_reserve_ratio' changes anything.
(I have that in my rc.local - the thing is a daft waste of memory).

I'd be more concerned about the interactivity thing, although it sounds
like the machine is so overloaded with this test that it'd be fairly
pointless to try to tune that workload first. It's more important to tune
the system for more typical heavy loads.

Also, the choice of IO scheduler matters. Which are you using?

2005-12-03 14:18:10

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Fri, Dec 02, 2005 at 01:39:17PM -0800, Andrew Morton wrote:
> Marcelo Tosatti <[email protected]> wrote:
> >
> >
> > It all makes sense to me (Wu's description of the problem and your patch),
> > but still no good with reference to fair scanning.
>
> Not so. On a 4G x86 box doing a simple 8GB write this patch took the
> highmem/normal scanning ratio from 0.7 to 3.5. On that setup the highmem
> zone has 3.6x as many pages as the normal zone, so it's bang-on-target.

Humpf! What are the pgalloc dma/normal/highmem numbers under such test?

Does this machine need bounce buffers for disk I/O?

> There's not a lot of point in jumping straight into the complex stresstests
> without having first tested the simple stuff.

It's not a really complex stresstest, though yours is simpler. There are 10
threads operating on 20 files. You can reproduce the load using the
following FFSB profile (I remake the filesystem each time, results are
pretty stable):

num_filesystems=1
num_threadgroups=1
directio=0
time=300

[filesystem0]
location=/mnt/hda4/
num_files=20
num_dirs=10
max_filesize=91534338
min_filesize=65535
[end0]

[threadgroup0]
num_threads=10
write_size=2816
write_blocksize=4096
read_size=2816
read_blocksize=4096
create_weight=100
write_weight=30
read_weight=100
[end0]


> > Moreover the patch hurts
> > interactivity _badly_, not sure why (ssh into the box with FFSB testcase
> > takes more than one minute to login, while vanilla takes few dozens of seconds).
>
> Well, we know that the revert reintroduces an overscanning problem.

Can you remember the testcase for which you added the "truncate reclaim"
logic more precisely?

> How are you invoking FFSB? Exactly? On what sort of machine, with how
> much memory?

It's a single-processor Pentium-3 1GHz+ booted with mem=128M, with a 4-5 year old IDE disk.

> > Follows an interesting part of "diff -u 2614-vanilla.vmstat 2614-akpm.vmstat"
> > (they were not retrieve at the exact same point in the benchmark run, but
> > that should not matter much):
> >
> > -slabs_scanned 37632
> > -kswapd_steal 731859
> > -kswapd_inodesteal 1363
> > -pageoutrun 26573
> > -allocstall 636
> > -pgrotated 1898
> > +slabs_scanned 2688
> > +kswapd_steal 502946
> > +kswapd_inodesteal 1
> > +pageoutrun 10612
> > +allocstall 90
> > +pgrotated 68
> >
> > Note how direct reclaim (and slabs_scanned) are hugely affected.
>
> hm. allocstall is much lower and pgrotated has improved and direct reclaim
> has improved. All of which would indicate that kswapd is doing more work.
> Yet kswapd reclaimed less pages. It's hard to say what's going on as these
> numbers came from different stages of the test.

I have a feeling they came from roughly equivalent stages (FFSB is a cyclic
test; there are not many "phases" after the initial creation of the files).

Feel free to reproduce the testcase, you simply need the FFSB profile
above and mem=128M.

It seems very fragile (Wu's patches attempt to address that) in general: you
tweak it here and watch it go nuts there.

> > Normal: 114688kB
> > DMA: 16384kB
> >
> > Normal/DMA ratio = 114688 / 16384 = 7.000
> >
> > pgscan_kswapd Normal/DMA = (450483 / 84645) = 5.322
> > pgscan_direct Normal/DMA = (23826 / 4224) = 5.640
> > pgscan Normal/DMA = (474309 / 88869) = 5.337
> > pgscan_kswapd Normal/DMA = (441936 / 80520) = 5.488
> > pgscan_direct Normal/DMA = (7392/1188) = 6.222
> > pgscan Normal/DMA = (449328 / 81708) = 5.499
> > pgalloc_normal_dma_ratio = (559994 / 84883) = 6.597
> > pgscan_kswapd Normal/DMA (664883/82845) = 8.025
> > pgscan_direct Normal/DMA = (13485/1745) = 7.727
> > pgscan Normal/DMA = (678368 / 84590) = 8.019
> > pgalloc_normal_dma_ratio = (699927/66313) = 10.554
>
> All of these look close enough to me. 10-20% over- or under-scanning of
> the teeny DMA zone doesn't seem very important.

Hopefully yes. The lowmem_reserve[] logic is there to _avoid_ over-allocation
(over-scanning) of the DMA zone by GFP_NORMAL allocations, isn't it?

Note, there should be no DMA-limited hardware on this box (I'm using PIO for the
IDE disk). BTW, why do you need lowmem_reserve for the DMA zone if you don't
have 16MB-capped ISA devices on your system?

> Getting normal-vs-highmem right is more important.
>
> It's hard to say what effect the watermark thingies have on all of this.
> I'd sugget that you start out with much less complex tests and see if `echo
> 10000 10000 10000 > /proc/sys/vm/lowmem_reserve_ratio' changes anything.
> (I have that in my rc.local - the thing is a daft waste of memory).
>
> I'd be more concerned about the interactivity thing, although it sounds
> like the machine is so overloaded with this test that it'd be fairly
> pointless to try to tune that workload first. It's more important to tune
> the system for more typical heavy loads.

What made me notice it was the huge interactivity difference between
vanilla and your patch, again, I'm not really sure about its root.

> Also, the choice of IO scheduler matters. Which are you using?

The default for 2.6.14. That's AS, right?

I'll see if I can do more tests next week.

Best wishes.

2005-12-04 05:56:39

by Wu Fengguang

[permalink] [raw]
Subject: Re: [PATCH 02/12] mm: supporting variables and functions for balanced zone aging

On Fri, Dec 02, 2005 at 10:26:14PM -0200, Marcelo Tosatti wrote:
> It seems very fragile (Wu's patches attempt to address that) in general: you
> tweak it here and watch it go nuts there.

The patch still has problems, and it can lead to more page allocations in
remote nodes.

On NUMA systems, HPC applications basically want locality, and file servers
want cache consistency. Worse, the two types of applications can coexist in a
single system. The general solution may be to classify pages into two types:

local pages: mostly local accessed, and low latency is first priority
global pages: for consistent file caching

Reclaim of global pages should be balanced globally to form a seamless, single
global cache. We could allocate special zones to hold the global pages, and
keep reclaim from them in sync. Nick, are you working on this?

Thanks,
Wu