Explicitly teach kswapd about the incremental min logic instead of just scanning
all zones under the first low zone. This should apply more even pressure across
the zones.
Signed-off-by: Nick Piggin <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
mm/vmscan.c | 111 ++++++++++++++++++++----------------------------------------
1 files changed, 37 insertions(+), 74 deletions(-)
--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1310,101 +1310,65 @@ loop_again:
}
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
- int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;
+ int first_low_zone = 0;
+
+ all_zones_ok = 1;
+ sc.nr_scanned = 0;
+ sc.nr_reclaimed = 0;
+ sc.priority = priority;
+ sc.swap_cluster_max = nr_pages ? nr_pages : SWAP_CLUSTER_MAX;
/* The swap token gets in the way of swapout... */
if (!priority)
disable_swap_token();
- all_zones_ok = 1;
-
- if (nr_pages == 0) {
- /*
- * Scan in the highmem->dma direction for the highest
- * zone which needs scanning
- */
- for (i = pgdat->nr_zones - 1; i >= 0; i--) {
- struct zone *zone = pgdat->node_zones + i;
+ /* Scan in the highmem->dma direction */
+ for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+ struct zone *zone = pgdat->node_zones + i;
- if (!populated_zone(zone))
- continue;
+ if (!populated_zone(zone))
+ continue;
- if (zone->all_unreclaimable &&
- priority != DEF_PRIORITY)
+ if (nr_pages == 0) { /* Not software suspend */
+ if (zone_watermark_ok(zone, order,
+ zone->pages_high, first_low_zone, 0))
continue;
- if (!zone_watermark_ok(zone, order,
- zone->pages_high, 0, 0)) {
- end_zone = i;
- goto scan;
- }
+ all_zones_ok = 0;
+ if (first_low_zone < i)
+ first_low_zone = i;
}
- goto out;
- } else {
- end_zone = pgdat->nr_zones - 1;
- }
-scan:
- for (i = 0; i <= end_zone; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
- lru_pages += zone->nr_active + zone->nr_inactive;
- }
-
- /*
- * Now scan the zone in the dma->highmem direction, stopping
- * at the last zone which needs scanning.
- *
- * We do this because the page allocator works in the opposite
- * direction. This prevents the page allocator from allocating
- * pages behind kswapd's direction of progress, which would
- * cause too much scanning of the lower zones.
- */
- for (i = 0; i <= end_zone; i++) {
- struct zone *zone = pgdat->node_zones + i;
- int nr_slab;
-
- if (!populated_zone(zone))
- continue;
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;
- if (nr_pages == 0) { /* Not software suspend */
- if (!zone_watermark_ok(zone, order,
- zone->pages_high, end_zone, 0))
- all_zones_ok = 0;
- }
zone->temp_priority = priority;
if (zone->prev_priority > priority)
zone->prev_priority = priority;
- sc.nr_scanned = 0;
- sc.nr_reclaimed = 0;
- sc.priority = priority;
- sc.swap_cluster_max = nr_pages? nr_pages : SWAP_CLUSTER_MAX;
- atomic_inc(&zone->reclaim_in_progress);
+ lru_pages += zone->nr_active + zone->nr_inactive;
+
shrink_zone(zone, &sc);
- atomic_dec(&zone->reclaim_in_progress);
- reclaim_state->reclaimed_slab = 0;
- nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
- lru_pages);
- sc.nr_reclaimed += reclaim_state->reclaimed_slab;
- total_reclaimed += sc.nr_reclaimed;
- total_scanned += sc.nr_scanned;
- if (zone->all_unreclaimable)
- continue;
- if (nr_slab == 0 && zone->pages_scanned >=
+
+ if (zone->pages_scanned >=
(zone->nr_active + zone->nr_inactive) * 4)
zone->all_unreclaimable = 1;
- /*
- * If we've done a decent amount of scanning and
- * the reclaim ratio is low, start doing writepage
- * even in laptop mode
- */
- if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
- total_scanned > total_reclaimed+total_reclaimed/2)
- sc.may_writepage = 1;
}
+ reclaim_state->reclaimed_slab = 0;
+ shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages);
+ sc.nr_reclaimed += reclaim_state->reclaimed_slab;
+ total_reclaimed += sc.nr_reclaimed;
+ total_scanned += sc.nr_scanned;
+
+ /*
+ * If we've done a decent amount of scanning and
+ * the reclaim ratio is low, start doing writepage
+ * even in laptop mode
+ */
+ if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
+ total_scanned > total_reclaimed+total_reclaimed/2)
+ sc.may_writepage = 1;
+
if (nr_pages && to_free > total_reclaimed)
continue; /* swsusp: need to do more work */
if (all_zones_ok)
@@ -1425,7 +1389,6 @@ scan:
if ((total_reclaimed >= SWAP_CLUSTER_MAX) && (!nr_pages))
break;
}
-out:
for (i = 0; i < pgdat->nr_zones; i++) {
struct zone *zone = pgdat->node_zones + i;
--
Wu Fengguang <[email protected]> wrote:
>
> Explicitly teach kswapd about the incremental min logic instead of just scanning
> all zones under the first low zone. This should apply more even pressure across
> the zones.
I spat this back a while ago. See the changelog (below) for the logic
which you're removing.
This change appears to go back to performing reclaim in the highmem->lowmem
direction. Page reclaim might go all lumpy again.
Shouldn't first_low_zone be initialised to ZONE_HIGHMEM (or pgdat->nr_zones
- 1) rather than to 0, or something? I don't understand why we're passing
zero as the classzone_idx into zone_watermark_ok() in the first go around
the loop.
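
For reference, the incremental min lives inside zone_watermark_ok(): the
classzone_idx argument selects which entry of the zone's lowmem_reserve[] is
added on top of the watermark, so passing 0 only protects the zone against
ZONE_DMA-class allocations. A rough sketch of the mainline logic from around
this time (simplified -- the extra flag arguments carried in this tree are
omitted):

	/*
	 * Simplified sketch, not a verbatim copy of this tree's code.
	 * The real function also takes flags that can lower the watermark.
	 */
	int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
				int classzone_idx)
	{
		/* free_pages may go negative - that's OK */
		long min = mark, free_pages = z->free_pages - (1 << order) + 1;
		int o;

		/*
		 * The incremental min: a higher classzone_idx adds that
		 * class's lowmem reserve on top of the watermark, holding
		 * this zone back for allocations from higher zones.
		 */
		if (free_pages <= min + z->lowmem_reserve[classzone_idx])
			return 0;

		for (o = 0; o < order; o++) {
			/* At the next order, this order's pages become unavailable */
			free_pages -= z->free_area[o].nr_free << o;
			/* Require fewer higher-order pages to be free */
			min >>= 1;
			if (free_pages <= min)
				return 0;
		}
		return 1;
	}

So with first_low_zone as the classzone_idx, a lower zone only counts as "ok"
if it can also cover the reserve it keeps for that higher low zone -- which
appears to be the incremental min the changelog refers to.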
And this bit, which Nick didn't reply to (wimp!). I think it's a bug.
Looking at it, I am confused.
In the first loop:
for (i = pgdat->nr_zones - 1; i >= 0; i--) {
struct zone *zone = pgdat->node_zones + i;
...
if (!zone_watermark_ok(zone, order,
zone->pages_high, 0, 0)) {
end_zone = i;
goto scan;
}
end_zone gets the value of the highest-numbered zone which needs scanning.
Where `0' corresponds to ZONE_DMA. (correct?)
In the second loop:
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
Shouldn't that be
for (i = end_zone; ...; i++)
or am I on crack?
As kswapd is now scanning zones in the highmem->normal->dma direction it can
get into competition with the page allocator: kswapd keeps on trying to free
pages from highmem, then kswapd moves on to lowmem. By the time kswapd has
done proportional scanning in lowmem, someone has come in and allocated a few
pages from highmem. So kswapd goes back and frees some highmem, then some
lowmem again. But nobody has allocated any lowmem yet. So we keep on and on
scanning lowmem in response to highmem page allocations.
With a simple `dd' on a 1G box we get:
r b swpd free buff cache si so bi bo in cs us sy wa id
0 3 0 59340 4628 922348 0 0 4 28188 1072 808 0 10 46 44
0 3 0 29932 4660 951760 0 0 0 30752 1078 441 1 6 30 64
0 3 0 57568 4556 924052 0 0 0 30748 1075 478 0 8 43 49
0 3 0 29664 4584 952176 0 0 0 30752 1075 472 0 6 34 60
0 3 0 5304 4620 976280 0 0 4 40484 1073 456 1 7 52 41
0 3 0 104856 4508 877112 0 0 0 18452 1074 97 0 7 67 26
0 3 0 70768 4540 911488 0 0 0 35876 1078 746 0 7 34 59
1 2 0 42544 4568 939680 0 0 0 21524 1073 556 0 5 43 51
0 3 0 5520 4608 976428 0 0 4 37924 1076 836 0 7 41 51
0 2 0 4848 4632 976812 0 0 32 12308 1092 94 0 1 33 66
Simple fix: go back to scanning the zones in the dma->normal->highmem
direction so we meet the page allocator in the middle somewhere.
r b swpd free buff cache si so bi bo in cs us sy wa id
1 3 0 5152 3468 976548 0 0 4 37924 1071 650 0 8 64 28
1 2 0 4888 3496 976588 0 0 0 23576 1075 726 0 6 66 27
0 3 0 5336 3532 976348 0 0 0 31264 1072 708 0 8 60 32
0 3 0 6168 3560 975504 0 0 0 40992 1072 683 0 6 63 31
0 3 0 4560 3580 976844 0 0 0 18448 1073 233 0 4 59 37
0 3 0 5840 3624 975712 0 0 4 26660 1072 800 1 8 46 45
0 3 0 4816 3648 976640 0 0 0 40992 1073 526 0 6 47 47
0 3 0 5456 3672 976072 0 0 0 19984 1070 320 0 5 60 35
---
25-akpm/mm/vmscan.c | 37 +++++++++++++++++++++++++++++++++++--
1 files changed, 35 insertions(+), 2 deletions(-)
diff -puN mm/vmscan.c~kswapd-avoid-higher-zones-reverse-direction mm/vmscan.c
--- 25/mm/vmscan.c~kswapd-avoid-higher-zones-reverse-direction 2004-03-12 01:33:09.000000000 -0800
+++ 25-akpm/mm/vmscan.c 2004-03-12 01:33:09.000000000 -0800
@@ -924,8 +924,41 @@ static int balance_pgdat(pg_data_t *pgda
for (priority = DEF_PRIORITY; priority; priority--) {
int all_zones_ok = 1;
int pages_scanned = 0;
+ int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
- for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+
+ if (nr_pages == 0) {
+ /*
+ * Scan in the highmem->dma direction for the highest
+ * zone which needs scanning
+ */
+ for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+ struct zone *zone = pgdat->node_zones + i;
+
+ if (zone->all_unreclaimable &&
+ priority != DEF_PRIORITY)
+ continue;
+
+ if (zone->free_pages <= zone->pages_high) {
+ end_zone = i;
+ goto scan;
+ }
+ }
+ goto out;
+ } else {
+ end_zone = pgdat->nr_zones - 1;
+ }
+scan:
+ /*
+ * Now scan the zone in the dma->highmem direction, stopping
+ * at the last zone which needs scanning.
+ *
+ * We do this because the page allocator works in the opposite
+ * direction. This prevents the page allocator from allocating
+ * pages behind kswapd's direction of progress, which would
+ * cause too much scanning of the lower zones.
+ */
+ for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
int total_scanned = 0;
int max_scan;
@@ -965,7 +998,7 @@ static int balance_pgdat(pg_data_t *pgda
if (pages_scanned)
blk_congestion_wait(WRITE, HZ/10);
}
-
+out:
for (i = 0; i < pgdat->nr_zones; i++) {
struct zone *zone = pgdat->node_zones + i;
_
On Thu, Dec 01, 2005 at 02:33:30AM -0800, Andrew Morton wrote:
> I spat this back a while ago. See the changelog (below) for the logic
> which you're removing.
>
> This change appears to go back to performing reclaim in the highmem->lowmem
> direction. Page reclaim might go all lumpy again.
>
> Shouldn't first_low_zone be initialised to ZONE_HIGHMEM (or pgdat->nr_zones
> - 1) rather than to 0, or something? I don't understand why we're passing
> zero as the classzone_idx into zone_watermark_ok() in the first go around
> the loop.
Sorry, I should note that I'm mainly taking its zone-range --> zones-under-watermark
cleanups. The scan order is reverted back to DMA->HighMem in
mm-balance-zone-aging-in-kswapd-reclaim.patch, and the first_low_zone logic is
also replaced there with a quite different one.
My thinking is that the overall reclaim-for-watermark should be weakened to do
just the minimal watermark-safeguard work, so that it will not be a major force
of imbalance.
Assume there are three zones. The dynamics go something like:

HighMem exhausted --> reclaim from it --> it becomes more aged --> reclaim the
other two zones for aging

DMA reclaimed --> its age leaps ahead --> reclaim the Normal zone for aging, while
HighMem is being reclaimed for watermark
In the kswapd path, if there are N rounds of reclaim-for-watermark with
all_zones_ok=0, there could be N+1 rounds of reclaim-for-aging, the extra one
coming from the final pass where all_zones_ok=1. With this the force of balance
outweighs the force of imbalance.

In the direct reclaim path, there are 10 rounds of aging before we unconditionally
reclaim from all zones, which is pretty much force of balance.
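
As a toy illustration of that counting (a hypothetical standalone userspace
sketch, not kernel code -- the zone names and numbers are made up): every pass
ages all zones, while only the unbalanced passes reclaim for the watermark, so
the aging passes always come out one ahead.

	#include <stdio.h>

	struct toy_zone { int free_pages; int watermark; };

	int main(void)
	{
		struct toy_zone zones[3] = {
			{ 10, 40 },	/* "HighMem": under its watermark */
			{ 90, 40 },	/* "Normal":  ok */
			{ 90, 40 },	/* "DMA":     ok */
		};
		int aging_passes = 0, watermark_passes = 0;

		for (;;) {
			int all_zones_ok = 1;
			int i;

			aging_passes++;	/* reclaim-for-aging hits every zone, every pass */
			for (i = 0; i < 3; i++) {
				if (zones[i].free_pages < zones[i].watermark) {
					all_zones_ok = 0;
					zones[i].free_pages += 15;	/* reclaim-for-watermark */
				}
			}
			if (all_zones_ok)
				break;
			watermark_passes++;	/* counted only on unbalanced passes */
		}
		/* prints: watermark passes: 2, aging passes: 3 (N vs. N+1) */
		printf("watermark passes: %d, aging passes: %d\n",
			watermark_passes, aging_passes);
		return 0;
	}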
In summary:
- HighMem zone is normally exhausted first and mostly reclaimed for watermark.
- DMA zone is now mainly reclaimed for aging.
- Normal zone will be mostly reclaimed for aging, sometimes for watermark.
Thanks,
Wu