The vm subsystem is rather complex. System memory is divided into zones,
and lower zones act as fallbacks for higher zones in memory allocation.
The page reclaim algorithm should generally keep zone aging rates in sync.
But if a zone under its watermark has many unreclaimable pages, it has to
be scanned much more to get enough free pages. While doing this,
- lower zones should also be scanned more, since their pages are also
  usable for higher zone allocations (the fallback rule is sketched below);
- higher zones should not be scanned just to keep the aging in sync, which
  can evict a large number of pages without solving the problem (and may
  well worsen it).
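
As a side note, here is a minimal userspace sketch of that fallback rule.
It is not the real allocator: the struct and field names are made up for
illustration only.

/*
 * An allocation whose preferred zone has index classzone_idx may be
 * satisfied from that zone or any lower one, so freeing pages in a
 * lower zone helps higher-zone allocations, but not the other way round.
 */
#include <stddef.h>

struct zone_stub {
	unsigned long free_pages;
	unsigned long pages_low;
};

static struct zone_stub *pick_zone(struct zone_stub *zones,
				   int classzone_idx)
{
	int i;

	for (i = classzone_idx; i >= 0; i--)	/* walk from high to low */
		if (zones[i].free_pages > zones[i].pages_low)
			return &zones[i];

	return NULL;			/* all under watermark: reclaim */
}
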
With that in mind, the patch does the rebalancing in kswapd as follows:
1) reclaim from the lowest zone when
   - under pages_high
   - under pages_high+lowmem_reserve, and less/equally aged than the
     highest zone (or out of sync with it)
2) reclaim from higher zones when
   - under pages_high+lowmem_reserve, and less/equally aged than their
     immediate lower neighbor (or out of sync with it)
Note that the zone age is a normalized value in the range 0-4096 on
i386/4G: 4096 corresponds to a full scan of one zone. The comparison of
ages is only deemed ok if the gap is less than 4096/8; otherwise the zones
are regarded as out of sync (sketched below).
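
For clarity, here is a minimal userspace sketch of the aging rule just
described. The real age_ge()/age_gt() helpers come from an earlier patch
in this series, so the code below is only my reading of the rule above,
not the kernel implementation.

#include <stdbool.h>
#include <stdlib.h>

#define MAX_ZONE_AGE	4096		   /* one full scan (i386/4G)      */
#define MAX_AGE_GAP	(MAX_ZONE_AGE / 8) /* larger gaps: out of sync     */

/* Two zones are deemed in sync only when their ages stay close together. */
static bool age_in_sync(long age_a, long age_b)
{
	return labs(age_a - age_b) < MAX_AGE_GAP;
}

/*
 * A zone should be scanned to catch up when it is less/equally aged than
 * its reference zone, or when the two are out of sync.
 */
static bool should_catch_up(long age_zone, long age_ref)
{
	return age_zone <= age_ref || !age_in_sync(age_zone, age_ref);
}
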
On exit, the code ensures:
1) the lowest zone will be pages_high ok
2) at least one zone will be pages_high+lowmem_reserve ok
3) a very strong force of rebalancing, with the exceptions of
   - some lower zones being unreclaimable: we must let them go ahead
     alone, leaving the higher zones behind
   - shrink_zone() scanning too much and creating a huge imbalance in
     one run (Nick is working on this)
The logic can deal with the known normal/abnormal situations gracefully:
1) Normal case
   - zone ages are cyclically tied together: taking over each other, and
     keeping close enough
2) A zone is unreclaimable, scanned much more, and becomes out of sync
   - whenever a troublesome zone is being overscanned, the logic brings
     its lower neighbors ahead together, leaving the higher neighbors
     behind
   - the aging tie between the two groups is broken, and the relevant
     zones are reclaimed when pages_high+lowmem_reserve is not ok, just
     as before the patch
   - at some point the zone ages meet again and things go back to normal
   - a possibly better strategy, once the pressure disappears, might be
     to be reluctant to reclaim from the already overscanned lower group,
     and to let the higher group slowly catch up
3) A zone is truncated
   - it will not be reclaimed from until it falls under its watermark
With this patch, the meaning of zone->pages_high+lowmem_reserve changes
from the _required_ watermark to the _recommended_ watermark, so one might
be willing to increase those values somewhat; the two checks are sketched
below.
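
To make the required/recommended distinction concrete, here is a small
userspace sketch of the two checks (order-0 case only). The struct is a
stand-in for the relevant zone fields, and the helper mirrors the way
zone_watermark_ok() adds lowmem_reserve[classzone_idx] to the mark it is
given.

#include <stdbool.h>

#define MAX_NR_ZONES	4

struct zone_marks {
	unsigned long free_pages;
	unsigned long pages_high;
	unsigned long lowmem_reserve[MAX_NR_ZONES];
};

static bool high_wmark_ok(const struct zone_marks *z, int classzone_idx)
{
	return z->free_pages >
		z->pages_high + z->lowmem_reserve[classzone_idx];
}

/* Required: the zone can serve allocations for its own class again. */
static bool required_ok(const struct zone_marks *z)
{
	return high_wmark_ok(z, 0);
}

/* Recommended: it can also back allocations from the highest zone. */
static bool recommended_ok(const struct zone_marks *z, int nr_zones)
{
	return high_wmark_ok(z, nr_zones - 1);
}
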
Signed-off-by: Wu Fengguang <[email protected]>
---
mm/vmscan.c | 34 +++++++++++++++++++++++++++++-----
1 files changed, 29 insertions(+), 5 deletions(-)
--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1364,6 +1364,7 @@ static int balance_pgdat(pg_data_t *pgda
int total_scanned, total_reclaimed;
struct reclaim_state *reclaim_state = current->reclaim_state;
struct scan_control sc;
+ struct zone *prev_zone = pgdat->node_zones;
loop_again:
total_scanned = 0;
@@ -1379,6 +1380,9 @@ loop_again:
struct zone *zone = pgdat->node_zones + i;
zone->temp_priority = DEF_PRIORITY;
+
+ if (populated_zone(zone))
+ prev_zone = zone;
}
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
@@ -1409,14 +1413,34 @@ loop_again:
if (!populated_zone(zone))
continue;
- if (nr_pages == 0) { /* Not software suspend */
- if (zone_watermark_ok(zone, order,
- zone->pages_high, 0, 0))
- continue;
+ if (nr_pages) /* software suspend */
+ goto scan_swspd;
- all_zones_ok = 0;
+ if (zone_watermark_ok(zone, order,
+ zone->pages_high,
+ pgdat->nr_zones - 1, 0)) {
+ /* free pages enough, no reclaim */
+ } else if (zone < prev_zone) {
+ if (!zone_watermark_ok(zone, order,
+ zone->pages_high, 0, 0)) {
+ /* have to scan for free pages */
+ goto scan;
+ }
+ if (age_ge(prev_zone, zone)) {
+ /* catch up if falls behind */
+ goto scan;
+ }
+ } else if (!age_gt(zone, prev_zone)) {
+ /* catch up if falls behind or out of sync */
+ goto scan;
}
+ prev_zone = zone;
+ continue;
+scan:
+ prev_zone = zone;
+ all_zones_ok = 0;
+scan_swspd:
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;
--
Here are the new testing reports. The intermittent, concurrent copying of
two files is expected to generate a large range of unreclaimable pages.
The balance seems to have improved a lot since the last version, and the
number of direct reclaims has been reduced to a minimum.
IN QEMU
=======
root ~# grep -E '(age |rounds)' /proc/zoneinfo
rounds 142
age 3621
rounds 142
age 3499
rounds 142
age 3502
root ~# ./show-aging-rate.sh
Linux (none) 2.6.15-rc5-mm1 #8 SMP Wed Dec 7 16:06:47 CST 2005 i686 GNU/Linux
total used free shared buffers cached
Mem: 1138 1119 18 0 0 1105
-/+ buffers/cache: 14 1124
Swap: 0 0 0
---------------------------------------------------------------
active/inactive size ratios:
DMA0: 469 / 1000 = 621 / 1323
Normal0: 374 / 1000 = 58588 / 156523
HighMem0: 397 / 1000 = 18880 / 47498
active/inactive scan rates:
DMA: 273 / 1000 = 58528 / ( 210464 + 3296)
Normal: 342 / 1000 = 7851552 / ( 22838944 + 94080)
HighMem: 393 / 1000 = 2680480 / ( 6774304 + 31040)
---------------------------------------------------------------
inactive size ratios:
DMA0 / Normal0: 85 / 10000 = 1334 / 156630
Normal0 / HighMem0: 32946 / 10000 = 156630 / 47540
inactive scan rates:
DMA / Normal: 93 / 10000 = ( 210464 + 3296) / ( 22838944 + 94080)
Normal / HighMem: 33698 / 10000 = ( 22838944 + 94080) / ( 6774304 + 31040)
root ~# grep -E '(low|high|free|protection:) ' /proc/zoneinfo
pages free 1161
low 21
high 25
protection: (0, 0, 880, 1140)
pages free 3505
low 1173
high 1408
protection: (0, 0, 0, 2080)
pages free 189
low 134
high 203
protection: (0, 0, 0, 0)
root ~# vmstat 5 10
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 0 18556 344 1131720 0 0 16 5 1042 37 4 66 30 0
2 0 0 18616 352 1132324 0 0 0 16 1035 48 6 94 0 0
2 0 0 19060 348 1131512 0 0 0 3 974 52 6 94 0 0
2 0 0 18256 268 1132272 0 0 0 8 1018 50 6 94 0 0
2 0 0 19096 248 1132020 0 0 0 3 1009 49 6 94 0 0
2 0 0 19520 248 1130524 0 0 0 8 989 50 6 94 0 0
2 0 0 18916 248 1131680 0 0 0 3 1008 49 6 94 0 0
1 0 0 18436 208 1132740 0 0 0 7 1009 64 4 96 0 0
2 0 0 18976 200 1132272 0 0 0 14 1029 64 5 95 0 0
2 0 0 19156 200 1131932 0 0 0 8 992 48 6 94 0 0
root ~# cat /proc/vmstat
nr_dirty 9
nr_writeback 0
nr_unstable 0
nr_page_table_pages 22
nr_mapped 971
nr_slab 1405
pgpgin 68177
pgpgout 21820
pswpin 0
pswpout 0
pgalloc_high 4338439
pgalloc_normal 15416448
pgalloc_dma32 0
pgalloc_dma 157690
pgfree 19917495
pgactivate 10660320
pgdeactivate 10582874
pgkeephot 53405
pgkeepcold 17
pgfault 145079
pgmajfault 116
pgrefill_high 2707872
pgrefill_normal 7936896
pgrefill_dma32 0
pgrefill_dma 59392
pgsteal_high 4264660
pgsteal_normal 15171368
pgsteal_dma32 0
pgsteal_dma 155635
pgscan_kswapd_high 6843616
pgscan_kswapd_normal 23067616
pgscan_kswapd_dma32 0
pgscan_kswapd_dma 212352
pgscan_direct_high 31040
pgscan_direct_normal 94080
pgscan_direct_dma32 0
pgscan_direct_dma 3296
pginodesteal 0
slabs_scanned 128
kswapd_steal 19582040
kswapd_inodesteal 0
pageoutrun 547184
allocstall 274
pgrotated 8
nr_bounce 0
ON A REAL BOX
=============
root@Server ~# grep -E '(age |rounds)' /proc/zoneinfo
rounds 164
age 410
rounds 150
age 396
rounds 150
age 396
root@Server ~# ./show-aging-rate.sh
Linux Server 2.6.15-rc5-mm1 #9 SMP Wed Dec 7 16:47:56 CST 2005 i686 GNU/Linux
total used free shared buffers cached
Mem: 2020 1970 50 0 5 1916
-/+ buffers/cache: 48 1972
Swap: 0 0 0
---------------------------------------------------------------
active/inactive size ratios:
DMA0: 132 / 1000 = 123 / 930
Normal0: 161 / 1000 = 28022 / 173838
HighMem0: 177 / 1000 = 43935 / 247952
active/inactive scan rates:
DMA: 170 / 1000 = 23889 / ( 118528 + 21216)
Normal: 210 / 1000 = 5296960 / ( 24645696 + 484160)
HighMem: 239 / 1000 = 8501024 / ( 34741600 + 752000)
---------------------------------------------------------------
inactive size ratios:
DMA0 / Normal0: 53 / 10000 = 930 / 173838
Normal0 / HighMem0: 7010 / 10000 = 173838 / 247952
inactive scan rates:
DMA / Normal: 55 / 10000 = ( 118528 + 21216) / ( 24645696 + 484160)
Normal / HighMem: 7080 / 10000 = ( 24645696 + 484160) / ( 34741600 + 752000)
pageoutrun / allocstall = 73374 / 100 = 1072730 / 1461
Here is another test, on a 512M desktop. This time only a big sparse file
is copied.
- The inactive_list balance is perfectly maintained
- The active_list is scanned a bit more, because the calculation is
  performed after the scan, when nr_inactive has been made a little
  smaller
- Direct reclaims are near zero; good or evil?
wfg ~% grep -E '(age |rounds)' /proc/zoneinfo
rounds 100
age 659
rounds 100
age 621
wfg ~% show-aging-rate.sh
Linux lark 2.6.15-rc5-mm1 #9 SMP Wed Dec 7 16:47:56 CST 2005 i686 GNU/Linux
total used free shared buffers cached
Mem: 501 475 26 0 6 278
-/+ buffers/cache: 189 311
Swap: 127 2 125
---------------------------------------------------------------
active/inactive size ratios:
DMA0: 75 / 1000 = 161 / 2135
Normal0: 1074 / 1000 = 59046 / 54936
active/inactive scan rates:
DMA: 31 / 1000 = 7867 / ( 246784 + 0)
Normal: 974 / 1000 = 5847744 / ( 6001216 + 128)
---------------------------------------------------------------
inactive size ratios:
DMA0 / Normal0: 388 / 10000 = 2135 / 54936
inactive scan rates:
DMA / Normal: 411 / 10000 = ( 246784 + 0) / ( 6001216 + 128)
pageoutrun / allocstall = 4140780 / 100 = 207039 / 4
wfg ~% vmstat 5 10
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
3 0 2612 6408 7280 293960 0 1 122 13 1178 1415 14 9 75 2
1 0 2612 6404 2144 298144 0 0 190 35 1159 1675 11 89 0 0
1 0 2612 6216 2052 299596 0 0 442 8 1189 1612 10 90 0 0
1 0 2612 5916 2032 299888 0 0 326 0 1182 1713 10 90 0 0
1 3 2612 6528 3252 297240 0 0 795 6 1275 1464 10 53 0 37
0 0 2612 5648 3644 298480 0 0 739 14 1261 1203 14 23 39 24
[the big cp stops about here]
0 0 2612 5784 3660 298532 0 0 0 17 1130 1322 10 3 87 0
0 0 2612 5952 3692 298500 0 0 6 4 1137 1343 9 2 87 2
0 0 2612 5976 3700 298492 0 0 2 3 1143 1327 7 1 91 0
0 0 2612 6000 3700 298492 0 0 0 0 1138 1315 7 2 91 0