2013-03-17 13:04:29

by Mel Gorman

Subject: [RFC PATCH 0/10] Reduce system disruption due to kswapd

Kswapd and page reclaim behaviour has been screwy in one way or another
for a long time. Very broadly speaking it worked in the far past because
machines had limited memory so there were not that many pages to scan
and it stalled in congestion_wait() frequently to prevent it going completely
nuts. In recent times it has behaved very unsatisfactorily, with some of
the problems compounded by the removal of the stall logic and the introduction
of transparent hugepage support with its high-order reclaims.

There are many variations of bugs that are rooted in this area. One example
is reports of large copy operations or backups causing the machine to
grind to a halt or applications being pushed to swap. Sometimes in low memory
situations a large percentage of memory suddenly gets reclaimed. In other
cases an application starts and kswapd hits 100% CPU usage for prolonged
periods of time, and so on. There is now talk of introducing features like
an extra free kbytes tunable to work around aspects of the problem instead
of trying to deal with it. It's compounded by the fact that the behaviour
can be very workload and machine specific.

This RFC is aimed at investigating whether kswapd can address these various
problems in a relatively straightforward fashion without a fundamental
rewrite.

Patches 1-2 limit the number of pages kswapd reclaims while still obeying
the anon/file proportion of the LRUs it should be scanning.

Patches 3-4 control how and when kswapd raises its scanning priority and
delete the scanning restart logic, which is tricky to follow.

Patch 5 notes that it is too easy for kswapd to reach priority 0 when
scanning and then reclaim the world. Down with that sort of thing.

Patch 6 notes that kswapd starts writeback based on scanning priority which
is not necessarily related to dirty pages. It has kswapd write back
pages if many unqueued dirty pages have recently been
encountered at the tail of the LRU.

Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
to reduce LRU churn and the likelihood that it'll reclaim young
clean pages or push applications to swap. It will cause kswapd
to block on IO if it detects that pages being reclaimed under
writeback are recycling through the LRU before the IO completes.

Patch 8 shrinks slab just once per priority scanned or if a zone is otherwise
unreclaimable to avoid hammering slab when kswapd has to skip a
large number of pages.

Patches 9-10 are cosmetic but should make balance_pgdat() easier to follow.

This was tested using memcached+memcachetest while some background IO
was in progress, as implemented by the parallel IO tests in MM
Tests. memcachetest benchmarks how many operations/second memcached can
service and it is run multiple times. It starts with no background IO and
then re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress. The expectation is that the IO should
have little or no impact on memcachetest which is running entirely in memory.

Ordinarily this test is run a number of times for each amount of IO and
the worst result reported, but these results are based on just one run as
a quick test. ftrace was also running so there were additional sources of
interference and the results will be more variable than normal. More
comprehensive tests are queued but they'll take quite some time to
complete. The kernel baseline is v3.9-rc2 and the following kernels were tested:

vanilla 3.9-rc2
flatten-v1r8 Patches 1-4
limitprio-v1r8 Patches 1-5
write-v1r8 Patches 1-6
block-v1r8 Patches 1-7
tidy-v1r8 Patches 1-10

3.9.0-rc2 3.9.0-rc2 3.9.0-rc2 3.9.0-rc2 3.9.0-rc2
vanilla flatten-v1r8 limitprio-v1r8 block-v1r8 tidy-v1r8
Ops memcachetest-0M 10932.00 ( 0.00%) 10898.00 ( -0.31%) 10903.00 ( -0.27%) 10911.00 ( -0.19%) 10916.00 ( -0.15%)
Ops memcachetest-749M 7816.00 ( 0.00%) 10715.00 ( 37.09%) 11006.00 ( 40.81%) 10903.00 ( 39.50%) 10856.00 ( 38.89%)
Ops memcachetest-2498M 3974.00 ( 0.00%) 3190.00 (-19.73%) 11623.00 (192.48%) 11142.00 (180.37%) 10930.00 (175.04%)
Ops memcachetest-4246M 2355.00 ( 0.00%) 2915.00 ( 23.78%) 12619.00 (435.84%) 11212.00 (376.09%) 10904.00 (363.01%)
Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops io-duration-749M 31.00 ( 0.00%) 16.00 ( 48.39%) 9.00 ( 70.97%) 9.00 ( 70.97%) 8.00 ( 74.19%)
Ops io-duration-2498M 89.00 ( 0.00%) 111.00 (-24.72%) 27.00 ( 69.66%) 28.00 ( 68.54%) 27.00 ( 69.66%)
Ops io-duration-4246M 182.00 ( 0.00%) 165.00 ( 9.34%) 49.00 ( 73.08%) 46.00 ( 74.73%) 45.00 ( 75.27%)
Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-749M 219394.00 ( 0.00%) 162045.00 ( 26.14%) 0.00 ( 0.00%) 0.00 ( 0.00%) 16.00 ( 99.99%)
Ops swaptotal-2498M 312904.00 ( 0.00%) 389809.00 (-24.58%) 334.00 ( 99.89%) 1233.00 ( 99.61%) 8.00 (100.00%)
Ops swaptotal-4246M 471517.00 ( 0.00%) 395170.00 ( 16.19%) 0.00 ( 0.00%) 1117.00 ( 99.76%) 29.00 ( 99.99%)
Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-749M 62057.00 ( 0.00%) 5954.00 ( 90.41%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-2498M 143617.00 ( 0.00%) 154592.00 ( -7.64%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-4246M 160417.00 ( 0.00%) 125904.00 ( 21.51%) 0.00 ( 0.00%) 13.00 ( 99.99%) 0.00 ( 0.00%)
Ops minorfaults-0M 1683549.00 ( 0.00%) 1685771.00 ( -0.13%) 1675398.00 ( 0.48%) 1723245.00 ( -2.36%) 1683717.00 ( -0.01%)
Ops minorfaults-749M 1788977.00 ( 0.00%) 1871737.00 ( -4.63%) 1617193.00 ( 9.60%) 1610892.00 ( 9.95%) 1682760.00 ( 5.94%)
Ops minorfaults-2498M 1836894.00 ( 0.00%) 1796566.00 ( 2.20%) 1677878.00 ( 8.66%) 1685741.00 ( 8.23%) 1609514.00 ( 12.38%)
Ops minorfaults-4246M 1797685.00 ( 0.00%) 1819832.00 ( -1.23%) 1689258.00 ( 6.03%) 1690695.00 ( 5.95%) 1684430.00 ( 6.30%)
Ops majorfaults-0M 5.00 ( 0.00%) 7.00 (-40.00%) 5.00 ( 0.00%) 24.00 (-380.00%) 9.00 (-80.00%)
Ops majorfaults-749M 10310.00 ( 0.00%) 876.00 ( 91.50%) 73.00 ( 99.29%) 63.00 ( 99.39%) 90.00 ( 99.13%)
Ops majorfaults-2498M 20809.00 ( 0.00%) 22377.00 ( -7.54%) 102.00 ( 99.51%) 110.00 ( 99.47%) 55.00 ( 99.74%)
Ops majorfaults-4246M 23228.00 ( 0.00%) 20270.00 ( 12.73%) 196.00 ( 99.16%) 222.00 ( 99.04%) 102.00 ( 99.56%)

Note how the vanilla kernel's performance is ruined by the parallel IO,
dropping from 10932 ops/sec to 2355 ops/sec. This is likely due to the
swap activity and major faults as memcached is pushed to swap prematurely.

flatten-v1r8 overall reduces the amount of reclaim but it's a minor
improvement.

limitprio-v1r8 almost eliminates the impact the parallel IO has on the
memcachetest workload. The ops/sec remain above 10K ops/sec and there is
no swapin activity.

The remainder of the series has very little impact on the memcachetest
workload but the impact on kswapd is visible in the vmstat figures.

3.9.0-rc2 3.9.0-rc2 3.9.0-rc2 3.9.0-rc2 3.9.0-rc2
vanilla flatten-v1r8 limitprio-v1r8 block-v1r8 tidy-v1r8
Page Ins 1567012 1238608 90388 103832 75684
Page Outs 12837552 15223512 12726464 13613400 12668604
Swap Ins 366362 286798 0 13 0
Swap Outs 637724 660574 334 2337 53
Direct pages scanned 0 0 0 196955 292532
Kswapd pages scanned 11763732 4389473 207629411 22337712 3885443
Kswapd pages reclaimed 1262812 1186228 1228379 971375 685338
Direct pages reclaimed 0 0 0 186053 267255
Kswapd efficiency 10% 27% 0% 4% 17%
Kswapd velocity 9111.544 3407.923 161226.742 17342.002 3009.265
Direct efficiency 100% 100% 100% 94% 91%
Direct velocity 0.000 0.000 0.000 152.907 226.565
Percentage direct scans 0% 0% 0% 0% 7%
Page writes by reclaim 2858699 1159073 42498573 21198413 3018972
Page writes file 2220975 498499 42498239 21196076 3018919
Page writes anon 637724 660574 334 2337 53
Page reclaim immediate 6243 125 69598 1056 4370
Page rescued immediate 0 0 0 0 0
Slabs scanned 35328 39296 32000 62080 25600
Direct inode steals 0 0 0 0 0
Kswapd inode steals 16899 5491 6375 19957 907
Kswapd skipped wait 0 0 0 0 0
THP fault alloc 14 7 10 50 7
THP collapse alloc 491 465 637 709 629
THP splits 10 12 5 7 5
THP fault fallback 0 0 0 0 0
THP collapse fail 0 0 0 0 0
Compaction stalls 0 0 0 81 3
Compaction success 0 0 0 74 0
Compaction failures 0 0 0 7 3
Page migrate success 0 0 0 43855 0
Page migrate failure 0 0 0 0 0
Compaction pages isolated 0 0 0 97582 0
Compaction migrate scanned 0 0 0 111419 0
Compaction free scanned 0 0 0 324617 0
Compaction cost 0 0 0 48 0

While limitprio-v1r8 improves the performance of memcachetest, note what it
does to kswapd activity, apparently scanning on average 162K pages/second. In
reality what happened was that there were spikes in reclaim activity, but
it is severe nevertheless.
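
(For reference, assuming the usual MM Tests definitions rather than anything
reported by the kernel itself: "efficiency" is pages reclaimed as a percentage
of pages scanned and "velocity" is pages scanned per second of test wall time.
For limitprio-v1r8 that works out as 1228379 / 207629411 ~= 0.6% efficiency,
which rounds down to the 0% above, over roughly 207629411 / 161226.742 ~= 1288
seconds of test time.)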

The patch that blocks kswapd when it encounters too many pages under
writeback severely reduces the amount of scanning activity. Note that the
full series also reduces the amount of slab shrinking, which heavily reduces
the number of inodes reclaimed by kswapd.

Comments?

include/linux/mmzone.h | 16 ++
mm/vmscan.c | 387 +++++++++++++++++++++++++++++--------------------
2 files changed, 245 insertions(+), 158 deletions(-)

--
1.8.1.4


2013-03-17 13:04:31

by Mel Gorman

Subject: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop

kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX
pages have been reclaimed or the pgdat is considered balanced. It then
rechecks if it needs to restart at DEF_PRIORITY and whether high-order
reclaim needs to be reset. This is not wrong per se but it is confusing
to follow and forcing kswapd to stay at DEF_PRIORITY may require several
restarts before it has scanned enough pages to meet the high watermark even
at 100% efficiency. This patch irons out the logic a bit by controlling
when priority is raised and removing the "goto loop_again".

This patch has kswapd raise the scanning priority until it is scanning
enough pages that it could meet the high watermark in one shrink of the
LRU lists if it is able to reclaim at 100% efficiency. It will not raise
the scanning priority higher unless it is failing to reclaim any pages.

To avoid infinite looping for high-order allocation requests kswapd will
not reclaim for high-order allocations when it has reclaimed at least
twice the number of pages as the allocation request.
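
As an illustrative example (my arithmetic, assuming 4K base pages): an order-9
THP allocation requests 512 pages, so kswapd gives up on high-order reclaim
once 2 << 9 = 1024 pages have been reclaimed and falls back to checking the
order-0 watermarks, leaving any remaining work to direct reclaim/compaction.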

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 86 ++++++++++++++++++++++++++++++-------------------------------
1 file changed, 42 insertions(+), 44 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 182ff15..279d0c2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2625,8 +2625,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
/*
* kswapd shrinks the zone by the number of pages required to reach
* the high watermark.
+ *
+ * Returns true if kswapd scanned at least the requested number of
+ * pages to reclaim.
*/
-static void kswapd_shrink_zone(struct zone *zone,
+static bool kswapd_shrink_zone(struct zone *zone,
struct scan_control *sc,
unsigned long lru_pages)
{
@@ -2646,6 +2649,8 @@ static void kswapd_shrink_zone(struct zone *zone,

if (nr_slab == 0 && !zone_reclaimable(zone))
zone->all_unreclaimable = 1;
+
+ return sc->nr_scanned >= sc->nr_to_reclaim;
}

/*
@@ -2672,26 +2677,25 @@ static void kswapd_shrink_zone(struct zone *zone,
static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
int *classzone_idx)
{
- bool pgdat_is_balanced = false;
int i;
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
+ .priority = DEF_PRIORITY,
.may_unmap = 1,
.may_swap = 1,
+ .may_writepage = !laptop_mode,
.order = order,
.target_mem_cgroup = NULL,
};
-loop_again:
- sc.priority = DEF_PRIORITY;
- sc.nr_reclaimed = 0;
- sc.may_writepage = !laptop_mode;
count_vm_event(PAGEOUTRUN);

do {
unsigned long lru_pages = 0;
+ unsigned long nr_reclaimed = sc.nr_reclaimed;
+ bool raise_priority = true;

/*
* Scan in the highmem->dma direction for the highest
@@ -2733,10 +2737,8 @@ loop_again:
}
}

- if (i < 0) {
- pgdat_is_balanced = true;
+ if (i < 0)
goto out;
- }

for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
@@ -2803,8 +2805,16 @@ loop_again:

if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
!zone_balanced(zone, testorder,
- balance_gap, end_zone))
- kswapd_shrink_zone(zone, &sc, lru_pages);
+ balance_gap, end_zone)) {
+ /*
+ * There should be no need to raise the
+ * scanning priority if enough pages are
+ * already being scanned that that high
+ * watermark would be met at 100% efficiency.
+ */
+ if (kswapd_shrink_zone(zone, &sc, lru_pages))
+ raise_priority = false;
+ }

/*
* If we're getting trouble reclaiming, start doing
@@ -2839,46 +2849,33 @@ loop_again:
pfmemalloc_watermark_ok(pgdat))
wake_up(&pgdat->pfmemalloc_wait);

- if (pgdat_balanced(pgdat, order, *classzone_idx)) {
- pgdat_is_balanced = true;
- break; /* kswapd: all done */
- }
-
/*
- * We do this so kswapd doesn't build up large priorities for
- * example when it is freeing in parallel with allocators. It
- * matches the direct reclaim path behaviour in terms of impact
- * on zone->*_priority.
+ * Fragmentation may mean that the system cannot be rebalanced
+ * for high-order allocations in all zones. If twice the
+ * allocation size has been reclaimed and the zones are still
+ * not balanced then recheck the watermarks at order-0 to
+ * prevent kswapd reclaiming excessively. Assume that a
+ * process requested a high-order can direct reclaim/compact.
*/
- if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
- break;
- } while (--sc.priority >= 0);
+ if (order && sc.nr_reclaimed >= 2UL << order)
+ order = sc.order = 0;

-out:
- if (!pgdat_is_balanced) {
- cond_resched();
+ /* Check if kswapd should be suspending */
+ if (try_to_freeze() || kthread_should_stop())
+ break;

- try_to_freeze();
+ /* If no reclaim progress then increase scanning priority */
+ if (sc.nr_reclaimed - nr_reclaimed == 0)
+ raise_priority = true;

/*
- * Fragmentation may mean that the system cannot be
- * rebalanced for high-order allocations in all zones.
- * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
- * it means the zones have been fully scanned and are still
- * not balanced. For high-order allocations, there is
- * little point trying all over again as kswapd may
- * infinite loop.
- *
- * Instead, recheck all watermarks at order-0 as they
- * are the most important. If watermarks are ok, kswapd will go
- * back to sleep. High-order users can still perform direct
- * reclaim if they wish.
+ * Raise priority if scanning rate is too low or there was no
+ * progress in reclaiming pages
*/
- if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
- order = sc.order = 0;
-
- goto loop_again;
- }
+ if (raise_priority || sc.nr_reclaimed - nr_reclaimed == 0)
+ sc.priority--;
+ } while (sc.priority >= 0 &&
+ !pgdat_balanced(pgdat, order, *classzone_idx));

/*
* If kswapd was reclaiming at a higher order, it has the option of
@@ -2907,6 +2904,7 @@ out:
compact_pgdat(pgdat, order);
}

+out:
/*
* Return the order we were reclaiming at so prepare_kswapd_sleep()
* makes a decision on the order we were last reclaiming at. However,
--
1.8.1.4

2013-03-17 13:04:45

by Mel Gorman

Subject: [PATCH 10/10] mm: vmscan: Move logic from balance_pgdat() to kswapd_shrink_zone()

balance_pgdat() is very long and some of the logic can and should
be internal to kswapd_shrink_zone(). Move it so the flow of
balance_pgdat() is marginally easier to follow.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 104 +++++++++++++++++++++++++++++-------------------------------
1 file changed, 51 insertions(+), 53 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8c66e5a..d7cf384 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2660,18 +2660,53 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
* reclaim or if the lack of progress was due to pages under writeback.
*/
static bool kswapd_shrink_zone(struct zone *zone,
+ int classzone_idx,
struct scan_control *sc,
unsigned long lru_pages,
bool shrinking_slab)
{
+ int testorder = sc->order;
unsigned long nr_slab = 0;
+ unsigned long balance_gap;
struct reclaim_state *reclaim_state = current->reclaim_state;
struct shrink_control shrink = {
.gfp_mask = sc->gfp_mask,
};
+ bool lowmem_pressure;

/* Reclaim above the high watermark. */
sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
+
+ /*
+ * Kswapd reclaims only single pages with compaction enabled. Trying
+ * too hard to reclaim until contiguous free pages have become
+ * available can hurt performance by evicting too much useful data
+ * from memory. Do not reclaim more than needed for compaction.
+ */
+ if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
+ compaction_suitable(zone, sc->order) !=
+ COMPACT_SKIPPED)
+ testorder = 0;
+
+ /*
+ * We put equal pressure on every zone, unless one zone has way too
+ * many pages free already. The "too many pages" is defined as the
+ * high wmark plus a "gap" where the gap is either the low
+ * watermark or 1% of the zone, whichever is smaller.
+ */
+ balance_gap = min(low_wmark_pages(zone),
+ (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
+ KSWAPD_ZONE_BALANCE_GAP_RATIO);
+
+ /*
+ * If there is no low memory pressure or the zone is balanced then no
+ * reclaim is necessary
+ */
+ lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
+ if (!(lowmem_pressure || !zone_balanced(zone, testorder,
+ balance_gap, classzone_idx)))
+ return true;
+
shrink_zone(zone, sc);

/*
@@ -2689,6 +2724,16 @@ static bool kswapd_shrink_zone(struct zone *zone,

zone_clear_flag(zone, ZONE_WRITEBACK);

+ /*
+ * If a zone reaches its high watermark, consider it to be no longer
+ * congested. It's possible there are dirty pages backed by congested
+ * BDIs but as pressure is relieved, speculatively avoid congestion
+ * waits.
+ */
+ if (!zone->all_unreclaimable &&
+ zone_balanced(zone, testorder, 0, classzone_idx))
+ zone_clear_flag(zone, ZONE_CONGESTED);
+
return sc->nr_scanned >= sc->nr_to_reclaim;
}

@@ -2821,8 +2866,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
*/
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
- int testorder;
- unsigned long balance_gap;

if (!populated_zone(zone))
continue;
@@ -2843,61 +2886,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
sc.nr_reclaimed += nr_soft_reclaimed;

/*
- * We put equal pressure on every zone, unless
- * one zone has way too many pages free
- * already. The "too many pages" is defined
- * as the high wmark plus a "gap" where the
- * gap is either the low watermark or 1%
- * of the zone, whichever is smaller.
- */
- balance_gap = min(low_wmark_pages(zone),
- (zone->managed_pages +
- KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
- KSWAPD_ZONE_BALANCE_GAP_RATIO);
- /*
- * Kswapd reclaims only single pages with compaction
- * enabled. Trying too hard to reclaim until contiguous
- * free pages have become available can hurt performance
- * by evicting too much useful data from memory.
- * Do not reclaim more than needed for compaction.
+ * There should be no need to raise the scanning
+ * priority if enough pages are already being scanned
+ * that that high watermark would be met at 100%
+ * efficiency.
*/
- testorder = order;
- if (IS_ENABLED(CONFIG_COMPACTION) && order &&
- compaction_suitable(zone, order) !=
- COMPACT_SKIPPED)
- testorder = 0;
-
- if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
- !zone_balanced(zone, testorder,
- balance_gap, end_zone)) {
- /*
- * There should be no need to raise the
- * scanning priority if enough pages are
- * already being scanned that that high
- * watermark would be met at 100% efficiency.
- */
- if (kswapd_shrink_zone(zone, &sc,
+ if (kswapd_shrink_zone(zone, end_zone, &sc,
lru_pages, shrinking_slab))
raise_priority = false;

- nr_to_reclaim += sc.nr_to_reclaim;
- }
-
- if (zone->all_unreclaimable) {
- if (end_zone && end_zone == i)
- end_zone--;
- continue;
- }
-
- if (zone_balanced(zone, testorder, 0, end_zone))
- /*
- * If a zone reaches its high watermark,
- * consider it to be no longer congested. It's
- * possible there are dirty pages backed by
- * congested BDIs but as pressure is relieved,
- * speculatively avoid congestion waits
- */
- zone_clear_flag(zone, ZONE_CONGESTED);
+ nr_to_reclaim += sc.nr_to_reclaim;
}

/*
--
1.8.1.4

2013-03-17 13:04:43

by Mel Gorman

Subject: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

If kswapd fails to make progress but continues to shrink slab then it'll
either discard all of slab or consume CPU uselessly scanning shrinkers.
This patch causes kswapd to only call the shrinkers once per priority.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 28 +++++++++++++++++++++-------
1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7d5a932..84375b2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2661,9 +2661,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
*/
static bool kswapd_shrink_zone(struct zone *zone,
struct scan_control *sc,
- unsigned long lru_pages)
+ unsigned long lru_pages,
+ bool shrinking_slab)
{
- unsigned long nr_slab;
+ unsigned long nr_slab = 0;
struct reclaim_state *reclaim_state = current->reclaim_state;
struct shrink_control shrink = {
.gfp_mask = sc->gfp_mask,
@@ -2673,9 +2674,15 @@ static bool kswapd_shrink_zone(struct zone *zone,
sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
shrink_zone(zone, sc);

- reclaim_state->reclaimed_slab = 0;
- nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
- sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+ /*
+ * Slabs are shrunk for each zone once per priority or if the zone
+ * being balanced is otherwise unreclaimable
+ */
+ if (shrinking_slab || !zone_reclaimable(zone)) {
+ reclaim_state->reclaimed_slab = 0;
+ nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
+ sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+ }

if (nr_slab == 0 && !zone_reclaimable(zone))
zone->all_unreclaimable = 1;
@@ -2713,6 +2720,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
+ bool shrinking_slab = true;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.priority = DEF_PRIORITY,
@@ -2861,7 +2869,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
* already being scanned that that high
* watermark would be met at 100% efficiency.
*/
- if (kswapd_shrink_zone(zone, &sc, lru_pages))
+ if (kswapd_shrink_zone(zone, &sc,
+ lru_pages, shrinking_slab))
raise_priority = false;

nr_to_reclaim += sc.nr_to_reclaim;
@@ -2900,6 +2909,9 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
pfmemalloc_watermark_ok(pgdat))
wake_up(&pgdat->pfmemalloc_wait);

+ /* Only shrink slab once per priority */
+ shrinking_slab = false;
+
/*
* Fragmentation may mean that the system cannot be rebalanced
* for high-order allocations in all zones. If twice the
@@ -2925,8 +2937,10 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
* Raise priority if scanning rate is too low or there was no
* progress in reclaiming pages
*/
- if (raise_priority || !this_reclaimed)
+ if (raise_priority || !this_reclaimed) {
sc.priority--;
+ shrinking_slab = true;
+ }
} while (sc.priority >= 1 &&
!pgdat_balanced(pgdat, order, *classzone_idx));

--
1.8.1.4

2013-03-17 13:05:23

by Mel Gorman

Subject: [PATCH 09/10] mm: vmscan: Check if kswapd should writepage once per priority

Currently kswapd checks if it should start writepage as it shrinks
each zone without taking into consideration if the zone is balanced or
not. This is not wrong as such but it does not make much sense either.
This patch checks once per priority if kswapd should be writing pages.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 84375b2..8c66e5a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2804,6 +2804,13 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
}

/*
+ * If we're getting trouble reclaiming, start doing writepage
+ * even in laptop mode.
+ */
+ if (sc.priority < DEF_PRIORITY - 2)
+ sc.may_writepage = 1;
+
+ /*
* Now scan the zone in the dma->highmem direction, stopping
* at the last zone which needs scanning.
*
@@ -2876,13 +2883,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
nr_to_reclaim += sc.nr_to_reclaim;
}

- /*
- * If we're getting trouble reclaiming, start doing
- * writepage even in laptop mode.
- */
- if (sc.priority < DEF_PRIORITY - 2)
- sc.may_writepage = 1;
-
if (zone->all_unreclaimable) {
if (end_zone && end_zone == i)
end_zone--;
--
1.8.1.4

2013-03-17 13:05:49

by Mel Gorman

Subject: [PATCH 06/10] mm: vmscan: Have kswapd writeback pages based on dirty pages encountered, not priority

Currently kswapd queues dirty pages for writeback if scanning at an elevated
priority but the priority kswapd scans at is not related to the number
of unqueued dirty pages encountered. Since commit "mm: vmscan: Flatten kswapd
priority loop", the priority is related to the size of the LRU and the
zone watermark which is no indication as to whether kswapd should write
pages or not.

This patch tracks if an excessive number of unqueued dirty pages are being
encountered at the end of the LRU. If so, it indicates that dirty pages
are being recycled before flusher threads can clean them and flags the
zone so that kswapd will start writing pages until the zone is balanced.
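
As an illustration of the threshold (assuming DEF_PRIORITY is still 12): the
flag is set when nr_dirty >= nr_taken >> (DEF_PRIORITY - priority), so at
priority 12 every page taken off the LRU would have to be unqueued dirty, at
priority 11 half of them, at priority 10 a quarter, and so on as pressure
increases.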

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 8 ++++++++
mm/vmscan.c | 29 +++++++++++++++++++++++------
2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ede2749..edd6b98 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -495,6 +495,9 @@ typedef enum {
ZONE_CONGESTED, /* zone has many dirty pages backed by
* a congested BDI
*/
+ ZONE_DIRTY, /* reclaim scanning has recently found
+ * many dirty file pages
+ */
} zone_flags_t;

static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -517,6 +520,11 @@ static inline int zone_is_reclaim_congested(const struct zone *zone)
return test_bit(ZONE_CONGESTED, &zone->flags);
}

+static inline int zone_is_reclaim_dirty(const struct zone *zone)
+{
+ return test_bit(ZONE_DIRTY, &zone->flags);
+}
+
static inline int zone_is_reclaim_locked(const struct zone *zone)
{
return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index af3bb6f..493728b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -675,13 +675,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
struct zone *zone,
struct scan_control *sc,
enum ttu_flags ttu_flags,
- unsigned long *ret_nr_dirty,
+ unsigned long *ret_nr_unqueued_dirty,
unsigned long *ret_nr_writeback,
bool force_reclaim)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
int pgactivate = 0;
+ unsigned long nr_unqueued_dirty = 0;
unsigned long nr_dirty = 0;
unsigned long nr_congested = 0;
unsigned long nr_reclaimed = 0;
@@ -807,14 +808,17 @@ static unsigned long shrink_page_list(struct list_head *page_list,
if (PageDirty(page)) {
nr_dirty++;

+ if (!PageWriteback(page))
+ nr_unqueued_dirty++;
+
/*
* Only kswapd can writeback filesystem pages to
- * avoid risk of stack overflow but do not writeback
- * unless under significant pressure.
+ * avoid risk of stack overflow but only writeback
+ * if many dirty pages have been encountered.
*/
if (page_is_file_cache(page) &&
(!current_is_kswapd() ||
- sc->priority >= DEF_PRIORITY - 2)) {
+ !zone_is_reclaim_dirty(zone))) {
/*
* Immediately reclaim when written back.
* Similar in principal to deactivate_page()
@@ -959,7 +963,7 @@ keep:
list_splice(&ret_pages, page_list);
count_vm_events(PGACTIVATE, pgactivate);
mem_cgroup_uncharge_end();
- *ret_nr_dirty += nr_dirty;
+ *ret_nr_unqueued_dirty += nr_unqueued_dirty;
*ret_nr_writeback += nr_writeback;
return nr_reclaimed;
}
@@ -1372,6 +1376,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
(nr_taken >> (DEF_PRIORITY - sc->priority)))
wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

+ /*
+ * Similarly, if many dirty pages are encountered that are not
+ * currently being written then flag that kswapd should start
+ * writing back pages.
+ */
+ if (global_reclaim(sc) && nr_dirty &&
+ nr_dirty >= (nr_taken >> (DEF_PRIORITY - sc->priority)))
+ zone_set_flag(zone, ZONE_DIRTY);
+
trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
zone_idx(zone),
nr_scanned, nr_reclaimed,
@@ -2735,8 +2748,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
end_zone = i;
break;
} else {
- /* If balanced, clear the congested flag */
+ /*
+ * If balanced, clear the dirty and congested
+ * flags
+ */
zone_clear_flag(zone, ZONE_CONGESTED);
+ zone_clear_flag(zone, ZONE_DIRTY);
}
}

--
1.8.1.4

2013-03-17 13:05:43

by Mel Gorman

Subject: [PATCH 07/10] mm: vmscan: Block kswapd if it is encountering pages under writeback

Historically, kswapd used to congestion_wait() at higher priorities if it
was not making forward progress. This made no sense as the failure to make
progress could be completely independent of IO. It was later replaced by
wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't
wait on congested zones in balance_pgdat()) as it was duplicating logic
in shrink_inactive_list().

This is problematic. If kswapd encounters many pages under writeback and
it continues to scan until it reaches the high watermark then it will
quickly skip over the pages under writeback and reclaim clean young
pages or push applications out to swap.

The use of wait_iff_congested() is not suited to kswapd as it will only
stall if the underlying BDI is really congested or a direct reclaimer was
unable to write to the underlying BDI. kswapd bypasses the BDI congestion
as it sets PF_SWAPWRITE but even if this was taken into account then it
would cause direct reclaimers to stall on writeback which is not desirable.

This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
encountering too many pages under writeback. If this flag is set and
kswapd encounters a PageReclaim page under writeback then it'll assume
that the LRU lists are being recycled too quickly before IO can complete
and block waiting for some IO to complete.
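
Condensed, the two halves of the detection look roughly like this (a
simplified sketch of the hunks below, not the literal code):

	/* shrink_inactive_list(): many isolated pages still under writeback */
	if (nr_writeback >= (nr_taken >> (DEF_PRIORITY - sc->priority)))
		zone_set_flag(zone, ZONE_WRITEBACK);

	/* shrink_page_list(): kswapd meets such a page before its IO completes */
	if (PageWriteback(page) && current_is_kswapd() &&
	    zone_is_reclaim_writeback(zone)) {
		wait_on_page_writeback(page);
		zone_clear_flag(zone, ZONE_WRITEBACK);
	}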

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 8 ++++++++
mm/vmscan.c | 29 ++++++++++++++++++++++++-----
2 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index edd6b98..c758fb7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -498,6 +498,9 @@ typedef enum {
ZONE_DIRTY, /* reclaim scanning has recently found
* many dirty file pages
*/
+ ZONE_WRITEBACK, /* reclaim scanning has recently found
+ * many pages under writeback
+ */
} zone_flags_t;

static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -525,6 +528,11 @@ static inline int zone_is_reclaim_dirty(const struct zone *zone)
return test_bit(ZONE_DIRTY, &zone->flags);
}

+static inline int zone_is_reclaim_writeback(const struct zone *zone)
+{
+ return test_bit(ZONE_WRITEBACK, &zone->flags);
+}
+
static inline int zone_is_reclaim_locked(const struct zone *zone)
{
return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 493728b..7d5a932 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -725,6 +725,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,

if (PageWriteback(page)) {
/*
+ * If reclaim is encountering an excessive number of
+ * pages under writeback and this page is both under
+ * writeback and PageReclaim then it indicates that
+ * pages are being queued for IO but are being
+ * recycled through the LRU before the IO can complete.
* This is useless CPU work so wait on the IO to complete.
+ */
+ if (current_is_kswapd() &&
+ zone_is_reclaim_writeback(zone)) {
+ wait_on_page_writeback(page);
+ zone_clear_flag(zone, ZONE_WRITEBACK);
+
+ /*
* memcg doesn't have any dirty pages throttling so we
* could easily OOM just because too many pages are in
* writeback and there is nothing else to reclaim.
@@ -741,7 +754,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so
* testing may_enter_fs here is liable to OOM on them.
*/
- if (global_reclaim(sc) ||
+ } else if (global_reclaim(sc) ||
!PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
/*
* This is slightly racy - end_page_writeback()
@@ -756,9 +769,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
SetPageReclaim(page);
nr_writeback++;
+
goto keep_locked;
+ } else {
+ wait_on_page_writeback(page);
}
- wait_on_page_writeback(page);
}

if (!force_reclaim)
@@ -1373,8 +1388,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* isolated page is PageWriteback
*/
if (nr_writeback && nr_writeback >=
- (nr_taken >> (DEF_PRIORITY - sc->priority)))
+ (nr_taken >> (DEF_PRIORITY - sc->priority))) {
wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+ zone_set_flag(zone, ZONE_WRITEBACK);
+ }

/*
* Similarly, if many dirty pages are encountered that are not
@@ -2639,8 +2656,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
* kswapd shrinks the zone by the number of pages required to reach
* the high watermark.
*
- * Returns true if kswapd scanned at least the requested number of
- * pages to reclaim.
+ * Returns true if kswapd scanned at least the requested number of pages to
+ * reclaim or if the lack of progress was due to pages under writeback.
*/
static bool kswapd_shrink_zone(struct zone *zone,
struct scan_control *sc,
@@ -2663,6 +2680,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
if (nr_slab == 0 && !zone_reclaimable(zone))
zone->all_unreclaimable = 1;

+ zone_clear_flag(zone, ZONE_WRITEBACK);
+
return sc->nr_scanned >= sc->nr_to_reclaim;
}

--
1.8.1.4

2013-03-17 13:06:16

by Mel Gorman

Subject: [PATCH 05/10] mm: vmscan: Do not allow kswapd to scan at maximum priority

Page reclaim at priority 0 will scan the entire LRU as priority 0 is
considered to be a near OOM condition. Kswapd can reach priority 0 quite
easily if it is encountering a large number of pages it cannot reclaim
such as pages under writeback. When this happens, kswapd reclaims very
aggressively even though there may be no real risk of allocation failure
or OOM.

This patch prevents kswapd reaching priority 0 and trying to reclaim
the world. Direct reclaimers will still reach priority 0 in the event
of an OOM situation.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7513bd1..af3bb6f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2891,7 +2891,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
*/
if (raise_priority || !this_reclaimed)
sc.priority--;
- } while (sc.priority >= 0 &&
+ } while (sc.priority >= 1 &&
!pgdat_balanced(pgdat, order, *classzone_idx));

out:
--
1.8.1.4

2013-03-17 13:04:24

by Mel Gorman

Subject: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority. In
many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
reclaimed pages but in the event kswapd scans a large number of pages it
cannot reclaim, it will raise the priority and potentially discard a large
percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
effect is a reclaim "spike" where a large percentage of memory is suddenly
freed. It would be bad enough if this was just unused memory but because
of how anon/file pages are balanced it is possible that applications get
pushed to swap unnecessarily.

This patch limits the number of pages kswapd will reclaim to the high
watermark. Reclaim will overshoot due to it not being a hard limit as
shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities. The number of
pages it reclaims is not adjusted for high-order allocations as kswapd will
reclaim excessively if it is to balance zones for high-order allocations.
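
For example (illustrative numbers only): for a zone with a high watermark of
8192 pages, each call to kswapd_shrink_zone() now sets sc.nr_to_reclaim to
max(SWAP_CLUSTER_MAX, 8192) = 8192 pages instead of ULONG_MAX, so a single
pass may overshoot a little but can no longer reclaim the bulk of the zone.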

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 53 +++++++++++++++++++++++++++++------------------------
1 file changed, 29 insertions(+), 24 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 88c5fed..4835a7a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2593,6 +2593,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
}

/*
+ * kswapd shrinks the zone by the number of pages required to reach
+ * the high watermark.
+ */
+static void kswapd_shrink_zone(struct zone *zone,
+ struct scan_control *sc,
+ unsigned long lru_pages)
+{
+ unsigned long nr_slab;
+ struct reclaim_state *reclaim_state = current->reclaim_state;
+ struct shrink_control shrink = {
+ .gfp_mask = sc->gfp_mask,
+ };
+
+ /* Reclaim above the high watermark. */
+ sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
+ shrink_zone(zone, sc);
+
+ reclaim_state->reclaimed_slab = 0;
+ nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
+ sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+
+ if (nr_slab == 0 && !zone_reclaimable(zone))
+ zone->all_unreclaimable = 1;
+}
+
+/*
* For kswapd, balance_pgdat() will work across all this node's zones until
* they are all at high_wmark_pages(zone).
*
@@ -2619,27 +2645,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
bool pgdat_is_balanced = false;
int i;
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
- unsigned long total_scanned;
- struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.may_unmap = 1,
.may_swap = 1,
- /*
- * kswapd doesn't want to be bailed out while reclaim. because
- * we want to put equal scanning pressure on each zone.
- */
- .nr_to_reclaim = ULONG_MAX,
.order = order,
.target_mem_cgroup = NULL,
};
- struct shrink_control shrink = {
- .gfp_mask = sc.gfp_mask,
- };
loop_again:
- total_scanned = 0;
sc.priority = DEF_PRIORITY;
sc.nr_reclaimed = 0;
sc.may_writepage = !laptop_mode;
@@ -2710,7 +2725,7 @@ loop_again:
*/
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
- int nr_slab, testorder;
+ int testorder;
unsigned long balance_gap;

if (!populated_zone(zone))
@@ -2730,7 +2745,6 @@ loop_again:
order, sc.gfp_mask,
&nr_soft_scanned);
sc.nr_reclaimed += nr_soft_reclaimed;
- total_scanned += nr_soft_scanned;

/*
* We put equal pressure on every zone, unless
@@ -2759,17 +2773,8 @@ loop_again:

if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
!zone_balanced(zone, testorder,
- balance_gap, end_zone)) {
- shrink_zone(zone, &sc);
-
- reclaim_state->reclaimed_slab = 0;
- nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
- sc.nr_reclaimed += reclaim_state->reclaimed_slab;
- total_scanned += sc.nr_scanned;
-
- if (nr_slab == 0 && !zone_reclaimable(zone))
- zone->all_unreclaimable = 1;
- }
+ balance_gap, end_zone))
+ kswapd_shrink_zone(zone, &sc, lru_pages);

/*
* If we're getting trouble reclaiming, start doing
--
1.8.1.4

2013-03-17 13:06:31

by Mel Gorman

Subject: [PATCH 04/10] mm: vmscan: Decide whether to compact the pgdat based on reclaim progress

In the past, kswapd made a decision on whether to compact memory after the
pgdat was considered balanced. This more or less worked but it is late to
make such a decision and does not fit well now that kswapd makes a decision
whether to exit the zone scanning loop depending on reclaim progress.

This patch will compact a pgdat if at least the requested number of pages
were reclaimed from unbalanced zones for a given priority. If any zone is
currently balanced, kswapd will not call compaction as it is expected the
necessary pages are already available.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 52 +++++++++++++++++++++-------------------------------
1 file changed, 21 insertions(+), 31 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 279d0c2..7513bd1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2694,8 +2694,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,

do {
unsigned long lru_pages = 0;
+ unsigned long nr_to_reclaim = 0;
unsigned long nr_reclaimed = sc.nr_reclaimed;
+ unsigned long this_reclaimed;
bool raise_priority = true;
+ bool pgdat_needs_compaction = true;

/*
* Scan in the highmem->dma direction for the highest
@@ -2743,7 +2746,17 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;

+ if (!populated_zone(zone))
+ continue;
+
lru_pages += zone_reclaimable_pages(zone);
+
+ /* Check if the memory needs to be defragmented */
+ if (order && pgdat_needs_compaction &&
+ zone_watermark_ok(zone, order,
+ low_wmark_pages(zone),
+ *classzone_idx, 0))
+ pgdat_needs_compaction = false;
}

/*
@@ -2814,6 +2827,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
*/
if (kswapd_shrink_zone(zone, &sc, lru_pages))
raise_priority = false;
+
+ nr_to_reclaim += sc.nr_to_reclaim;
}

/*
@@ -2864,46 +2879,21 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
if (try_to_freeze() || kthread_should_stop())
break;

- /* If no reclaim progress then increase scanning priority */
- if (sc.nr_reclaimed - nr_reclaimed == 0)
- raise_priority = true;
+ /* Compact if necessary and kswapd is reclaiming efficiently */
+ this_reclaimed = sc.nr_reclaimed - nr_reclaimed;
+ if (order && pgdat_needs_compaction &&
+ this_reclaimed > nr_to_reclaim)
+ compact_pgdat(pgdat, order);

/*
* Raise priority if scanning rate is too low or there was no
* progress in reclaiming pages
*/
- if (raise_priority || sc.nr_reclaimed - nr_reclaimed == 0)
+ if (raise_priority || !this_reclaimed)
sc.priority--;
} while (sc.priority >= 0 &&
!pgdat_balanced(pgdat, order, *classzone_idx));

- /*
- * If kswapd was reclaiming at a higher order, it has the option of
- * sleeping without all zones being balanced. Before it does, it must
- * ensure that the watermarks for order-0 on *all* zones are met and
- * that the congestion flags are cleared. The congestion flag must
- * be cleared as kswapd is the only mechanism that clears the flag
- * and it is potentially going to sleep here.
- */
- if (order) {
- int zones_need_compaction = 1;
-
- for (i = 0; i <= end_zone; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
- if (!populated_zone(zone))
- continue;
-
- /* Check if the memory needs to be defragmented. */
- if (zone_watermark_ok(zone, order,
- low_wmark_pages(zone), *classzone_idx, 0))
- zones_need_compaction = 0;
- }
-
- if (zones_need_compaction)
- compact_pgdat(pgdat, order);
- }
-
out:
/*
* Return the order we were reclaiming at so prepare_kswapd_sleep()
--
1.8.1.4

2013-03-17 13:06:55

by Mel Gorman

Subject: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

Simplistically, the anon and file LRU lists are scanned proportionally
depending on the value of vm.swappiness although there are other factors
taken into account by get_scan_count(). The patch "mm: vmscan: Limit
the number of pages kswapd reclaims" limits the number of pages kswapd
reclaims but it breaks this proportional scanning and may evenly shrink
anon/file LRUs regardless of vm.swappiness.

This patch preserves the proportional scanning and reclaim. It does mean
that kswapd will reclaim more than requested but the number of pages will
be related to the high watermark.
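
To illustrate the normalisation with made-up numbers, here is a standalone
toy (not kernel code) that mimics what recalculate_scan_count() does in the
kswapd case once the reclaim target has been met:

	#include <stdio.h>

	int main(void)
	{
		/*
		 * Hypothetical scan targets from get_scan_count():
		 * swappiness asked for 1 anon page to be scanned for
		 * every 3 file pages.
		 */
		unsigned long nr[2] = { 300, 900 };	/* [0] anon, [1] file */
		unsigned long min = nr[0] < nr[1] ? nr[0] : nr[1];
		int i;

		/*
		 * Instead of abandoning the remaining counts outright,
		 * subtract the smallest remaining count from each list so
		 * scanning continues on the list that was meant to see
		 * more pressure.
		 */
		for (i = 0; i < 2; i++)
			nr[i] -= min;

		/* Prints anon=0 file=600 */
		printf("anon=%lu file=%lu\n", nr[0], nr[1]);
		return 0;
	}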

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 52 +++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 41 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4835a7a..182ff15 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1815,6 +1815,45 @@ out:
}
}

+static void recalculate_scan_count(unsigned long nr_reclaimed,
+ unsigned long nr_to_reclaim,
+ unsigned long nr[NR_LRU_LISTS])
+{
+ enum lru_list l;
+
+ /*
+ * For direct reclaim, reclaim the number of pages requested. Less
+ * care is taken to ensure that scanning for each LRU is properly
+ * proportional. This is unfortunate and is improper aging but
+ * minimises the amount of time a process is stalled.
+ */
+ if (!current_is_kswapd()) {
+ if (nr_reclaimed >= nr_to_reclaim) {
+ for_each_evictable_lru(l)
+ nr[l] = 0;
+ }
+ return;
+ }
+
+ /*
+ * For kswapd, reclaim at least the number of pages requested.
+ * However, ensure that LRUs shrink by the proportion requested
+ * by get_scan_count() so vm.swappiness is obeyed.
+ */
+ if (nr_reclaimed >= nr_to_reclaim) {
+ unsigned long min = ULONG_MAX;
+
+ /* Find the LRU with the fewest pages to reclaim */
+ for_each_evictable_lru(l)
+ if (nr[l] < min)
+ min = nr[l];
+
+ /* Normalise the scan counts so kswapd scans proportionally */
+ for_each_evictable_lru(l)
+ nr[l] -= min;
+ }
+}
+
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
@@ -1841,17 +1880,8 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
lruvec, sc);
}
}
- /*
- * On large memory systems, scan >> priority can become
- * really large. This is fine for the starting priority;
- * we want to put equal scanning pressure on each zone.
- * However, if the VM has a harder time of freeing pages,
- * with multiple processes reclaiming pages, the total
- * freeing target can get unreasonably large.
- */
- if (nr_reclaimed >= nr_to_reclaim &&
- sc->priority < DEF_PRIORITY)
- break;
+
+ recalculate_scan_count(nr_reclaimed, nr_to_reclaim, nr);
}
blk_finish_plug(&plug);
sc->nr_reclaimed += nr_reclaimed;
--
1.8.1.4

2013-03-17 14:36:26

by Andi Kleen

Subject: Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop

Mel Gorman <[email protected]> writes:
>
> To avoid infinite looping for high-order allocation requests kswapd will
> not reclaim for high-order allocations when it has reclaimed at least
> twice the number of pages as the allocation request.

Will this make higher order allocations fail earlier? Or does compaction
still kick in early enough.

I understand the motivation.

-Andi

--
[email protected] -- Speaking for myself only

2013-03-17 14:39:40

by Andi Kleen

Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

Mel Gorman <[email protected]> writes:
> +
> + /*
> + * For direct reclaim, reclaim the number of pages requested. Less
> + * care is taken to ensure that scanning for each LRU is properly
> + * proportional. This is unfortunate and is improper aging but
> + * minimises the amount of time a process is stalled.
> + */
> + if (!current_is_kswapd()) {
> + if (nr_reclaimed >= nr_to_reclaim) {
> + for_each_evictable_lru(l)

Don't we need some NUMA awareness here?
Similar below.

-Andi

--
[email protected] -- Speaking for myself only

2013-03-17 14:42:41

by Andi Kleen

Subject: Re: [PATCH 06/10] mm: vmscan: Have kswapd writeback pages based on dirty pages encountered, not priority

Mel Gorman <[email protected]> writes:

> @@ -495,6 +495,9 @@ typedef enum {
> ZONE_CONGESTED, /* zone has many dirty pages backed by
> * a congested BDI
> */
> + ZONE_DIRTY, /* reclaim scanning has recently found
> + * many dirty file pages
> + */

Needs a better name. ZONE_DIRTY_CONGESTED ?

> + * currently being written then flag that kswapd should start
> + * writing back pages.
> + */
> + if (global_reclaim(sc) && nr_dirty &&
> + nr_dirty >= (nr_taken >> (DEF_PRIORITY - sc->priority)))
> + zone_set_flag(zone, ZONE_DIRTY);
> +
> trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,

I suppose you want to trace the dirty case here too.

-Andi
--
[email protected] -- Speaking for myself only

2013-03-17 14:49:53

by Andi Kleen

Subject: Re: [PATCH 07/10] mm: vmscan: Block kswapd if it is encountering pages under writeback

Mel Gorman <[email protected]> writes:


> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 493728b..7d5a932 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -725,6 +725,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>
> if (PageWriteback(page)) {
> /*
> + * If reclaim is encountering an excessive number of
> + * pages under writeback and this page is both under
> + * writeback and PageReclaim then it indicates that
> + * pages are being queued for IO but are being
> + * recycled through the LRU before the IO can complete.
> + * is useless CPU work so wait on the IO to complete.
> + */
> + if (current_is_kswapd() &&
> + zone_is_reclaim_writeback(zone)) {
> + wait_on_page_writeback(page);
> + zone_clear_flag(zone, ZONE_WRITEBACK);
> +
> + /*

Something is wrong with the indentation here. Comment should be indented
or is the code in the wrong block?

It's not fully clear to me how you decide here that the writeback
situation has cleared. There must be some kind of threshold for it,
but I don't see it. Or do you clear it already when the first page
finishes? That would seem too early.

BTW longer term the code would probably be a lot clearer with a
real explicit state machine instead of all these custom state bits.

-Andi
--
[email protected] -- Speaking for myself only

2013-03-17 14:53:15

by Andi Kleen

Subject: Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

Mel Gorman <[email protected]> writes:

> If kswaps fails to make progress but continues to shrink slab then it'll
> either discard all of slab or consume CPU uselessly scanning shrinkers.
> This patch causes kswapd to only call the shrinkers once per priority.

Great. This was too aggressive for a long time. Probably still needs
more intelligence in the shrinkers themselves to be really good though
(e.g. the old defrag heuristics in dcache)

-Andi
--
[email protected] -- Speaking for myself only

2013-03-17 14:55:59

by Andi Kleen

Subject: Re: [PATCH 10/10] mm: vmscan: Move logic from balance_pgdat() to kswapd_shrink_zone()

Mel Gorman <[email protected]> writes:

> +
> + /*
> + * We put equal pressure on every zone, unless one zone has way too
> + * many pages free already. The "too many pages" is defined as the
> + * high wmark plus a "gap" where the gap is either the low
> + * watermark or 1% of the zone, whichever is smaller.
> + */
> + balance_gap = min(low_wmark_pages(zone),
> + (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
> + KSWAPD_ZONE_BALANCE_GAP_RATIO);

Don't like those hard coded tunables. 1% of a 512GB node can still be
quite a lot. Shouldn't the low watermark be enough?

-Andi
--
[email protected] -- Speaking for myself only

2013-03-17 15:08:13

by Mel Gorman

Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On Sun, Mar 17, 2013 at 07:39:37AM -0700, Andi Kleen wrote:
> Mel Gorman <[email protected]> writes:
> > +
> > + /*
> > + * For direct reclaim, reclaim the number of pages requested. Less
> > + * care is taken to ensure that scanning for each LRU is properly
> > + * proportional. This is unfortunate and is improper aging but
> > + * minimises the amount of time a process is stalled.
> > + */
> > + if (!current_is_kswapd()) {
> > + if (nr_reclaimed >= nr_to_reclaim) {
> > + for_each_evictable_lru(l)
>
> Don't we need some NUMA awareness here?
> Similar below.
>

Of what sort? In this context we are usually dealing with a zone and in
the case of kswapd it is only ever dealing with a single node.

--
Mel Gorman
SUSE Labs

2013-03-17 15:09:40

by Mel Gorman

Subject: Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop

On Sun, Mar 17, 2013 at 07:36:22AM -0700, Andi Kleen wrote:
> Mel Gorman <[email protected]> writes:
> >
> > To avoid infinite looping for high-order allocation requests kswapd will
> > not reclaim for high-order allocations when it has reclaimed at least
> > twice the number of pages as the allocation request.
>
> Will this make higher order allocations fail earlier? Or does compaction
> still kick in early enough.
>

Compaction should still kick in early enough. The impact it might have
is that direct reclaim/compaction may be used more than it was in the
past.

--
Mel Gorman
SUSE Labs

2013-03-17 15:12:00

by Mel Gorman

Subject: Re: [PATCH 06/10] mm: vmscan: Have kswapd writeback pages based on dirty pages encountered, not priority

On Sun, Mar 17, 2013 at 07:42:39AM -0700, Andi Kleen wrote:
> Mel Gorman <[email protected]> writes:
>
> > @@ -495,6 +495,9 @@ typedef enum {
> > ZONE_CONGESTED, /* zone has many dirty pages backed by
> > * a congested BDI
> > */
> > + ZONE_DIRTY, /* reclaim scanning has recently found
> > + * many dirty file pages
> > + */
>
> Needs a better name. ZONE_DIRTY_CONGESTED ?
>

That might be confusing. The underlying BDI is not necessarily
congested. I accept your point though and will try thinking of a better
name.

> > + * currently being written then flag that kswapd should start
> > + * writing back pages.
> > + */
> > + if (global_reclaim(sc) && nr_dirty &&
> > + nr_dirty >= (nr_taken >> (DEF_PRIORITY - sc->priority)))
> > + zone_set_flag(zone, ZONE_DIRTY);
> > +
> > trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
>
> I suppose you want to trace the dirty case here too.
>

I guess it wouldn't hurt to have a new tracepoint for when the flag gets
set. A vmstat might be helpful as well.

--
Mel Gorman
SUSE Labs

2013-03-17 15:19:23

by Mel Gorman

Subject: Re: [PATCH 07/10] mm: vmscan: Block kswapd if it is encountering pages under writeback

On Sun, Mar 17, 2013 at 07:49:50AM -0700, Andi Kleen wrote:
> Mel Gorman <[email protected]> writes:
>
>
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 493728b..7d5a932 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -725,6 +725,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >
> > if (PageWriteback(page)) {
> > /*
> > + * If reclaim is encountering an excessive number of
> > + * pages under writeback and this page is both under
> > + * writeback and PageReclaim then it indicates that
> > + * pages are being queued for IO but are being
> > + * recycled through the LRU before the IO can complete.
> > + * is useless CPU work so wait on the IO to complete.
> > + */
> > + if (current_is_kswapd() &&
> > + zone_is_reclaim_writeback(zone)) {
> > + wait_on_page_writeback(page);
> > + zone_clear_flag(zone, ZONE_WRITEBACK);
> > +
> > + /*
>
> Something is wrong with the indentation here. Comment should be indented
> or is the code in the wrong block?
>

I'll rearrange the comments.

> It's not fully clair to me how you decide here that the writeback
> situation has cleared. There must be some kind of threshold for it,
> but I don't see it. Or do you clear already when the first page
> finished? That would seem too early.
>

I deliberately cleared it when the first page finished. If kswapd blocks
waiting for IO of that page to complete then it cannot be certain that
there are still too many pages at the end of the LRU. By clearing the
flag, it's forced to recheck instead of potentially blocking on the next
page unnecessarily.

What I did get wrong is that I meant to check PageReclaim here as
described in the comment. It must have gotten lost during a rebase.

> BTW longer term the code would probably be a lot clearer with a
> real explicit state machine instead of all these custom state bits.
>

I would expect so even though it'd be a major overhaul.

--
Mel Gorman
SUSE Labs

2013-03-17 15:26:07

by Mel Gorman

Subject: Re: [PATCH 10/10] mm: vmscan: Move logic from balance_pgdat() to kswapd_shrink_zone()

On Sun, Mar 17, 2013 at 07:55:54AM -0700, Andi Kleen wrote:
> Mel Gorman <[email protected]> writes:
>
> > +
> > + /*
> > + * We put equal pressure on every zone, unless one zone has way too
> > + * many pages free already. The "too many pages" is defined as the
> > + * high wmark plus a "gap" where the gap is either the low
> > + * watermark or 1% of the zone, whichever is smaller.
> > + */
> > + balance_gap = min(low_wmark_pages(zone),
> > + (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
> > + KSWAPD_ZONE_BALANCE_GAP_RATIO);
>
> Don't like those hard coded tunables. 1% of a 512GB node can be still
> quite a lot. Shouldn't the low watermark be enough?
>

1% of 512G would be a lot but in that case, it'll use the low watermark as
the balance gap.
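
To put rough numbers on it (made up for illustration, assuming 4K pages
and that KSWAPD_ZONE_BALANCE_GAP_RATIO is the 100 implied by the 1%
comment):

	/* hypothetical 512G zone */
	managed_pages = 512UL << (30 - 12);          /* ~134M pages            */
	one_percent   = (managed_pages + 99) / 100;  /* ~1.34M pages, ~5G      */
	low_wmark     = 65536;                       /* say 256M worth of pages */
	balance_gap   = min(low_wmark, one_percent); /* == low_wmark           */

so on a large zone it is the low watermark, not the 1%, that ends up
being used as the gap.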

--
Mel Gorman
SUSE Labs

2013-03-17 15:40:18

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 07/10] mm: vmscan: Block kswapd if it is encountering pages under writeback

> > BTW longer term the code would probably be a lot clearer with a
> > real explicit state machine instead of all these custom state bits.
> >
>
> I would expect so even though it'd be a major overhaul.

A lot of these VM paths need overhaul because they usually don't
do enough page batching to perform really well on larger systems.

-Andi

--
[email protected] -- Speaking for myself only.

2013-03-18 11:35:08

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 04/10] mm: vmscan: Decide whether to compact the pgdat based on reclaim progress

On Sun, Mar 17, 2013 at 9:04 PM, Mel Gorman <[email protected]> wrote:
> In the past, kswapd makes a decision on whether to compact memory after the
> pgdat was considered balanced. This more or less worked but it is late to
> make such a decision and does not fit well now that kswapd makes a decision
> whether to exit the zone scanning loop depending on reclaim progress.
>
> This patch will compact a pgdat if at least the requested number of pages
> were reclaimed from unbalanced zones for a given priority. If any zone is
> currently balanced, kswapd will not call compaction as it is expected the
> necessary pages are already available.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 52 +++++++++++++++++++++-------------------------------
> 1 file changed, 21 insertions(+), 31 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 279d0c2..7513bd1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2694,8 +2694,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>
> do {
> unsigned long lru_pages = 0;
> + unsigned long nr_to_reclaim = 0;
> unsigned long nr_reclaimed = sc.nr_reclaimed;
> + unsigned long this_reclaimed;
> bool raise_priority = true;
> + bool pgdat_needs_compaction = true;

To show that compaction is needed iff non order-0 reclaim,
bool do_compaction = !!order;

>
> /*
> * Scan in the highmem->dma direction for the highest
> @@ -2743,7 +2746,17 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> for (i = 0; i <= end_zone; i++) {
> struct zone *zone = pgdat->node_zones + i;
>
> + if (!populated_zone(zone))
> + continue;
> +
> lru_pages += zone_reclaimable_pages(zone);
> +
> + /* Check if the memory needs to be defragmented */
Enrich the comment with, say,

	/*
	 * If any zone is currently balanced, kswapd will not call
	 * compaction as it is expected the necessary pages are
	 * already available.
	 */

please, since a big one is deleted below.

> + if (order && pgdat_needs_compaction &&
> + zone_watermark_ok(zone, order,
> + low_wmark_pages(zone),
> + *classzone_idx, 0))
> + pgdat_needs_compaction = false;
> }
>
> /*
> @@ -2814,6 +2827,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> */
> if (kswapd_shrink_zone(zone, &sc, lru_pages))
> raise_priority = false;
> +
> + nr_to_reclaim += sc.nr_to_reclaim;
> }
>
> /*
> @@ -2864,46 +2879,21 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> if (try_to_freeze() || kthread_should_stop())
> break;
>
> - /* If no reclaim progress then increase scanning priority */
> - if (sc.nr_reclaimed - nr_reclaimed == 0)
> - raise_priority = true;
> + /* Compact if necessary and kswapd is reclaiming efficiently */
> + this_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> + if (order && pgdat_needs_compaction &&
> + this_reclaimed > nr_to_reclaim)
> + compact_pgdat(pgdat, order);
>
> /*
> * Raise priority if scanning rate is too low or there was no
> * progress in reclaiming pages
> */
> - if (raise_priority || sc.nr_reclaimed - nr_reclaimed == 0)
> + if (raise_priority || !this_reclaimed)
> sc.priority--;
> } while (sc.priority >= 0 &&
> !pgdat_balanced(pgdat, order, *classzone_idx));
>
> - /*
> - * If kswapd was reclaiming at a higher order, it has the option of
> - * sleeping without all zones being balanced. Before it does, it must
> - * ensure that the watermarks for order-0 on *all* zones are met and
> - * that the congestion flags are cleared. The congestion flag must
> - * be cleared as kswapd is the only mechanism that clears the flag
> - * and it is potentially going to sleep here.
> - */
> - if (order) {
> - int zones_need_compaction = 1;
> -
> - for (i = 0; i <= end_zone; i++) {
> - struct zone *zone = pgdat->node_zones + i;
> -
> - if (!populated_zone(zone))
> - continue;
> -
> - /* Check if the memory needs to be defragmented. */
> - if (zone_watermark_ok(zone, order,
> - low_wmark_pages(zone), *classzone_idx, 0))
> - zones_need_compaction = 0;
> - }
> -
> - if (zones_need_compaction)
> - compact_pgdat(pgdat, order);
> - }
> -
> out:
> /*
> * Return the order we were reclaiming at so prepare_kswapd_sleep()
> --
> 1.8.1.4

2013-03-18 11:37:52

by Simon Jeons

[permalink] [raw]
Subject: Re: [PATCH 07/10] mm: vmscan: Block kswapd if it is encountering pages under writeback

On 03/17/2013 09:04 PM, Mel Gorman wrote:
> Historically, kswapd used to congestion_wait() at higher priorities if it
> was not making forward progress. This made no sense as the failure to make
> progress could be completely independent of IO. It was later replaced by
> wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't
> wait on congested zones in balance_pgdat()) as it was duplicating logic
> in shrink_inactive_list().
>
> This is problematic. If kswapd encounters many pages under writeback and
> it continues to scan until it reaches the high watermark then it will
> quickly skip over the pages under writeback and reclaim clean young
> pages or push applications out to swap.
>
> The use of wait_iff_congested() is not suited to kswapd as it will only
> stall if the underlying BDI is really congested or a direct reclaimer was
> unable to write to the underlying BDI. kswapd bypasses the BDI congestion
> as it sets PF_SWAPWRITE but even if this was taken into account then it

Where is this flag checked?

> would cause direct reclaimers to stall on writeback which is not desirable.
>
> This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
> encountering too many pages under writeback. If this flag is set and
> kswapd encounters a PageReclaim page under writeback then it'll assume
> that the LRU lists are being recycled too quickly before IO can complete
> and block waiting for some IO to complete.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/mmzone.h | 8 ++++++++
> mm/vmscan.c | 29 ++++++++++++++++++++++++-----
> 2 files changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index edd6b98..c758fb7 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -498,6 +498,9 @@ typedef enum {
> ZONE_DIRTY, /* reclaim scanning has recently found
> * many dirty file pages
> */
> + ZONE_WRITEBACK, /* reclaim scanning has recently found
> + * many pages under writeback
> + */
> } zone_flags_t;
>
> static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
> @@ -525,6 +528,11 @@ static inline int zone_is_reclaim_dirty(const struct zone *zone)
> return test_bit(ZONE_DIRTY, &zone->flags);
> }
>
> +static inline int zone_is_reclaim_writeback(const struct zone *zone)
> +{
> + return test_bit(ZONE_WRITEBACK, &zone->flags);
> +}
> +
> static inline int zone_is_reclaim_locked(const struct zone *zone)
> {
> return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 493728b..7d5a932 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -725,6 +725,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>
> if (PageWriteback(page)) {
> /*
> + * If reclaim is encountering an excessive number of
> + * pages under writeback and this page is both under
> + * writeback and PageReclaim then it indicates that
> + * pages are being queued for IO but are being
> + * recycled through the LRU before the IO can complete. This
> + * is useless CPU work so wait on the IO to complete.
> + */
> + if (current_is_kswapd() &&
> + zone_is_reclaim_writeback(zone)) {
> + wait_on_page_writeback(page);
> + zone_clear_flag(zone, ZONE_WRITEBACK);
> +
> + /*
> * memcg doesn't have any dirty pages throttling so we
> * could easily OOM just because too many pages are in
> * writeback and there is nothing else to reclaim.
> @@ -741,7 +754,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so
> * testing may_enter_fs here is liable to OOM on them.
> */
> - if (global_reclaim(sc) ||
> + } else if (global_reclaim(sc) ||
> !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
> /*
> * This is slightly racy - end_page_writeback()
> @@ -756,9 +769,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> */
> SetPageReclaim(page);
> nr_writeback++;
> +
> goto keep_locked;
> + } else {
> + wait_on_page_writeback(page);
> }
> - wait_on_page_writeback(page);
> }
>
> if (!force_reclaim)
> @@ -1373,8 +1388,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> * isolated page is PageWriteback
> */
> if (nr_writeback && nr_writeback >=
> - (nr_taken >> (DEF_PRIORITY - sc->priority)))
> + (nr_taken >> (DEF_PRIORITY - sc->priority))) {
> wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> + zone_set_flag(zone, ZONE_WRITEBACK);
> + }
>
> /*
> * Similarly, if many dirty pages are encountered that are not
> @@ -2639,8 +2656,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> * kswapd shrinks the zone by the number of pages required to reach
> * the high watermark.
> *
> - * Returns true if kswapd scanned at least the requested number of
> - * pages to reclaim.
> + * Returns true if kswapd scanned at least the requested number of pages to
> > + * reclaim or if the lack of progress was due to pages under writeback.
> */
> static bool kswapd_shrink_zone(struct zone *zone,
> struct scan_control *sc,
> @@ -2663,6 +2680,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
> if (nr_slab == 0 && !zone_reclaimable(zone))
> zone->all_unreclaimable = 1;
>
> + zone_clear_flag(zone, ZONE_WRITEBACK);
> +
> return sc->nr_scanned >= sc->nr_to_reclaim;
> }
>

2013-03-18 23:59:00

by Simon Jeons

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop

Hi Mel,
On 03/17/2013 09:04 PM, Mel Gorman wrote:
> kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX
> pages have been reclaimed or the pgdat is considered balanced. It then
> rechecks if it needs to restart at DEF_PRIORITY and whether high-order
> reclaim needs to be reset. This is not wrong per-se but it is confusing
> to follow and forcing kswapd to stay at DEF_PRIORITY may require several
> restarts before it has scanned enough pages to meet the high watermark even
> at 100% efficiency. This patch irons out the logic a bit by controlling
> when priority is raised and removing the "goto loop_again".
>
> This patch has kswapd raise the scanning priority until it is scanning
> enough pages that it could meet the high watermark in one shrink of the
> LRU lists if it is able to reclaim at 100% efficiency. It will not raise
> the scanning priority higher unless it is failing to reclaim any pages.
>
> To avoid infinite looping for high-order allocation requests kswapd will
> not reclaim for high-order allocations when it has reclaimed at least
> twice the number of pages as the allocation request.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 86 ++++++++++++++++++++++++++++++-------------------------------
> 1 file changed, 42 insertions(+), 44 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 182ff15..279d0c2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2625,8 +2625,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> /*
> * kswapd shrinks the zone by the number of pages required to reach
> * the high watermark.
> + *
> + * Returns true if kswapd scanned at least the requested number of
> + * pages to reclaim.
> */
> -static void kswapd_shrink_zone(struct zone *zone,
> +static bool kswapd_shrink_zone(struct zone *zone,
> struct scan_control *sc,
> unsigned long lru_pages)
> {
> @@ -2646,6 +2649,8 @@ static void kswapd_shrink_zone(struct zone *zone,
>
> if (nr_slab == 0 && !zone_reclaimable(zone))
> zone->all_unreclaimable = 1;
> +
> + return sc->nr_scanned >= sc->nr_to_reclaim;
> }
>
> /*
> @@ -2672,26 +2677,25 @@ static void kswapd_shrink_zone(struct zone *zone,
> static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> int *classzone_idx)
> {
> - bool pgdat_is_balanced = false;
> int i;
> int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
> unsigned long nr_soft_reclaimed;
> unsigned long nr_soft_scanned;
> struct scan_control sc = {
> .gfp_mask = GFP_KERNEL,
> + .priority = DEF_PRIORITY,
> .may_unmap = 1,
> .may_swap = 1,
> + .may_writepage = !laptop_mode,

What's the influence of this change? If there are large numbers of
anonymous pages and very few file pages, anonymous pages will not be
swapped out when priority >= DEF_PRIORITY-2, so the scan makes no sense.
> .order = order,
> .target_mem_cgroup = NULL,
> };
> -loop_again:
> - sc.priority = DEF_PRIORITY;
> - sc.nr_reclaimed = 0;
> - sc.may_writepage = !laptop_mode;
> count_vm_event(PAGEOUTRUN);
>
> do {
> unsigned long lru_pages = 0;
> + unsigned long nr_reclaimed = sc.nr_reclaimed;
> + bool raise_priority = true;
>
> /*
> * Scan in the highmem->dma direction for the highest
> @@ -2733,10 +2737,8 @@ loop_again:
> }
> }
>
> - if (i < 0) {
> - pgdat_is_balanced = true;
> + if (i < 0)
> goto out;
> - }
>
> for (i = 0; i <= end_zone; i++) {
> struct zone *zone = pgdat->node_zones + i;
> @@ -2803,8 +2805,16 @@ loop_again:
>
> if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
> !zone_balanced(zone, testorder,
> - balance_gap, end_zone))
> - kswapd_shrink_zone(zone, &sc, lru_pages);
> + balance_gap, end_zone)) {
> + /*
> + * There should be no need to raise the
> + * scanning priority if enough pages are
> + * already being scanned that the high
> + * watermark would be met at 100% efficiency.
> + */
> + if (kswapd_shrink_zone(zone, &sc, lru_pages))
> + raise_priority = false;
> + }
>
> /*
> * If we're getting trouble reclaiming, start doing
> @@ -2839,46 +2849,33 @@ loop_again:
> pfmemalloc_watermark_ok(pgdat))
> wake_up(&pgdat->pfmemalloc_wait);
>
> - if (pgdat_balanced(pgdat, order, *classzone_idx)) {
> - pgdat_is_balanced = true;
> - break; /* kswapd: all done */
> - }
> -
> /*
> - * We do this so kswapd doesn't build up large priorities for
> - * example when it is freeing in parallel with allocators. It
> - * matches the direct reclaim path behaviour in terms of impact
> - * on zone->*_priority.
> + * Fragmentation may mean that the system cannot be rebalanced
> + * for high-order allocations in all zones. If twice the
> + * allocation size has been reclaimed and the zones are still
> + * not balanced then recheck the watermarks at order-0 to
> + * prevent kswapd reclaiming excessively. Assume that a
> + * process requested a high-order can direct reclaim/compact.
> */
> - if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> - break;
> - } while (--sc.priority >= 0);
> + if (order && sc.nr_reclaimed >= 2UL << order)
> + order = sc.order = 0;
>
> -out:
> - if (!pgdat_is_balanced) {
> - cond_resched();
> + /* Check if kswapd should be suspending */
> + if (try_to_freeze() || kthread_should_stop())
> + break;
>
> - try_to_freeze();
> + /* If no reclaim progress then increase scanning priority */
> + if (sc.nr_reclaimed - nr_reclaimed == 0)
> + raise_priority = true;
>
> /*
> - * Fragmentation may mean that the system cannot be
> - * rebalanced for high-order allocations in all zones.
> - * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
> - * it means the zones have been fully scanned and are still
> - * not balanced. For high-order allocations, there is
> - * little point trying all over again as kswapd may
> - * infinite loop.
> - *
> - * Instead, recheck all watermarks at order-0 as they
> - * are the most important. If watermarks are ok, kswapd will go
> - * back to sleep. High-order users can still perform direct
> - * reclaim if they wish.
> + * Raise priority if scanning rate is too low or there was no
> + * progress in reclaiming pages
> */
> - if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> - order = sc.order = 0;
> -
> - goto loop_again;
> - }
> + if (raise_priority || sc.nr_reclaimed - nr_reclaimed == 0)
> + sc.priority--;
> + } while (sc.priority >= 0 &&
> + !pgdat_balanced(pgdat, order, *classzone_idx));
>
> /*
> * If kswapd was reclaiming at a higher order, it has the option of
> @@ -2907,6 +2904,7 @@ out:
> compact_pgdat(pgdat, order);
> }
>
> +out:
> /*
> * Return the order we were reclaiming at so prepare_kswapd_sleep()
> * makes a decision on the order we were last reclaiming at. However,

2013-03-18 23:59:50

by Simon Jeons

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

Hi Mel,
On 03/17/2013 09:04 PM, Mel Gorman wrote:
> The number of pages kswapd can reclaim is bound by the number of pages it
> scans which is related to the size of the zone and the scanning priority. In
> many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
> reclaimed pages but in the event kswapd scans a large number of pages it
> cannot reclaim, it will raise the priority and potentially discard a large
> percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
> effect is a reclaim "spike" where a large percentage of memory is suddenly
> freed. It would be bad enough if this was just unused memory but because

Since shrink_lruvec checks nr_reclaimed >= nr_to_reclaim once the
priority has been raised beyond DEF_PRIORITY, how can a large percentage
of memory be suddenly freed?

> of how anon/file pages are balanced it is possible that applications get
> pushed to swap unnecessarily.
>
> This patch limits the number of pages kswapd will reclaim to the high
> watermark. Reclaim will overshoot due to it not being a hard limit as
> shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
> prevents kswapd reclaiming the world at higher priorities. The number of
> pages it reclaims is not adjusted for high-order allocations as kswapd will
> reclaim excessively if it is to balance zones for high-order allocations.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 53 +++++++++++++++++++++++++++++------------------------
> 1 file changed, 29 insertions(+), 24 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 88c5fed..4835a7a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2593,6 +2593,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> }
>
> /*
> + * kswapd shrinks the zone by the number of pages required to reach
> + * the high watermark.
> + */
> +static void kswapd_shrink_zone(struct zone *zone,
> + struct scan_control *sc,
> + unsigned long lru_pages)
> +{
> + unsigned long nr_slab;
> + struct reclaim_state *reclaim_state = current->reclaim_state;
> + struct shrink_control shrink = {
> + .gfp_mask = sc->gfp_mask,
> + };
> +
> + /* Reclaim above the high watermark. */
> + sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> + shrink_zone(zone, sc);
> +
> + reclaim_state->reclaimed_slab = 0;
> + nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> + sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> +
> + if (nr_slab == 0 && !zone_reclaimable(zone))
> + zone->all_unreclaimable = 1;
> +}
> +
> +/*
> * For kswapd, balance_pgdat() will work across all this node's zones until
> * they are all at high_wmark_pages(zone).
> *
> @@ -2619,27 +2645,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> bool pgdat_is_balanced = false;
> int i;
> int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
> - unsigned long total_scanned;
> - struct reclaim_state *reclaim_state = current->reclaim_state;
> unsigned long nr_soft_reclaimed;
> unsigned long nr_soft_scanned;
> struct scan_control sc = {
> .gfp_mask = GFP_KERNEL,
> .may_unmap = 1,
> .may_swap = 1,
> - /*
> - * kswapd doesn't want to be bailed out while reclaim. because
> - * we want to put equal scanning pressure on each zone.
> - */
> - .nr_to_reclaim = ULONG_MAX,
> .order = order,
> .target_mem_cgroup = NULL,
> };
> - struct shrink_control shrink = {
> - .gfp_mask = sc.gfp_mask,
> - };
> loop_again:
> - total_scanned = 0;
> sc.priority = DEF_PRIORITY;
> sc.nr_reclaimed = 0;
> sc.may_writepage = !laptop_mode;
> @@ -2710,7 +2725,7 @@ loop_again:
> */
> for (i = 0; i <= end_zone; i++) {
> struct zone *zone = pgdat->node_zones + i;
> - int nr_slab, testorder;
> + int testorder;
> unsigned long balance_gap;
>
> if (!populated_zone(zone))
> @@ -2730,7 +2745,6 @@ loop_again:
> order, sc.gfp_mask,
> &nr_soft_scanned);
> sc.nr_reclaimed += nr_soft_reclaimed;
> - total_scanned += nr_soft_scanned;
>
> /*
> * We put equal pressure on every zone, unless
> @@ -2759,17 +2773,8 @@ loop_again:
>
> if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
> !zone_balanced(zone, testorder,
> - balance_gap, end_zone)) {
> - shrink_zone(zone, &sc);
> -
> - reclaim_state->reclaimed_slab = 0;
> - nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> - sc.nr_reclaimed += reclaim_state->reclaimed_slab;
> - total_scanned += sc.nr_scanned;
> -
> - if (nr_slab == 0 && !zone_reclaimable(zone))
> - zone->all_unreclaimable = 1;
> - }
> + balance_gap, end_zone))
> + kswapd_shrink_zone(zone, &sc, lru_pages);
>
> /*
> * If we're getting trouble reclaiming, start doing

2013-03-19 03:08:32

by Simon Jeons

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop

Hi Mel,
On 03/17/2013 09:04 PM, Mel Gorman wrote:
> kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX
> pages have been reclaimed or the pgdat is considered balanced. It then
> rechecks if it needs to restart at DEF_PRIORITY and whether high-order
> reclaim needs to be reset. This is not wrong per-se but it is confusing

per-se is short for what?

> to follow and forcing kswapd to stay at DEF_PRIORITY may require several
> restarts before it has scanned enough pages to meet the high watermark even
> at 100% efficiency. This patch irons out the logic a bit by controlling
> when priority is raised and removing the "goto loop_again".
>
> This patch has kswapd raise the scanning priority until it is scanning
> enough pages that it could meet the high watermark in one shrink of the
> LRU lists if it is able to reclaim at 100% efficiency. It will not raise

Which kind of reclaim can be treated as 100% efficiency?

> the scanning priority higher unless it is failing to reclaim any pages.
>
> To avoid infinite looping for high-order allocation requests kswapd will
> not reclaim for high-order allocations when it has reclaimed at least
> twice the number of pages as the allocation request.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 86 ++++++++++++++++++++++++++++++-------------------------------
> 1 file changed, 42 insertions(+), 44 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 182ff15..279d0c2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2625,8 +2625,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> /*
> * kswapd shrinks the zone by the number of pages required to reach
> * the high watermark.
> + *
> + * Returns true if kswapd scanned at least the requested number of
> + * pages to reclaim.
> */
> -static void kswapd_shrink_zone(struct zone *zone,
> +static bool kswapd_shrink_zone(struct zone *zone,
> struct scan_control *sc,
> unsigned long lru_pages)
> {
> @@ -2646,6 +2649,8 @@ static void kswapd_shrink_zone(struct zone *zone,
>
> if (nr_slab == 0 && !zone_reclaimable(zone))
> zone->all_unreclaimable = 1;
> +
> + return sc->nr_scanned >= sc->nr_to_reclaim;
> }
>
> /*
> @@ -2672,26 +2677,25 @@ static void kswapd_shrink_zone(struct zone *zone,
> static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> int *classzone_idx)
> {
> - bool pgdat_is_balanced = false;
> int i;
> int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
> unsigned long nr_soft_reclaimed;
> unsigned long nr_soft_scanned;
> struct scan_control sc = {
> .gfp_mask = GFP_KERNEL,
> + .priority = DEF_PRIORITY,
> .may_unmap = 1,
> .may_swap = 1,
> + .may_writepage = !laptop_mode,
> .order = order,
> .target_mem_cgroup = NULL,
> };
> -loop_again:
> - sc.priority = DEF_PRIORITY;
> - sc.nr_reclaimed = 0;
> - sc.may_writepage = !laptop_mode;
> count_vm_event(PAGEOUTRUN);
>
> do {
> unsigned long lru_pages = 0;
> + unsigned long nr_reclaimed = sc.nr_reclaimed;
> + bool raise_priority = true;
>
> /*
> * Scan in the highmem->dma direction for the highest
> @@ -2733,10 +2737,8 @@ loop_again:
> }
> }
>
> - if (i < 0) {
> - pgdat_is_balanced = true;
> + if (i < 0)
> goto out;
> - }
>
> for (i = 0; i <= end_zone; i++) {
> struct zone *zone = pgdat->node_zones + i;
> @@ -2803,8 +2805,16 @@ loop_again:
>
> if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
> !zone_balanced(zone, testorder,
> - balance_gap, end_zone))
> - kswapd_shrink_zone(zone, &sc, lru_pages);
> + balance_gap, end_zone)) {
> + /*
> + * There should be no need to raise the
> + * scanning priority if enough pages are
> > + * already being scanned that the high
> + * watermark would be met at 100% efficiency.
> + */
> + if (kswapd_shrink_zone(zone, &sc, lru_pages))
> + raise_priority = false;
> + }
>
> /*
> * If we're getting trouble reclaiming, start doing
> @@ -2839,46 +2849,33 @@ loop_again:
> pfmemalloc_watermark_ok(pgdat))
> wake_up(&pgdat->pfmemalloc_wait);
>
> - if (pgdat_balanced(pgdat, order, *classzone_idx)) {
> - pgdat_is_balanced = true;
> - break; /* kswapd: all done */
> - }
> -
> /*
> - * We do this so kswapd doesn't build up large priorities for
> - * example when it is freeing in parallel with allocators. It
> - * matches the direct reclaim path behaviour in terms of impact
> - * on zone->*_priority.
> + * Fragmentation may mean that the system cannot be rebalanced
> + * for high-order allocations in all zones. If twice the
> + * allocation size has been reclaimed and the zones are still
> + * not balanced then recheck the watermarks at order-0 to
> + * prevent kswapd reclaiming excessively. Assume that a
> + * process requested a high-order can direct reclaim/compact.
> */
> - if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> - break;
> - } while (--sc.priority >= 0);
> + if (order && sc.nr_reclaimed >= 2UL << order)
> + order = sc.order = 0;

If order == 0 is met, should we do defrag for it?

>
> -out:
> - if (!pgdat_is_balanced) {
> - cond_resched();
> + /* Check if kswapd should be suspending */
> + if (try_to_freeze() || kthread_should_stop())
> + break;
>
> - try_to_freeze();
> + /* If no reclaim progress then increase scanning priority */
> + if (sc.nr_reclaimed - nr_reclaimed == 0)
> + raise_priority = true;
>
> /*
> - * Fragmentation may mean that the system cannot be
> - * rebalanced for high-order allocations in all zones.
> - * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
> - * it means the zones have been fully scanned and are still
> - * not balanced. For high-order allocations, there is
> - * little point trying all over again as kswapd may
> - * infinite loop.
> - *
> - * Instead, recheck all watermarks at order-0 as they
> - * are the most important. If watermarks are ok, kswapd will go
> - * back to sleep. High-order users can still perform direct
> - * reclaim if they wish.
> + * Raise priority if scanning rate is too low or there was no
> + * progress in reclaiming pages
> */
> - if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> - order = sc.order = 0;
> -
> - goto loop_again;
> - }
> + if (raise_priority || sc.nr_reclaimed - nr_reclaimed == 0)
> + sc.priority--;
> + } while (sc.priority >= 0 &&
> + !pgdat_balanced(pgdat, order, *classzone_idx));
>
> /*
> * If kswapd was reclaiming at a higher order, it has the option of
> @@ -2907,6 +2904,7 @@ out:
> compact_pgdat(pgdat, order);
> }
>
> +out:
> /*
> * Return the order we were reclaiming at so prepare_kswapd_sleep()
> * makes a decision on the order we were last reclaiming at. However,

2013-03-19 08:23:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop

On Tue 19-03-13 11:08:23, Simon Jeons wrote:
> Hi Mel,
> On 03/17/2013 09:04 PM, Mel Gorman wrote:
> >kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX
> >pages have been reclaimed or the pgdat is considered balanced. It then
> >rechecks if it needs to restart at DEF_PRIORITY and whether high-order
> >reclaim needs to be reset. This is not wrong per-se but it is confusing
>
> per-se is short for what?
>
> >to follow and forcing kswapd to stay at DEF_PRIORITY may require several
> >restarts before it has scanned enough pages to meet the high watermark even
> >at 100% efficiency. This patch irons out the logic a bit by controlling
> >when priority is raised and removing the "goto loop_again".
> >
> >This patch has kswapd raise the scanning priority until it is scanning
> >enough pages that it could meet the high watermark in one shrink of the
> >LRU lists if it is able to reclaim at 100% efficiency. It will not raise
>
> Which kind of reclaim can be treated as 100% efficiency?

nr_scanned == nr_reclaimed

> >the scanning priority higher unless it is failing to reclaim any pages.
> >
> >To avoid infinite looping for high-order allocation requests kswapd will
> >not reclaim for high-order allocations when it has reclaimed at least
> >twice the number of pages as the allocation request.
> >
> >Signed-off-by: Mel Gorman <[email protected]>
[...]
--
Michal Hocko
SUSE Labs

2013-03-19 09:55:22

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On Tue, Mar 19, 2013 at 07:53:16AM +0800, Simon Jeons wrote:
> Hi Mel,
> On 03/17/2013 09:04 PM, Mel Gorman wrote:
> >The number of pages kswapd can reclaim is bound by the number of pages it
> >scans which is related to the size of the zone and the scanning priority. In
> >many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
> >reclaimed pages but in the event kswapd scans a large number of pages it
> >cannot reclaim, it will raise the priority and potentially discard a large
> >percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
> >effect is a reclaim "spike" where a large percentage of memory is suddenly
> >freed. It would be bad enough if this was just unused memory but because
>
> Since shrink_lruvec checks nr_reclaimed >= nr_to_reclaim once the
> priority has been raised beyond DEF_PRIORITY, how can a large
> percentage of memory be suddenly freed?
>

Because of the priority checks made in get_scan_count(). Patch 5 has
more detail on why this happens.
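
For reference, the scan target calculation in get_scan_count() is
roughly (paraphrased from mm/vmscan.c, with the proportional anon/file
split trimmed out):

	scan = get_lru_size(lruvec, lru);
	scan >>= sc->priority;	/* at priority 0 this is the whole LRU */

so once kswapd has climbed all the way to priority 0 every LRU is
scanned in full, and with nr_to_reclaim effectively unbounded that scan
can turn into reclaiming most of the zone.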

--
Mel Gorman
SUSE Labs

2013-03-19 10:12:24

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop

On Tue, Mar 19, 2013 at 07:58:51AM +0800, Simon Jeons wrote:
> >@@ -2672,26 +2677,25 @@ static void kswapd_shrink_zone(struct zone *zone,
> > static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> > int *classzone_idx)
> > {
> >- bool pgdat_is_balanced = false;
> > int i;
> > int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
> > unsigned long nr_soft_reclaimed;
> > unsigned long nr_soft_scanned;
> > struct scan_control sc = {
> > .gfp_mask = GFP_KERNEL,
> >+ .priority = DEF_PRIORITY,
> > .may_unmap = 1,
> > .may_swap = 1,
> >+ .may_writepage = !laptop_mode,
>
> What's the influence of this change? If there are large numbers of
> anonymous pages and very few file pages, anonymous pages will not
> be swapped out when priority >= DEF_PRIORITY-2, so the scan makes no sense.

None. The initialisation just moves from where it was after the
loop_again label to here. See the next hunk.

> > .order = order,
> > .target_mem_cgroup = NULL,
> > };
> >-loop_again:
> >- sc.priority = DEF_PRIORITY;
> >- sc.nr_reclaimed = 0;
> >- sc.may_writepage = !laptop_mode;
> > count_vm_event(PAGEOUTRUN);
> > do {
> > unsigned long lru_pages = 0;
> >+ unsigned long nr_reclaimed = sc.nr_reclaimed;
> >+ bool raise_priority = true;
> > /*
> > * Scan in the highmem->dma direction for the highest

--
Mel Gorman
SUSE Labs

2013-03-19 10:14:32

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop

On Tue, Mar 19, 2013 at 11:08:23AM +0800, Simon Jeons wrote:
> Hi Mel,
> On 03/17/2013 09:04 PM, Mel Gorman wrote:
> >kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX
> >pages have been reclaimed or the pgdat is considered balanced. It then
> >rechecks if it needs to restart at DEF_PRIORITY and whether high-order
> >reclaim needs to be reset. This is not wrong per-se but it is confusing
>
> per-se is short for what?
>

It means "in self" or "as such".

> >to follow and forcing kswapd to stay at DEF_PRIORITY may require several
> >restarts before it has scanned enough pages to meet the high watermark even
> >at 100% efficiency. This patch irons out the logic a bit by controlling
> >when priority is raised and removing the "goto loop_again".
> >
> >This patch has kswapd raise the scanning priority until it is scanning
> >enough pages that it could meet the high watermark in one shrink of the
> >LRU lists if it is able to reclaim at 100% efficiency. It will not raise
>
> Which kind of reclaim can be treated as 100% efficiency?
>

100% efficiency is where every page scanned can be reclaimed immediately.

> > /*
> >- * We do this so kswapd doesn't build up large priorities for
> >- * example when it is freeing in parallel with allocators. It
> >- * matches the direct reclaim path behaviour in terms of impact
> >- * on zone->*_priority.
> >+ * Fragmentation may mean that the system cannot be rebalanced
> >+ * for high-order allocations in all zones. If twice the
> >+ * allocation size has been reclaimed and the zones are still
> >+ * not balanced then recheck the watermarks at order-0 to
> >+ * prevent kswapd reclaiming excessively. Assume that a
> >+ * process requested a high-order can direct reclaim/compact.
> > */
> >- if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> >- break;
> >- } while (--sc.priority >= 0);
> >+ if (order && sc.nr_reclaimed >= 2UL << order)
> >+ order = sc.order = 0;
>
> If order == 0 is met, should we do defrag for it?
>

Compaction is unnecessary for order-0.

--
Mel Gorman
SUSE Labs

2013-03-19 10:17:00

by Simon Jeons

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

Hi Mel,
On 03/19/2013 05:55 PM, Mel Gorman wrote:
> On Tue, Mar 19, 2013 at 07:53:16AM +0800, Simon Jeons wrote:
>> Hi Mel,
>> On 03/17/2013 09:04 PM, Mel Gorman wrote:
>>> The number of pages kswapd can reclaim is bound by the number of pages it
>>> scans which is related to the size of the zone and the scanning priority. In
>>> many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
>>> reclaimed pages but in the event kswapd scans a large number of pages it
>>> cannot reclaim, it will raise the priority and potentially discard a large
>>> percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
>>> effect is a reclaim "spike" where a large percentage of memory is suddenly
>>> freed. It would be bad enough if this was just unused memory but because
>> Since shrink_lruvec checks nr_reclaimed >= nr_to_reclaim once the
>> priority has been raised beyond DEF_PRIORITY, how can a large
>> percentage of memory be suddenly freed?
>>
> Because of the priority checks made in get_scan_count(). Patch 5 has
> more detail on why this happens.
>
But the nr_reclaimed >= nr_to_reclaim check in shrink_lruvec is made after
each evictable LRU has been scanned, so if priority == 0 it still scans the
whole world.

2013-03-19 10:19:43

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/10] mm: vmscan: Decide whether to compact the pgdat based on reclaim progress

On Mon, Mar 18, 2013 at 07:11:30PM +0800, Wanpeng Li wrote:
> >@@ -2864,46 +2879,21 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> > if (try_to_freeze() || kthread_should_stop())
> > break;
> >
> >- /* If no reclaim progress then increase scanning priority */
> >- if (sc.nr_reclaimed - nr_reclaimed == 0)
> >- raise_priority = true;
> >+ /* Compact if necessary and kswapd is reclaiming efficiently */
> >+ this_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> >+ if (order && pgdat_needs_compaction &&
> >+ this_reclaimed > nr_to_reclaim)
> >+ compact_pgdat(pgdat, order);
> >
>
> Hi Mel,
>
> If you should check compaction_suitable here to confirm it's not because
> other reasons like large number of pages under writeback to avoid blind
> compaction. :-)
>

This starts as a question but it is not a question so I am not sure how
I should respond.

Checking compaction_suitable here is unnecessary because compact_pgdat()
makes the same check when it calls compact_zone().
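
The relevant path, roughly, is:

	compact_pgdat(pgdat, order)
	  -> __compact_pgdat()
	       -> compact_zone(zone, cc)
	            -> compaction_suitable(zone, cc->order)
	               /* bails out with COMPACT_SKIPPED/COMPACT_PARTIAL
	                * before doing any work if the zone is unsuitable */

so a second compaction_suitable() check in balance_pgdat() would only
duplicate that.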

--
Mel Gorman
SUSE Labs

2013-03-19 10:26:52

by Simon Jeons

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop

Hi Mel,
On 03/19/2013 06:14 PM, Mel Gorman wrote:
> On Tue, Mar 19, 2013 at 11:08:23AM +0800, Simon Jeons wrote:
>> Hi Mel,
>> On 03/17/2013 09:04 PM, Mel Gorman wrote:
>>> kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX
>>> pages have been reclaimed or the pgdat is considered balanced. It then
>>> rechecks if it needs to restart at DEF_PRIORITY and whether high-order
>>> reclaim needs to be reset. This is not wrong per-se but it is confusing
>> per-se is short for what?
>>
> It means "in self" or "as such".
>
>>> to follow and forcing kswapd to stay at DEF_PRIORITY may require several
>>> restarts before it has scanned enough pages to meet the high watermark even
>>> at 100% efficiency. This patch irons out the logic a bit by controlling
>>> when priority is raised and removing the "goto loop_again".
>>>
>>> This patch has kswapd raise the scanning priority until it is scanning
>>> enough pages that it could meet the high watermark in one shrink of the
>>> LRU lists if it is able to reclaim at 100% efficiency. It will not raise
>> Which kind of reclaim can be treated as 100% efficiency?
>>
> 100% efficiency is where every page scanned can be reclaimed immediately.
>
>>> /*
>>> - * We do this so kswapd doesn't build up large priorities for
>>> - * example when it is freeing in parallel with allocators. It
>>> - * matches the direct reclaim path behaviour in terms of impact
>>> - * on zone->*_priority.
>>> + * Fragmentation may mean that the system cannot be rebalanced
>>> + * for high-order allocations in all zones. If twice the
>>> + * allocation size has been reclaimed and the zones are still
>>> + * not balanced then recheck the watermarks at order-0 to
>>> + * prevent kswapd reclaiming excessively. Assume that a
>>> + * process requested a high-order can direct reclaim/compact.
>>> */
>>> - if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
>>> - break;
>>> - } while (--sc.priority >= 0);
>>> + if (order && sc.nr_reclaimed >= 2UL << order)
>>> + order = sc.order = 0;
>> If order == 0 is met, should we do defrag for it?
>>
> Compaction is unnecessary for order-0.
>

I mean since order && sc.nr_reclaimed >= 2UL << order means it was reclaimed
for a high-order allocation, if order == 0 is met, should we do defrag for it?

2013-03-19 10:27:07

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/10] mm: vmscan: Decide whether to compact the pgdat based on reclaim progress

On Mon, Mar 18, 2013 at 07:35:04PM +0800, Hillf Danton wrote:
> On Sun, Mar 17, 2013 at 9:04 PM, Mel Gorman <[email protected]> wrote:
> > In the past, kswapd makes a decision on whether to compact memory after the
> > pgdat was considered balanced. This more or less worked but it is late to
> > make such a decision and does not fit well now that kswapd makes a decision
> > whether to exit the zone scanning loop depending on reclaim progress.
> >
> > This patch will compact a pgdat if at least the requested number of pages
> > were reclaimed from unbalanced zones for a given priority. If any zone is
> > currently balanced, kswapd will not call compaction as it is expected the
> > necessary pages are already available.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/vmscan.c | 52 +++++++++++++++++++++-------------------------------
> > 1 file changed, 21 insertions(+), 31 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 279d0c2..7513bd1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2694,8 +2694,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> >
> > do {
> > unsigned long lru_pages = 0;
> > + unsigned long nr_to_reclaim = 0;
> > unsigned long nr_reclaimed = sc.nr_reclaimed;
> > + unsigned long this_reclaimed;
> > bool raise_priority = true;
> > + bool pgdat_needs_compaction = true;
>
> To show that compaction is needed iff non order-o reclaim,
> bool do_compaction = !!order;
>

An order check is already made where relevant. It could be part of how
pgdat_needs_compaction gets initialised but I did not think it helped
readability.
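
(The variant being discussed would just be something along the lines of

	bool pgdat_needs_compaction = (order > 0);

with the later "order &&" tests dropped, which is equivalent but, as
noted, arguably no easier to read.)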

> >
> > /*
> > * Scan in the highmem->dma direction for the highest
> > @@ -2743,7 +2746,17 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> > for (i = 0; i <= end_zone; i++) {
> > struct zone *zone = pgdat->node_zones + i;
> >
> > + if (!populated_zone(zone))
> > + continue;
> > +
> > lru_pages += zone_reclaimable_pages(zone);
> > +
> > + /* Check if the memory needs to be defragmented */
> Enrich the comment with, say,
>
>	/*
>	 * If any zone is currently balanced, kswapd will not call
>	 * compaction as it is expected the necessary pages are
>	 * already available.
>	 */
>
> please, since a big one is deleted below.
>

Ok, done.

--
Mel Gorman
SUSE Labs

2013-03-19 10:35:19

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 06/10] mm: vmscan: Have kswapd writeback pages based on dirty pages encountered, not priority

On Mon, Mar 18, 2013 at 07:08:50PM +0800, Wanpeng Li wrote:
> >@@ -2735,8 +2748,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> > end_zone = i;
> > break;
> > } else {
> >- /* If balanced, clear the congested flag */
> >+ /*
> >+ * If balanced, clear the dirty and congested
> >+ * flags
> >+ */
> > zone_clear_flag(zone, ZONE_CONGESTED);
> >+ zone_clear_flag(zone, ZONE_DIRTY);
>
> Hi Mel,
>
> There are two places in balance_pgdat() that clear the ZONE_CONGESTED
> flag: one is while scanning zones whose free_pages <= high_wmark_pages(zone),
> the other is when a zone becomes balanced after reclaim. It seems that you
> missed the latter one.
>

I did and it's fixed now. Thanks.
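
Presumably the fix is just to clear both flags at the second site as
well, i.e. something like (sketch only):

	if (zone_balanced(zone, testorder, 0, end_zone)) {
		zone_clear_flag(zone, ZONE_CONGESTED);
		zone_clear_flag(zone, ZONE_DIRTY);
	}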

--
Mel Gorman
SUSE Labs

2013-03-19 10:57:55

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 07/10] mm: vmscan: Block kswapd if it is encountering pages under writeback

On Mon, Mar 18, 2013 at 07:37:42PM +0800, Simon Jeons wrote:
> On 03/17/2013 09:04 PM, Mel Gorman wrote:
> >Historically, kswapd used to congestion_wait() at higher priorities if it
> >was not making forward progress. This made no sense as the failure to make
> >progress could be completely independent of IO. It was later replaced by
> >wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't
> >wait on congested zones in balance_pgdat()) as it was duplicating logic
> >in shrink_inactive_list().
> >
> >This is problematic. If kswapd encounters many pages under writeback and
> >it continues to scan until it reaches the high watermark then it will
> >quickly skip over the pages under writeback and reclaim clean young
> >pages or push applications out to swap.
> >
> >The use of wait_iff_congested() is not suited to kswapd as it will only
> >stall if the underlying BDI is really congested or a direct reclaimer was
> >unable to write to the underlying BDI. kswapd bypasses the BDI congestion
> >as it sets PF_SWAPWRITE but even if this was taken into account then it
>
> Where is this flag checked?
>

may_write_to_queue
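
For reference, that check lives in may_write_to_queue() in mm/vmscan.c
and looks roughly like this (paraphrased; the exact body in v3.9 may
differ slightly):

	static int may_write_to_queue(struct backing_dev_info *bdi,
				      struct scan_control *sc)
	{
		/* kswapd runs with PF_SWAPWRITE, so it always passes */
		if (current->flags & PF_SWAPWRITE)
			return 1;
		if (!bdi_write_congested(bdi))
			return 1;
		if (bdi == current->backing_dev_info)
			return 1;
		return 0;
	}

which is why kswapd is never throttled by BDI congestion here.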

--
Mel Gorman
SUSE Labs

2013-03-19 10:58:34

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 07/10] mm: vmscan: Block kswapd if it is encountering pages under writeback

On Mon, Mar 18, 2013 at 07:58:27PM +0800, Wanpeng Li wrote:
> On Sun, Mar 17, 2013 at 01:04:13PM +0000, Mel Gorman wrote:
> >Historically, kswapd used to congestion_wait() at higher priorities if it
> >was not making forward progress. This made no sense as the failure to make
> >progress could be completely independent of IO. It was later replaced by
> >wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't
> >wait on congested zones in balance_pgdat()) as it was duplicating logic
> >in shrink_inactive_list().
> >
> >This is problematic. If kswapd encounters many pages under writeback and
> >it continues to scan until it reaches the high watermark then it will
> >quickly skip over the pages under writeback and reclaim clean young
> >pages or push applications out to swap.
> >
> >The use of wait_iff_congested() is not suited to kswapd as it will only
> >stall if the underlying BDI is really congested or a direct reclaimer was
> >unable to write to the underlying BDI. kswapd bypasses the BDI congestion
> >as it sets PF_SWAPWRITE but even if this was taken into account then it
> >would cause direct reclaimers to stall on writeback which is not desirable.
> >
> >This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
> >encountering too many pages under writeback. If this flag is set and
> >kswapd encounters a PageReclaim page under writeback then it'll assume
> >that the LRU lists are being recycled too quickly before IO can complete
> >and block waiting for some IO to complete.
> >
> >Signed-off-by: Mel Gorman <[email protected]>
> >---
> > include/linux/mmzone.h | 8 ++++++++
> > mm/vmscan.c | 29 ++++++++++++++++++++++++-----
> > 2 files changed, 32 insertions(+), 5 deletions(-)
> >
> >diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >index edd6b98..c758fb7 100644
> >--- a/include/linux/mmzone.h
> >+++ b/include/linux/mmzone.h
> >@@ -498,6 +498,9 @@ typedef enum {
> > ZONE_DIRTY, /* reclaim scanning has recently found
> > * many dirty file pages
> > */
> >+ ZONE_WRITEBACK, /* reclaim scanning has recently found
> >+ * many pages under writeback
> >+ */
> > } zone_flags_t;
> >
> > static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
> >@@ -525,6 +528,11 @@ static inline int zone_is_reclaim_dirty(const struct zone *zone)
> > return test_bit(ZONE_DIRTY, &zone->flags);
> > }
> >
> >+static inline int zone_is_reclaim_writeback(const struct zone *zone)
> >+{
> >+ return test_bit(ZONE_WRITEBACK, &zone->flags);
> >+}
> >+
> > static inline int zone_is_reclaim_locked(const struct zone *zone)
> > {
> > return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> >index 493728b..7d5a932 100644
> >--- a/mm/vmscan.c
> >+++ b/mm/vmscan.c
> >@@ -725,6 +725,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >
> > if (PageWriteback(page)) {
> > /*
> >+ * If reclaim is encountering an excessive number of
> >+ * pages under writeback and this page is both under
>
> Should the comment be changed to "encountered an excessive number of
> pages under writeback or this page is both under writeback and PageReclaim"?
> See below:
>

I intended to check for PageReclaim as well but it got lost in a merge
error. Fixed now.

--
Mel Gorman
SUSE Labs

2013-03-19 10:59:46

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On Tue, Mar 19, 2013 at 06:16:50PM +0800, Simon Jeons wrote:
> Hi Mel,
> On 03/19/2013 05:55 PM, Mel Gorman wrote:
> >On Tue, Mar 19, 2013 at 07:53:16AM +0800, Simon Jeons wrote:
> >>Hi Mel,
> >>On 03/17/2013 09:04 PM, Mel Gorman wrote:
> >>>The number of pages kswapd can reclaim is bound by the number of pages it
> >>>scans which is related to the size of the zone and the scanning priority. In
> >>>many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
> >>>reclaimed pages but in the event kswapd scans a large number of pages it
> >>>cannot reclaim, it will raise the priority and potentially discard a large
> >>>percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
> >>>effect is a reclaim "spike" where a large percentage of memory is suddenly
> >>>freed. It would be bad enough if this was just unused memory but because
> >>Since shrink_lruvec checks nr_reclaimed >= nr_to_reclaim once the
> >>priority has been raised beyond DEF_PRIORITY, how can a large
> >>percentage of memory be suddenly freed?
> >>
> >Because of the priority checks made in get_scan_count(). Patch 5 has
> >more detail on why this happens.
> >
> But the nr_reclaimed >= nr_to_reclaim check in shrink_lruvec is made
> after each evictable LRU has been scanned, so if priority == 0 it still
> scans the whole world.
>

Patch 5 deals with the case where priority == 0.

--
Mel Gorman
SUSE Labs

2013-03-19 11:01:42

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop

On Tue, Mar 19, 2013 at 06:26:43PM +0800, Simon Jeons wrote:
> >>>- if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> >>>- break;
> >>>- } while (--sc.priority >= 0);
> >>>+ if (order && sc.nr_reclaimed >= 2UL << order)
> >>>+ order = sc.order = 0;
> >>If order == 0 is met, should we do defrag for it?
> >>
> >Compaction is unnecessary for order-0.
> >
>
> I mean since order && sc.nr_reclaimed >= 2UL << order means it was
> reclaimed for a high-order allocation, if order == 0 is met, should we
> do defrag for it?
>

I don't get this question at all. We do not defrag via compaction for
order-0 allocation requests because it makes no sense.

--
Mel Gorman
SUSE Labs

2013-03-19 11:06:53

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 07/10] mm: vmscan: Block kswapd if it is encountering pages under writeback

On Sun, Mar 17, 2013 at 04:40:13PM +0100, Andi Kleen wrote:
> > > BTW longer term the code would probably be a lot clearer with a
> > > real explicit state machine instead of all these custom state bits.
> > >
> >
> > I would expect so even though it'd be a major overhaul.
>
> A lot of these VM paths need overhaul because they usually don't
> do enough page batching to perform really well on larger systems.
>

While I agree this is also a serious issue and one you brought up last year,
the issue here is that page reclaim is making bad decisions for ordinary
machines. The figures in the leader patch show that a single-threaded
background write is enough to push an active application into swap.

For reclaim, the batching that is meant to mitigate part of this problem
is page lruvecs but that has been causing its own problems recently. At
some point the bullet will have to be bitten by removing pagevecs, seeing
what falls out, and then designing and implementing a better batching
mechanism for handling large numbers of struct pages.

--
Mel Gorman
SUSE Labs

2013-03-20 16:18:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On Sun 17-03-13 13:04:07, Mel Gorman wrote:
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 88c5fed..4835a7a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2593,6 +2593,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> }
>
> /*
> + * kswapd shrinks the zone by the number of pages required to reach
> + * the high watermark.
> + */
> +static void kswapd_shrink_zone(struct zone *zone,
> + struct scan_control *sc,
> + unsigned long lru_pages)
> +{
> + unsigned long nr_slab;
> + struct reclaim_state *reclaim_state = current->reclaim_state;
> + struct shrink_control shrink = {
> + .gfp_mask = sc->gfp_mask,
> + };
> +
> + /* Reclaim above the high watermark. */
> + sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));

OK, so the cap is at high watermark which sounds OK to me, although I
would expect balance_gap being considered here. Is it not used
intentionally or you just wanted to have a reasonable upper bound?

I am not objecting to that it just hit my eyes.

> + shrink_zone(zone, sc);
> +
> + reclaim_state->reclaimed_slab = 0;
> + nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> + sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> +
> + if (nr_slab == 0 && !zone_reclaimable(zone))
> + zone->all_unreclaimable = 1;
> +}
> +
> +/*
> * For kswapd, balance_pgdat() will work across all this node's zones until
> * they are all at high_wmark_pages(zone).
> *
--
Michal Hocko
SUSE Labs

2013-03-21 00:53:22

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On 03/17/2013 09:04 AM, Mel Gorman wrote:
> The number of pages kswapd can reclaim is bound by the number of pages it
> scans which is related to the size of the zone and the scanning priority. In
> many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
> reclaimed pages but in the event kswapd scans a large number of pages it
> cannot reclaim, it will raise the priority and potentially discard a large
> percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
> effect is a reclaim "spike" where a large percentage of memory is suddenly
> freed. It would be bad enough if this was just unused memory but because
> of how anon/file pages are balanced it is possible that applications get
> pushed to swap unnecessarily.
>
> This patch limits the number of pages kswapd will reclaim to the high
> watermark. Reclaim will overshoot due to it not being a hard limit as
> shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
> prevents kswapd reclaiming the world at higher priorities. The number of
> pages it reclaims is not adjusted for high-order allocations as kswapd will
> reclaim excessively if it is to balance zones for high-order allocations.
>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>


--
All rights reversed

2013-03-21 00:53:26

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On 03/20/2013 12:18 PM, Michal Hocko wrote:
> On Sun 17-03-13 13:04:07, Mel Gorman wrote:
> [...]
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 88c5fed..4835a7a 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2593,6 +2593,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>> }
>>
>> /*
>> + * kswapd shrinks the zone by the number of pages required to reach
>> + * the high watermark.
>> + */
>> +static void kswapd_shrink_zone(struct zone *zone,
>> + struct scan_control *sc,
>> + unsigned long lru_pages)
>> +{
>> + unsigned long nr_slab;
>> + struct reclaim_state *reclaim_state = current->reclaim_state;
>> + struct shrink_control shrink = {
>> + .gfp_mask = sc->gfp_mask,
>> + };
>> +
>> + /* Reclaim above the high watermark. */
>> + sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
>
> OK, so the cap is at high watermark which sounds OK to me, although I
> would expect balance_gap being considered here. Is it not used
> intentionally or you just wanted to have a reasonable upper bound?
>
> I am not objecting to that it just hit my eyes.

This is the maximum number of pages to reclaim, not the point
at which to stop reclaiming.

I assume Mel chose this value because it guarantees that enough
pages will have been freed, while also making sure that the value
is scaled according to zone size (keeping pressure between zones
roughly equal).

--
All rights reversed

2013-03-21 01:11:29

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On 03/17/2013 09:04 AM, Mel Gorman wrote:
> Simplistically, the anon and file LRU lists are scanned proportionally
> depending on the value of vm.swappiness although there are other factors
> taken into account by get_scan_count(). The patch "mm: vmscan: Limit
> the number of pages kswapd reclaims" limits the number of pages kswapd
> reclaims but it breaks this proportional scanning and may evenly shrink
> anon/file LRUs regardless of vm.swappiness.
>
> This patch preserves the proportional scanning and reclaim. It does mean
> that kswapd will reclaim more than requested but the number of pages will
> be related to the high watermark.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 52 +++++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 41 insertions(+), 11 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4835a7a..182ff15 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1815,6 +1815,45 @@ out:
> }
> }
>
> +static void recalculate_scan_count(unsigned long nr_reclaimed,
> + unsigned long nr_to_reclaim,
> + unsigned long nr[NR_LRU_LISTS])
> +{
> + enum lru_list l;
> +
> + /*
> + * For direct reclaim, reclaim the number of pages requested. Less
> + * care is taken to ensure that scanning for each LRU is properly
> + * proportional. This is unfortunate and is improper aging but
> + * minimises the amount of time a process is stalled.
> + */
> + if (!current_is_kswapd()) {
> + if (nr_reclaimed >= nr_to_reclaim) {
> + for_each_evictable_lru(l)
> + nr[l] = 0;
> + }
> + return;
> + }

This part is obvious.

> + /*
> + * For kswapd, reclaim at least the number of pages requested.
> + * However, ensure that LRUs shrink by the proportion requested
> + * by get_scan_count() so vm.swappiness is obeyed.
> + */
> + if (nr_reclaimed >= nr_to_reclaim) {
> + unsigned long min = ULONG_MAX;
> +
> + /* Find the LRU with the fewest pages to reclaim */
> + for_each_evictable_lru(l)
> + if (nr[l] < min)
> + min = nr[l];
> +
> + /* Normalise the scan counts so kswapd scans proportionally */
> + for_each_evictable_lru(l)
> + nr[l] -= min;
> + }
> +}

This part took me a bit longer to get.

Before getting to this point, we scanned the LRUs evenly.
By subtracting min from all of the LRUs, we end up stopping
the scanning of the LRU where we have the fewest pages left
to scan.

This results in the scanning being concentrated where it
should be - on the LRUs where we have not done nearly
enough scanning yet.
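
A standalone sketch of that step (per-LRU counts invented for
illustration) makes the effect visible: the smallest list is zeroed and
scanning continues on the lists that still have the most work left.

#include <stdio.h>

enum { INACTIVE_ANON, ACTIVE_ANON, INACTIVE_FILE, ACTIVE_FILE, NR_LRU };

int main(void)
{
	/* pages left to scan per LRU; values invented for illustration */
	unsigned long nr[NR_LRU] = { 40, 10, 400, 100 };
	unsigned long min = (unsigned long)-1;
	int l;

	for (l = 0; l < NR_LRU; l++)
		if (nr[l] < min)
			min = nr[l];

	/* the smallest list drops to zero; the others continue with what is left */
	for (l = 0; l < NR_LRU; l++)
		nr[l] -= min;

	for (l = 0; l < NR_LRU; l++)
		printf("nr[%d] = %lu\n", l, nr[l]);
	return 0;
}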

However, I am not sure how to document it better than
your comment already has...

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2013-03-21 01:21:11

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 05/10] mm: vmscan: Do not allow kswapd to scan at maximum priority

On 03/17/2013 09:04 AM, Mel Gorman wrote:
> Page reclaim at priority 0 will scan the entire LRU as priority 0 is
> considered to be a near OOM condition. Kswapd can reach priority 0 quite
> easily if it is encountering a large number of pages it cannot reclaim
> such as pages under writeback. When this happens, kswapd reclaims very
> aggressively even though there may be no real risk of allocation failure
> or OOM.
>
> This patch prevents kswapd reaching priority 0 and trying to reclaim
> the world. Direct reclaimers will still reach priority 0 in the event
> of an OOM situation.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7513bd1..af3bb6f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2891,7 +2891,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> */
> if (raise_priority || !this_reclaimed)
> sc.priority--;
> - } while (sc.priority >= 0 &&
> + } while (sc.priority >= 1 &&
> !pgdat_balanced(pgdat, order, *classzone_idx));
>
> out:
>

If priority 0 is way way way way way too aggressive, what makes
priority 1 safe?

This makes me wonder, are the priorities useful at all to kswapd?

--
All rights reversed

2013-03-21 09:47:19

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On Wed, Mar 20, 2013 at 05:18:47PM +0100, Michal Hocko wrote:
> On Sun 17-03-13 13:04:07, Mel Gorman wrote:
> [...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 88c5fed..4835a7a 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2593,6 +2593,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> > }
> >
> > /*
> > + * kswapd shrinks the zone by the number of pages required to reach
> > + * the high watermark.
> > + */
> > +static void kswapd_shrink_zone(struct zone *zone,
> > + struct scan_control *sc,
> > + unsigned long lru_pages)
> > +{
> > + unsigned long nr_slab;
> > + struct reclaim_state *reclaim_state = current->reclaim_state;
> > + struct shrink_control shrink = {
> > + .gfp_mask = sc->gfp_mask,
> > + };
> > +
> > + /* Reclaim above the high watermark. */
> > + sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
>
> OK, so the cap is at high watermark which sounds OK to me, although I
> would expect balance_gap being considered here. Is it not used
> intentionally or you just wanted to have a reasonable upper bound?
>

It's intentional. The balance_gap is taken into account before the
decision to shrink but not afterwards. As the watermark check after
shrinking is based on just the high watermark, I decided to have
shrink_zone reclaim on that basis.
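
A small user-space sketch of that asymmetry (all numbers are assumed and
this is not the real zone_watermark_ok() logic):

#include <stdbool.h>
#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

int main(void)
{
	unsigned long free_pages  = 9000;	/* assumed */
	unsigned long high_wmark  = 8192;	/* assumed */
	unsigned long balance_gap = 1024;	/* assumed */

	/* the decision to shrink considers the balance gap... */
	bool needs_shrink = free_pages < high_wmark + balance_gap;

	/* ...but the reclaim target is based on the high watermark alone */
	unsigned long nr_to_reclaim =
		high_wmark > SWAP_CLUSTER_MAX ? high_wmark : SWAP_CLUSTER_MAX;

	printf("needs_shrink=%d nr_to_reclaim=%lu\n", needs_shrink, nr_to_reclaim);
	return 0;
}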

--
Mel Gorman
SUSE Labs

2013-03-21 09:54:52

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On Wed, Mar 20, 2013 at 09:10:31PM -0400, Rik van Riel wrote:
> On 03/17/2013 09:04 AM, Mel Gorman wrote:
> >Simplistically, the anon and file LRU lists are scanned proportionally
> >depending on the value of vm.swappiness although there are other factors
> >taken into account by get_scan_count(). The patch "mm: vmscan: Limit
> >the number of pages kswapd reclaims" limits the number of pages kswapd
> >reclaims but it breaks this proportional scanning and may evenly shrink
> >anon/file LRUs regardless of vm.swappiness.
> >
> >This patch preserves the proportional scanning and reclaim. It does mean
> >that kswapd will reclaim more than requested but the number of pages will
> >be related to the high watermark.
> >
> >Signed-off-by: Mel Gorman <[email protected]>
> >---
> > mm/vmscan.c | 52 +++++++++++++++++++++++++++++++++++++++++-----------
> > 1 file changed, 41 insertions(+), 11 deletions(-)
> >
> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> >index 4835a7a..182ff15 100644
> >--- a/mm/vmscan.c
> >+++ b/mm/vmscan.c
> >@@ -1815,6 +1815,45 @@ out:
> > }
> > }
> >
> >+static void recalculate_scan_count(unsigned long nr_reclaimed,
> >+ unsigned long nr_to_reclaim,
> >+ unsigned long nr[NR_LRU_LISTS])
> >+{
> >+ enum lru_list l;
> >+
> >+ /*
> >+ * For direct reclaim, reclaim the number of pages requested. Less
> >+ * care is taken to ensure that scanning for each LRU is properly
> >+ * proportional. This is unfortunate and is improper aging but
> >+ * minimises the amount of time a process is stalled.
> >+ */
> >+ if (!current_is_kswapd()) {
> >+ if (nr_reclaimed >= nr_to_reclaim) {
> >+ for_each_evictable_lru(l)
> >+ nr[l] = 0;
> >+ }
> >+ return;
> >+ }
>
> This part is obvious.
>
> >+ /*
> >+ * For kswapd, reclaim at least the number of pages requested.
> >+ * However, ensure that LRUs shrink by the proportion requested
> >+ * by get_scan_count() so vm.swappiness is obeyed.
> >+ */
> >+ if (nr_reclaimed >= nr_to_reclaim) {
> >+ unsigned long min = ULONG_MAX;
> >+
> >+ /* Find the LRU with the fewest pages to reclaim */
> >+ for_each_evictable_lru(l)
> >+ if (nr[l] < min)
> >+ min = nr[l];
> >+
> >+ /* Normalise the scan counts so kswapd scans proportionally */
> >+ for_each_evictable_lru(l)
> >+ nr[l] -= min;
> >+ }
> >+}
>
> This part took me a bit longer to get.
>
> Before getting to this point, we scanned the LRUs evenly.
> By subtracting min from all of the LRUs, we end up stopping
> the scanning of the LRU where we have the fewest pages left
> to scan.
>
> This results in the scanning being concentrated where it
> should be - on the LRUs where we have not done nearly
> enough scanning yet.
>

This is exactly what my intention was. It does mean that we potentially
reclaim much more than required by sc->nr_to_reclaim, but I did not think
of a straight-forward way around it that would work in every case.

> However, I am not sure how to document it better than
> your comment already has...
>
> Acked-by: Rik van Riel <[email protected]>
>

Thanks.

--
Mel Gorman
SUSE Labs

2013-03-21 10:12:17

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 05/10] mm: vmscan: Do not allow kswapd to scan at maximum priority

On Wed, Mar 20, 2013 at 09:20:14PM -0400, Rik van Riel wrote:
> On 03/17/2013 09:04 AM, Mel Gorman wrote:
> >Page reclaim at priority 0 will scan the entire LRU as priority 0 is
> >considered to be a near OOM condition. Kswapd can reach priority 0 quite
> >easily if it is encountering a large number of pages it cannot reclaim
> >such as pages under writeback. When this happens, kswapd reclaims very
> >aggressively even though there may be no real risk of allocation failure
> >or OOM.
> >
> >This patch prevents kswapd reaching priority 0 and trying to reclaim
> >the world. Direct reclaimers will still reach priority 0 in the event
> >of an OOM situation.
> >
> >Signed-off-by: Mel Gorman <[email protected]>
> >---
> > mm/vmscan.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> >index 7513bd1..af3bb6f 100644
> >--- a/mm/vmscan.c
> >+++ b/mm/vmscan.c
> >@@ -2891,7 +2891,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> > */
> > if (raise_priority || !this_reclaimed)
> > sc.priority--;
> >- } while (sc.priority >= 0 &&
> >+ } while (sc.priority >= 1 &&
> > !pgdat_balanced(pgdat, order, *classzone_idx));
> >
> > out:
> >
>
> If priority 0 is way way way way way too aggressive, what makes
> priority 1 safe?
>

The fact that priority 1 selects a sensible number of pages to reclaim and
obeys swappiness makes it a lot safer. Priority 0 does this in get_scan_count

/*
* Do not apply any pressure balancing cleverness when the
* system is close to OOM, scan both anon and file equally
* (unless the swappiness setting disagrees with swapping).
*/
if (!sc->priority && vmscan_swappiness(sc)) {
scan_balance = SCAN_EQUAL;
goto out;
}

.....

size = get_lru_size(lruvec, lru);
scan = size >> sc->priority;

if (!scan && force_scan)
scan = min(size, SWAP_CLUSTER_MAX);

switch (scan_balance) {
case SCAN_EQUAL:
/* Scan lists relative to size */
break;

.....
}
nr[lru] = scan;

That is saying -- at priority 0, scan everything in all LRUs. When put in
combination with patch 2 it effectively means reclaim everything in all LRUs.
It reclaims every file page it can and swaps as much as possible resulting
in major slowdowns.

> This makes me wonder, are the priorities useful at all to kswapd?
>

They are not as useful as I'd like. Just a streaming writer will be
enough to ensure that the lower priorities will never reclaim enough
pages to move the zone from the min to the high watermark, making the
low priorities almost completely useless.
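
A rough user-space sketch of the arithmetic (the zone size and watermark
gap are assumed values): scan = size >> priority only covers the whole
LRU at priority 0, and at the low priorities the scan is far smaller
than the min-to-high watermark gap.

#include <stdio.h>

int main(void)
{
	unsigned long zone_pages = 1UL << 20;	/* ~4GB zone, assumed */
	unsigned long wmark_gap  = 8192;	/* min->high gap, assumed */
	int priority;

	for (priority = 12; priority >= 0; priority -= 3) {
		unsigned long scan = zone_pages >> priority;

		printf("priority %2d: scan %7lu pages (%s the %lu page gap)\n",
		       priority, scan,
		       scan >= wmark_gap ? "covers" : "short of",
		       wmark_gap);
	}
	return 0;
}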

--
Mel Gorman
SUSE Labs

2013-03-21 12:31:34

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 05/10] mm: vmscan: Do not allow kswapd to scan at maximum priority

On 03/21/2013 06:12 AM, Mel Gorman wrote:
> On Wed, Mar 20, 2013 at 09:20:14PM -0400, Rik van Riel wrote:
>> On 03/17/2013 09:04 AM, Mel Gorman wrote:
>>> Page reclaim at priority 0 will scan the entire LRU as priority 0 is
>>> considered to be a near OOM condition. Kswapd can reach priority 0 quite
>>> easily if it is encountering a large number of pages it cannot reclaim
>>> such as pages under writeback. When this happens, kswapd reclaims very
>>> aggressively even though there may be no real risk of allocation failure
>>> or OOM.
>>>
>>> This patch prevents kswapd reaching priority 0 and trying to reclaim
>>> the world. Direct reclaimers will still reach priority 0 in the event
>>> of an OOM situation.
>>>
>>> Signed-off-by: Mel Gorman <[email protected]>
>>> ---
>>> mm/vmscan.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 7513bd1..af3bb6f 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -2891,7 +2891,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>>> */
>>> if (raise_priority || !this_reclaimed)
>>> sc.priority--;
>>> - } while (sc.priority >= 0 &&
>>> + } while (sc.priority >= 1 &&
>>> !pgdat_balanced(pgdat, order, *classzone_idx));
>>>
>>> out:
>>>
>>
>> If priority 0 is way way way way way too aggressive, what makes
>> priority 1 safe?
>>
>
> The fact that priority 1 selects a sensible number of pages to reclaim and
> obeys swappiness makes it a lot safer. Priority 0 does this in get_scan_count
^^^^^^^^^^^^^^^^

Ahhh, good point! We stay away from all the "emergency" code, which
kswapd should never run.

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2013-03-21 12:59:42

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On Thu 21-03-13 09:47:13, Mel Gorman wrote:
> On Wed, Mar 20, 2013 at 05:18:47PM +0100, Michal Hocko wrote:
> > On Sun 17-03-13 13:04:07, Mel Gorman wrote:
> > [...]
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 88c5fed..4835a7a 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2593,6 +2593,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> > > }
> > >
> > > /*
> > > + * kswapd shrinks the zone by the number of pages required to reach
> > > + * the high watermark.
> > > + */
> > > +static void kswapd_shrink_zone(struct zone *zone,
> > > + struct scan_control *sc,
> > > + unsigned long lru_pages)
> > > +{
> > > + unsigned long nr_slab;
> > > + struct reclaim_state *reclaim_state = current->reclaim_state;
> > > + struct shrink_control shrink = {
> > > + .gfp_mask = sc->gfp_mask,
> > > + };
> > > +
> > > + /* Reclaim above the high watermark. */
> > > + sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> >
> > OK, so the cap is at high watermark which sounds OK to me, although I
> > would expect balance_gap being considered here. Is it not used
> > intentionally or you just wanted to have a reasonable upper bound?
> >
>
> It's intentional. The balance_gap is taken into account before the
> decision to shrink but not afterwards. As the watermark check after
> shrinking is based on just the high watermark, I decided to have
> shrink_zone reclaim on that basis.

OK, it makes sense. Thanks both you and Rik for clarification.

--
Michal Hocko
SUSE Labs

2013-03-21 14:01:57

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On Sun 17-03-13 13:04:08, Mel Gorman wrote:
> Simplistically, the anon and file LRU lists are scanned proportionally
> depending on the value of vm.swappiness although there are other factors
> taken into account by get_scan_count(). The patch "mm: vmscan: Limit
> the number of pages kswapd reclaims" limits the number of pages kswapd
> reclaims but it breaks this proportional scanning and may evenly shrink
> anon/file LRUs regardless of vm.swappiness.
>
> This patch preserves the proportional scanning and reclaim. It does mean
> that kswapd will reclaim more than requested but the number of pages will
> be related to the high watermark.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 52 +++++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 41 insertions(+), 11 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4835a7a..182ff15 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1815,6 +1815,45 @@ out:
> }
> }
>
> +static void recalculate_scan_count(unsigned long nr_reclaimed,
> + unsigned long nr_to_reclaim,
> + unsigned long nr[NR_LRU_LISTS])
> +{
> + enum lru_list l;
> +
> + /*
> + * For direct reclaim, reclaim the number of pages requested. Less
> + * care is taken to ensure that scanning for each LRU is properly
> + * proportional. This is unfortunate and is improper aging but
> + * minimises the amount of time a process is stalled.
> + */
> + if (!current_is_kswapd()) {
> + if (nr_reclaimed >= nr_to_reclaim) {
> + for_each_evictable_lru(l)
> + nr[l] = 0;
> + }
> + return;

Heh, this is nicely cryptically said what could be done in shrink_lruvec
as
if (!current_is_kswapd()) {
if (nr_reclaimed >= nr_to_reclaim)
break;
}

Besides that this is not memcg aware which I think it would break
targeted reclaim which is kind of direct reclaim but it still would be
good to stay proportional because it starts with DEF_PRIORITY.

I would suggest moving this back to shrink_lruvec and update the test as
follows:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 182ff15..5cf5a4b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1822,23 +1822,9 @@ static void recalculate_scan_count(unsigned long nr_reclaimed,
enum lru_list l;

/*
- * For direct reclaim, reclaim the number of pages requested. Less
- * care is taken to ensure that scanning for each LRU is properly
- * proportional. This is unfortunate and is improper aging but
- * minimises the amount of time a process is stalled.
- */
- if (!current_is_kswapd()) {
- if (nr_reclaimed >= nr_to_reclaim) {
- for_each_evictable_lru(l)
- nr[l] = 0;
- }
- return;
- }
-
- /*
- * For kswapd, reclaim at least the number of pages requested.
- * However, ensure that LRUs shrink by the proportion requested
- * by get_scan_count() so vm.swappiness is obeyed.
+ * Reclaim at least the number of pages requested. However,
+ * ensure that LRUs shrink by the proportion requested by
+ * get_scan_count() so vm.swappiness is obeyed.
*/
if (nr_reclaimed >= nr_to_reclaim) {
unsigned long min = ULONG_MAX;
@@ -1881,6 +1867,18 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
}
}

+ /*
+ * For global direct reclaim, reclaim the number of
+ * pages requested. Less care is taken to ensure that
+ * scanning for each LRU is properly proportional. This
+ * is unfortunate and is improper aging but minimises
+ * the amount of time a process is stalled.
+ */
+ if (global_reclaim(sc) && !current_is_kswapd()) {
+ if (nr_reclaimed >= nr_to_reclaim)
+ break;
+ }
+
recalculate_scan_count(nr_reclaimed, nr_to_reclaim, nr);
}
blk_finish_plug(&plug);

> + }
> +
> + /*
> + * For kswapd, reclaim at least the number of pages requested.
> + * However, ensure that LRUs shrink by the proportion requested
> + * by get_scan_count() so vm.swappiness is obeyed.
> + */
> + if (nr_reclaimed >= nr_to_reclaim) {
> + unsigned long min = ULONG_MAX;
> +
> + /* Find the LRU with the fewest pages to reclaim */
> + for_each_evictable_lru(l)
> + if (nr[l] < min)
> + min = nr[l];
> +
> + /* Normalise the scan counts so kswapd scans proportionally */
> + for_each_evictable_lru(l)
> + nr[l] -= min;
> + }

It looked scary at first glance but it makes sense. Every round (after we
have reclaimed enough) one LRU is pulled out and others are
proportionally inhibited.

> +}
> +
> /*
> * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
> */
> @@ -1841,17 +1880,8 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> lruvec, sc);
> }
> }
> - /*
> - * On large memory systems, scan >> priority can become
> - * really large. This is fine for the starting priority;
> - * we want to put equal scanning pressure on each zone.
> - * However, if the VM has a harder time of freeing pages,
> - * with multiple processes reclaiming pages, the total
> - * freeing target can get unreasonably large.
> - */
> - if (nr_reclaimed >= nr_to_reclaim &&
> - sc->priority < DEF_PRIORITY)
> - break;
> +
> + recalculate_scan_count(nr_reclaimed, nr_to_reclaim, nr);
> }
> blk_finish_plug(&plug);
> sc->nr_reclaimed += nr_reclaimed;
> --
> 1.8.1.4
>

--
Michal Hocko
SUSE Labs

2013-03-21 14:31:21

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On Thu, Mar 21, 2013 at 03:01:54PM +0100, Michal Hocko wrote:
> On Sun 17-03-13 13:04:08, Mel Gorman wrote:
> > Simplistically, the anon and file LRU lists are scanned proportionally
> > depending on the value of vm.swappiness although there are other factors
> > taken into account by get_scan_count(). The patch "mm: vmscan: Limit
> > the number of pages kswapd reclaims" limits the number of pages kswapd
> > reclaims but it breaks this proportional scanning and may evenly shrink
> > anon/file LRUs regardless of vm.swappiness.
> >
> > This patch preserves the proportional scanning and reclaim. It does mean
> > that kswapd will reclaim more than requested but the number of pages will
> > be related to the high watermark.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/vmscan.c | 52 +++++++++++++++++++++++++++++++++++++++++-----------
> > 1 file changed, 41 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4835a7a..182ff15 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1815,6 +1815,45 @@ out:
> > }
> > }
> >
> > +static void recalculate_scan_count(unsigned long nr_reclaimed,
> > + unsigned long nr_to_reclaim,
> > + unsigned long nr[NR_LRU_LISTS])
> > +{
> > + enum lru_list l;
> > +
> > + /*
> > + * For direct reclaim, reclaim the number of pages requested. Less
> > + * care is taken to ensure that scanning for each LRU is properly
> > + * proportional. This is unfortunate and is improper aging but
> > + * minimises the amount of time a process is stalled.
> > + */
> > + if (!current_is_kswapd()) {
> > + if (nr_reclaimed >= nr_to_reclaim) {
> > + for_each_evictable_lru(l)
> > + nr[l] = 0;
> > + }
> > + return;
>
> Heh, this is nicely cryptically said what could be done in shrink_lruvec
> as
> if (!current_is_kswapd()) {
> if (nr_reclaimed >= nr_to_reclaim)
> break;
> }
>

Pretty much. At one point during development, this function was more
complex and it evolved into this without me rechecking if splitting it
out still made sense.

> Besides that this is not memcg aware which I think it would break
> targeted reclaim which is kind of direct reclaim but it still would be
> good to stay proportional because it starts with DEF_PRIORITY.
>

This does break memcg because it's a special sort of direct reclaim.

> I would suggest moving this back to shrink_lruvec and update the test as
> follows:

I also noticed that we check whether the scan counts need to be
normalised more than once and this reshuffling checks nr_reclaimed
twice. How about this?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 182ff15..320a2f4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1815,45 +1815,6 @@ out:
}
}

-static void recalculate_scan_count(unsigned long nr_reclaimed,
- unsigned long nr_to_reclaim,
- unsigned long nr[NR_LRU_LISTS])
-{
- enum lru_list l;
-
- /*
- * For direct reclaim, reclaim the number of pages requested. Less
- * care is taken to ensure that scanning for each LRU is properly
- * proportional. This is unfortunate and is improper aging but
- * minimises the amount of time a process is stalled.
- */
- if (!current_is_kswapd()) {
- if (nr_reclaimed >= nr_to_reclaim) {
- for_each_evictable_lru(l)
- nr[l] = 0;
- }
- return;
- }
-
- /*
- * For kswapd, reclaim at least the number of pages requested.
- * However, ensure that LRUs shrink by the proportion requested
- * by get_scan_count() so vm.swappiness is obeyed.
- */
- if (nr_reclaimed >= nr_to_reclaim) {
- unsigned long min = ULONG_MAX;
-
- /* Find the LRU with the fewest pages to reclaim */
- for_each_evictable_lru(l)
- if (nr[l] < min)
- min = nr[l];
-
- /* Normalise the scan counts so kswapd scans proportionally */
- for_each_evictable_lru(l)
- nr[l] -= min;
- }
-}
-
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
@@ -1864,7 +1825,9 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
enum lru_list lru;
unsigned long nr_reclaimed = 0;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
+ unsigned long min;
struct blk_plug plug;
+ bool scan_adjusted = false;

get_scan_count(lruvec, sc, nr);

@@ -1881,7 +1844,33 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
}
}

- recalculate_scan_count(nr_reclaimed, nr_to_reclaim, nr);
+ if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
+ continue;
+
+ /*
+ * For global direct reclaim, reclaim only the number of pages
+ * requested. Less care is taken to scan proportionally as it
+ * is more important to minimise direct reclaim stall latency
+ * than it is to properly age the LRU lists.
+ */
+ if (global_reclaim(sc) && !current_is_kswapd())
+ break;
+
+ /*
+ * For kswapd and memcg, reclaim at least the number of pages
+ * requested. However, ensure that LRUs shrink by the
+ * proportion requested by get_scan_count() so vm.swappiness
+ * is obeyed. Find the smallest LRU list and normalise the
+ * scan counts so the fewest number of pages are reclaimed
+ * while still maintaining proportionality.
+ */
+ min = ULONG_MAX;
+ for_each_evictable_lru(lru)
+ if (nr[lru] < min)
+ min = nr[lru];
+ for_each_evictable_lru(lru)
+ nr[lru] -= min;
+ scan_adjusted = true;
}
blk_finish_plug(&plug);
sc->nr_reclaimed += nr_reclaimed;

2013-03-21 14:55:02

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop

On Sun 17-03-13 13:04:09, Mel Gorman wrote:
> kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX
> pages have been reclaimed or the pgdat is considered balanced. It then
> rechecks if it needs to restart at DEF_PRIORITY and whether high-order
> reclaim needs to be reset. This is not wrong per se but it is confusing
> to follow and forcing kswapd to stay at DEF_PRIORITY may require several
> restarts before it has scanned enough pages to meet the high watermark even
> at 100% efficiency.

> This patch irons out the logic a bit by controlling when priority is
> raised and removing the "goto loop_again".

Applause, Mr. Gorman ;) I really hate this goto loop_again. It makes me
scratch my head all the time.

> This patch has kswapd raise the scanning priority until it is scanning
> enough pages that it could meet the high watermark in one shrink of the
> LRU lists if it is able to reclaim at 100% efficiency. It will not raise
> the scanning priority higher unless it is failing to reclaim any pages.
>
> To avoid infinite looping for high-order allocation requests kswapd will
> not reclaim for high-order allocations when it has reclaimed at least
> twice the number of pages as the allocation request.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 86 ++++++++++++++++++++++++++++++-------------------------------
> 1 file changed, 42 insertions(+), 44 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 182ff15..279d0c2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2625,8 +2625,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> /*
> * kswapd shrinks the zone by the number of pages required to reach
> * the high watermark.
> + *
> + * Returns true if kswapd scanned at least the requested number of
> + * pages to reclaim.

Maybe move the comment about not raising priority in such a case here to
make clear what the return value means. Without that, the return value
could be misinterpreted as meaning that kswapd_shrink_zone succeeded in
shrinking, which might not be true.
Or maybe even better, leave the void there and add bool *raise_priority
argument here so the decision and raise_priority are at the same place.

> */
> -static void kswapd_shrink_zone(struct zone *zone,
> +static bool kswapd_shrink_zone(struct zone *zone,
> struct scan_control *sc,
> unsigned long lru_pages)
> {
> @@ -2646,6 +2649,8 @@ static void kswapd_shrink_zone(struct zone *zone,
>
> if (nr_slab == 0 && !zone_reclaimable(zone))
> zone->all_unreclaimable = 1;
> +
> + return sc->nr_scanned >= sc->nr_to_reclaim;
> }
>
> /*
[...]
> @@ -2803,8 +2805,16 @@ loop_again:
>
> if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
> !zone_balanced(zone, testorder,
> - balance_gap, end_zone))
> - kswapd_shrink_zone(zone, &sc, lru_pages);
> + balance_gap, end_zone)) {
> + /*
> + * There should be no need to raise the
> + * scanning priority if enough pages are
> + * already being scanned that that high

s/that that/that/

> + * watermark would be met at 100% efficiency.
> + */
> + if (kswapd_shrink_zone(zone, &sc, lru_pages))
> + raise_priority = false;
> + }
>
> /*
> * If we're getting trouble reclaiming, start doing
> @@ -2839,46 +2849,33 @@ loop_again:
> pfmemalloc_watermark_ok(pgdat))
> wake_up(&pgdat->pfmemalloc_wait);
>
> - if (pgdat_balanced(pgdat, order, *classzone_idx)) {
> - pgdat_is_balanced = true;
> - break; /* kswapd: all done */
> - }
> -
> /*
> - * We do this so kswapd doesn't build up large priorities for
> - * example when it is freeing in parallel with allocators. It
> - * matches the direct reclaim path behaviour in terms of impact
> - * on zone->*_priority.
> + * Fragmentation may mean that the system cannot be rebalanced
> + * for high-order allocations in all zones. If twice the
> + * allocation size has been reclaimed and the zones are still
> + * not balanced then recheck the watermarks at order-0 to
> + * prevent kswapd reclaiming excessively. Assume that a
> + * process requested a high-order can direct reclaim/compact.
> */
> - if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> - break;
> - } while (--sc.priority >= 0);
> + if (order && sc.nr_reclaimed >= 2UL << order)
> + order = sc.order = 0;
>
> -out:
> - if (!pgdat_is_balanced) {
> - cond_resched();
> + /* Check if kswapd should be suspending */
> + if (try_to_freeze() || kthread_should_stop())
> + break;
>
> - try_to_freeze();
> + /* If no reclaim progress then increase scanning priority */
> + if (sc.nr_reclaimed - nr_reclaimed == 0)
> + raise_priority = true;
>
> /*
> - * Fragmentation may mean that the system cannot be
> - * rebalanced for high-order allocations in all zones.
> - * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
> - * it means the zones have been fully scanned and are still
> - * not balanced. For high-order allocations, there is
> - * little point trying all over again as kswapd may
> - * infinite loop.
> - *
> - * Instead, recheck all watermarks at order-0 as they
> - * are the most important. If watermarks are ok, kswapd will go
> - * back to sleep. High-order users can still perform direct
> - * reclaim if they wish.
> + * Raise priority if scanning rate is too low or there was no
> + * progress in reclaiming pages
> */
> - if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> - order = sc.order = 0;
> -
> - goto loop_again;
> - }
> + if (raise_priority || sc.nr_reclaimed - nr_reclaimed == 0)

(sc.nr_reclaimed - nr_reclaimed == 0) is redundant because you already
set raise_priority above in that case.

> + sc.priority--;
> + } while (sc.priority >= 0 &&
> + !pgdat_balanced(pgdat, order, *classzone_idx));
>
> /*
> * If kswapd was reclaiming at a higher order, it has the option of
> @@ -2907,6 +2904,7 @@ out:
> compact_pgdat(pgdat, order);
> }
>
> +out:
> /*
> * Return the order we were reclaiming at so prepare_kswapd_sleep()
> * makes a decision on the order we were last reclaiming at. However,

It looks OK otherwise but I have to think some more as balance_pgdat is
still tricky, albeit less than it was before, so this is definitely
progress.

Thanks!
--
Michal Hocko
SUSE Labs

2013-03-21 15:07:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On Thu 21-03-13 14:31:15, Mel Gorman wrote:
> On Thu, Mar 21, 2013 at 03:01:54PM +0100, Michal Hocko wrote:
> > On Sun 17-03-13 13:04:08, Mel Gorman wrote:
> > > Simplistically, the anon and file LRU lists are scanned proportionally
> > > depending on the value of vm.swappiness although there are other factors
> > > taken into account by get_scan_count(). The patch "mm: vmscan: Limit
> > > the number of pages kswapd reclaims" limits the number of pages kswapd
> > > reclaims but it breaks this proportional scanning and may evenly shrink
> > > anon/file LRUs regardless of vm.swappiness.
> > >
> > > This patch preserves the proportional scanning and reclaim. It does mean
> > > that kswapd will reclaim more than requested but the number of pages will
> > > be related to the high watermark.
> > >
> > > Signed-off-by: Mel Gorman <[email protected]>
> > > ---
> > > mm/vmscan.c | 52 +++++++++++++++++++++++++++++++++++++++++-----------
> > > 1 file changed, 41 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 4835a7a..182ff15 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1815,6 +1815,45 @@ out:
> > > }
> > > }
> > >
> > > +static void recalculate_scan_count(unsigned long nr_reclaimed,
> > > + unsigned long nr_to_reclaim,
> > > + unsigned long nr[NR_LRU_LISTS])
> > > +{
> > > + enum lru_list l;
> > > +
> > > + /*
> > > + * For direct reclaim, reclaim the number of pages requested. Less
> > > + * care is taken to ensure that scanning for each LRU is properly
> > > + * proportional. This is unfortunate and is improper aging but
> > > + * minimises the amount of time a process is stalled.
> > > + */
> > > + if (!current_is_kswapd()) {
> > > + if (nr_reclaimed >= nr_to_reclaim) {
> > > + for_each_evictable_lru(l)
> > > + nr[l] = 0;
> > > + }
> > > + return;
> >
> > Heh, this is nicely cryptically said what could be done in shrink_lruvec
> > as
> > if (!current_is_kswapd()) {
> > if (nr_reclaimed >= nr_to_reclaim)
> > break;
> > }
> >
>
> Pretty much. At one point during development, this function was more
> complex and it evolved into this without me rechecking if splitting it
> out still made sense.
>
> > Besides that this is not memcg aware which I think it would break
> > targeted reclaim which is kind of direct reclaim but it still would be
> > good to stay proportional because it starts with DEF_PRIORITY.
> >
>
> This does break memcg because it's a special sort of direct reclaim.
>
> > I would suggest moving this back to shrink_lruvec and update the test as
> > follows:
>
> I also noticed that we check whether the scan counts need to be
> normalised more than once

I didn't mind this because it "disqualified" at least one LRU every
round which sounds reasonable to me because all LRUs would be scanned
proportionally. E.g. if swappiness is 0 then nr[anon] would be 0 and
then the active/inactive aging would break? Or am I missing something?

> and this reshuffling checks nr_reclaimed twice. How about this?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 182ff15..320a2f4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1815,45 +1815,6 @@ out:
> }
> }
>
> -static void recalculate_scan_count(unsigned long nr_reclaimed,
> - unsigned long nr_to_reclaim,
> - unsigned long nr[NR_LRU_LISTS])
> -{
> - enum lru_list l;
> -
> - /*
> - * For direct reclaim, reclaim the number of pages requested. Less
> - * care is taken to ensure that scanning for each LRU is properly
> - * proportional. This is unfortunate and is improper aging but
> - * minimises the amount of time a process is stalled.
> - */
> - if (!current_is_kswapd()) {
> - if (nr_reclaimed >= nr_to_reclaim) {
> - for_each_evictable_lru(l)
> - nr[l] = 0;
> - }
> - return;
> - }
> -
> - /*
> - * For kswapd, reclaim at least the number of pages requested.
> - * However, ensure that LRUs shrink by the proportion requested
> - * by get_scan_count() so vm.swappiness is obeyed.
> - */
> - if (nr_reclaimed >= nr_to_reclaim) {
> - unsigned long min = ULONG_MAX;
> -
> - /* Find the LRU with the fewest pages to reclaim */
> - for_each_evictable_lru(l)
> - if (nr[l] < min)
> - min = nr[l];
> -
> - /* Normalise the scan counts so kswapd scans proportionally */
> - for_each_evictable_lru(l)
> - nr[l] -= min;
> - }
> -}
> -
> /*
> * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
> */
> @@ -1864,7 +1825,9 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> enum lru_list lru;
> unsigned long nr_reclaimed = 0;
> unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> + unsigned long min;
> struct blk_plug plug;
> + bool scan_adjusted = false;
>
> get_scan_count(lruvec, sc, nr);
>
> @@ -1881,7 +1844,33 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> }
> }
>
> - recalculate_scan_count(nr_reclaimed, nr_to_reclaim, nr);
> + if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
> + continue;
> +
> + /*
> + * For global direct reclaim, reclaim only the number of pages
> + * requested. Less care is taken to scan proportionally as it
> + * is more important to minimise direct reclaim stall latency
> + * than it is to properly age the LRU lists.
> + */
> + if (global_reclaim(sc) && !current_is_kswapd())
> + break;
> +
> + /*
> + * For kswapd and memcg, reclaim at least the number of pages
> + * requested. However, ensure that LRUs shrink by the
> + * proportion requested by get_scan_count() so vm.swappiness
> + * is obeyed. Find the smallest LRU list and normalise the
> + * scan counts so the fewest number of pages are reclaimed
> + * while still maintaining proportionality.
> + */
> + min = ULONG_MAX;
> + for_each_evictable_lru(lru)
> + if (nr[lru] < min)
> + min = nr[lru];
> + for_each_evictable_lru(lru)
> + nr[lru] -= min;
> + scan_adjusted = true;
> }
> blk_finish_plug(&plug);
> sc->nr_reclaimed += nr_reclaimed;

--
Michal Hocko
SUSE Labs

2013-03-21 15:26:10

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop

On Thu, Mar 21, 2013 at 03:54:58PM +0100, Michal Hocko wrote:
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/vmscan.c | 86 ++++++++++++++++++++++++++++++-------------------------------
> > 1 file changed, 42 insertions(+), 44 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 182ff15..279d0c2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2625,8 +2625,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> > /*
> > * kswapd shrinks the zone by the number of pages required to reach
> > * the high watermark.
> > + *
> > + * Returns true if kswapd scanned at least the requested number of
> > + * pages to reclaim.
>
> Maybe move the comment about not raising priority in such a case here to
> make clear what the return value means. Without that, the return value
> could be misinterpreted as meaning that kswapd_shrink_zone succeeded in
> shrinking, which might not be true.

I moved the comment.

> Or maybe even better, leave the void there and add bool *raise_priority
> argument here so the decision and raise_priority are at the same place.
>

The priority is raised if kswapd failed to reclaim from any of the
unbalanced zones. If raise_priority is moved inside kswapd_shrink_zone
then it can only take one zone into account.
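
A minimal sketch of that control flow (zone names and numbers are
hypothetical): the per-zone helper only reports whether it scanned
enough, and the caller clears raise_priority if any unbalanced zone did.

#include <stdbool.h>
#include <stdio.h>

struct fake_zone {
	const char *name;
	unsigned long nr_scanned;
	unsigned long nr_to_reclaim;
};

/* mirrors "return sc->nr_scanned >= sc->nr_to_reclaim" */
static bool shrink_one_zone(const struct fake_zone *z)
{
	return z->nr_scanned >= z->nr_to_reclaim;
}

int main(void)
{
	struct fake_zone zones[] = {
		{ "DMA32",  4000, 2048 },
		{ "Normal",  900, 4096 },
	};
	bool raise_priority = true;
	unsigned int i;

	for (i = 0; i < sizeof(zones) / sizeof(zones[0]); i++)
		if (shrink_one_zone(&zones[i]))
			raise_priority = false;

	printf("raise_priority = %s\n", raise_priority ? "true" : "false");
	return 0;
}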

> > */
> > -static void kswapd_shrink_zone(struct zone *zone,
> > +static bool kswapd_shrink_zone(struct zone *zone,
> > struct scan_control *sc,
> > unsigned long lru_pages)
> > {
> > @@ -2646,6 +2649,8 @@ static void kswapd_shrink_zone(struct zone *zone,
> >
> > if (nr_slab == 0 && !zone_reclaimable(zone))
> > zone->all_unreclaimable = 1;
> > +
> > + return sc->nr_scanned >= sc->nr_to_reclaim;
> > }
> >
> > /*
> [...]
> > @@ -2803,8 +2805,16 @@ loop_again:
> >
> > if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
> > !zone_balanced(zone, testorder,
> > - balance_gap, end_zone))
> > - kswapd_shrink_zone(zone, &sc, lru_pages);
> > + balance_gap, end_zone)) {
> > + /*
> > + * There should be no need to raise the
> > + * scanning priority if enough pages are
> > + * already being scanned that that high
>
> s/that that/that/
>

Fixed

> > + * watermark would be met at 100% efficiency.
> > + */
> > + if (kswapd_shrink_zone(zone, &sc, lru_pages))
> > + raise_priority = false;
> > + }
> >
> > /*
> > * If we're getting trouble reclaiming, start doing
> > @@ -2839,46 +2849,33 @@ loop_again:
> > pfmemalloc_watermark_ok(pgdat))
> > wake_up(&pgdat->pfmemalloc_wait);
> >
> > - if (pgdat_balanced(pgdat, order, *classzone_idx)) {
> > - pgdat_is_balanced = true;
> > - break; /* kswapd: all done */
> > - }
> > -
> > /*
> > - * We do this so kswapd doesn't build up large priorities for
> > - * example when it is freeing in parallel with allocators. It
> > - * matches the direct reclaim path behaviour in terms of impact
> > - * on zone->*_priority.
> > + * Fragmentation may mean that the system cannot be rebalanced
> > + * for high-order allocations in all zones. If twice the
> > + * allocation size has been reclaimed and the zones are still
> > + * not balanced then recheck the watermarks at order-0 to
> > + * prevent kswapd reclaiming excessively. Assume that a
> > + * process requested a high-order can direct reclaim/compact.
> > */
> > - if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> > - break;
> > - } while (--sc.priority >= 0);
> > + if (order && sc.nr_reclaimed >= 2UL << order)
> > + order = sc.order = 0;
> >
> > -out:
> > - if (!pgdat_is_balanced) {
> > - cond_resched();
> > + /* Check if kswapd should be suspending */
> > + if (try_to_freeze() || kthread_should_stop())
> > + break;
> >
> > - try_to_freeze();
> > + /* If no reclaim progress then increase scanning priority */
> > + if (sc.nr_reclaimed - nr_reclaimed == 0)
> > + raise_priority = true;
> >
> > /*
> > - * Fragmentation may mean that the system cannot be
> > - * rebalanced for high-order allocations in all zones.
> > - * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
> > - * it means the zones have been fully scanned and are still
> > - * not balanced. For high-order allocations, there is
> > - * little point trying all over again as kswapd may
> > - * infinite loop.
> > - *
> > - * Instead, recheck all watermarks at order-0 as they
> > - * are the most important. If watermarks are ok, kswapd will go
> > - * back to sleep. High-order users can still perform direct
> > - * reclaim if they wish.
> > + * Raise priority if scanning rate is too low or there was no
> > + * progress in reclaiming pages
> > */
> > - if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> > - order = sc.order = 0;
> > -
> > - goto loop_again;
> > - }
> > + if (raise_priority || sc.nr_reclaimed - nr_reclaimed == 0)
>
> (sc.nr_reclaimed - nr_reclaimed == 0) is redundant because you already
> set raise_priority above in that case.
>

I removed the redundant check.

> > + sc.priority--;
> > + } while (sc.priority >= 0 &&
> > + !pgdat_balanced(pgdat, order, *classzone_idx));
> >
> > /*
> > * If kswapd was reclaiming at a higher order, it has the option of
> > @@ -2907,6 +2904,7 @@ out:
> > compact_pgdat(pgdat, order);
> > }
> >
> > +out:
> > /*
> > * Return the order we were reclaiming at so prepare_kswapd_sleep()
> > * makes a decision on the order we were last reclaiming at. However,
>
> It looks OK otherwise but I have to think some more as balance_pgdat is
> still tricky, albeit less than it was before, so this is definitely
> progress.
>

Thanks.

--
Mel Gorman
SUSE Labs

2013-03-21 15:32:34

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 04/10] mm: vmscan: Decide whether to compact the pgdat based on reclaim progress

On Sun 17-03-13 13:04:10, Mel Gorman wrote:
> In the past, kswapd makes a decision on whether to compact memory after the
> pgdat was considered balanced. This more or less worked but it is late to
> make such a decision and does not fit well now that kswapd makes a decision
> whether to exit the zone scanning loop depending on reclaim progress.
>
> This patch will compact a pgdat if at least the requested number of pages
> were reclaimed from unbalanced zones for a given priority. If any zone is
> currently balanced, kswapd will not call compaction as it is expected the
> necessary pages are already available.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 52 +++++++++++++++++++++-------------------------------
> 1 file changed, 21 insertions(+), 31 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 279d0c2..7513bd1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2694,8 +2694,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>
> do {
> unsigned long lru_pages = 0;
> + unsigned long nr_to_reclaim = 0;
> unsigned long nr_reclaimed = sc.nr_reclaimed;
> + unsigned long this_reclaimed;
> bool raise_priority = true;
> + bool pgdat_needs_compaction = true;

I am confused. We don't want to compact for order == 0, do we?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7513bd1..1b89c29 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2698,7 +2698,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
unsigned long nr_reclaimed = sc.nr_reclaimed;
unsigned long this_reclaimed;
bool raise_priority = true;
- bool pgdat_needs_compaction = true;
+ bool pgdat_needs_compaction = order > 0;

/*
* Scan in the highmem->dma direction for the highest

Other than that it makes sense to me.

>
> /*
> * Scan in the highmem->dma direction for the highest
> @@ -2743,7 +2746,17 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> for (i = 0; i <= end_zone; i++) {
> struct zone *zone = pgdat->node_zones + i;
>
> + if (!populated_zone(zone))
> + continue;
> +
> lru_pages += zone_reclaimable_pages(zone);
> +
> + /* Check if the memory needs to be defragmented */
> + if (order && pgdat_needs_compaction &&
> + zone_watermark_ok(zone, order,
> + low_wmark_pages(zone),
> + *classzone_idx, 0))
> + pgdat_needs_compaction = false;
> }
>
> /*
> @@ -2814,6 +2827,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> */
> if (kswapd_shrink_zone(zone, &sc, lru_pages))
> raise_priority = false;
> +
> + nr_to_reclaim += sc.nr_to_reclaim;
> }
>
> /*
> @@ -2864,46 +2879,21 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> if (try_to_freeze() || kthread_should_stop())
> break;
>
> - /* If no reclaim progress then increase scanning priority */
> - if (sc.nr_reclaimed - nr_reclaimed == 0)
> - raise_priority = true;
> + /* Compact if necessary and kswapd is reclaiming efficiently */
> + this_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> + if (order && pgdat_needs_compaction &&
> + this_reclaimed > nr_to_reclaim)
> + compact_pgdat(pgdat, order);
>
> /*
> * Raise priority if scanning rate is too low or there was no
> * progress in reclaiming pages
> */
> - if (raise_priority || sc.nr_reclaimed - nr_reclaimed == 0)
> + if (raise_priority || !this_reclaimed)
> sc.priority--;
> } while (sc.priority >= 0 &&
> !pgdat_balanced(pgdat, order, *classzone_idx));
>
> - /*
> - * If kswapd was reclaiming at a higher order, it has the option of
> - * sleeping without all zones being balanced. Before it does, it must
> - * ensure that the watermarks for order-0 on *all* zones are met and
> - * that the congestion flags are cleared. The congestion flag must
> - * be cleared as kswapd is the only mechanism that clears the flag
> - * and it is potentially going to sleep here.
> - */
> - if (order) {
> - int zones_need_compaction = 1;
> -
> - for (i = 0; i <= end_zone; i++) {
> - struct zone *zone = pgdat->node_zones + i;
> -
> - if (!populated_zone(zone))
> - continue;
> -
> - /* Check if the memory needs to be defragmented. */
> - if (zone_watermark_ok(zone, order,
> - low_wmark_pages(zone), *classzone_idx, 0))
> - zones_need_compaction = 0;
> - }
> -
> - if (zones_need_compaction)
> - compact_pgdat(pgdat, order);
> - }
> -
> out:
> /*
> * Return the order we were reclaiming at so prepare_kswapd_sleep()
> --
> 1.8.1.4
>

--
Michal Hocko
SUSE Labs

2013-03-21 15:34:47

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On Thu, Mar 21, 2013 at 04:07:55PM +0100, Michal Hocko wrote:
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 4835a7a..182ff15 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -1815,6 +1815,45 @@ out:
> > > > }
> > > > }
> > > >
> > > > +static void recalculate_scan_count(unsigned long nr_reclaimed,
> > > > + unsigned long nr_to_reclaim,
> > > > + unsigned long nr[NR_LRU_LISTS])
> > > > +{
> > > > + enum lru_list l;
> > > > +
> > > > + /*
> > > > + * For direct reclaim, reclaim the number of pages requested. Less
> > > > + * care is taken to ensure that scanning for each LRU is properly
> > > > + * proportional. This is unfortunate and is improper aging but
> > > > + * minimises the amount of time a process is stalled.
> > > > + */
> > > > + if (!current_is_kswapd()) {
> > > > + if (nr_reclaimed >= nr_to_reclaim) {
> > > > + for_each_evictable_lru(l)
> > > > + nr[l] = 0;
> > > > + }
> > > > + return;
> > >
> > > Heh, this is nicely cryptically said what could be done in shrink_lruvec
> > > as
> > > if (!current_is_kswapd()) {
> > > if (nr_reclaimed >= nr_to_reclaim)
> > > break;
> > > }
> > >
> >
> > Pretty much. At one point during development, this function was more
> > complex and it evolved into this without me rechecking if splitting it
> > out still made sense.
> >
> > > Besides that this is not memcg aware which I think it would break
> > > targeted reclaim which is kind of direct reclaim but it still would be
> > > good to stay proportional because it starts with DEF_PRIORITY.
> > >
> >
> > This does break memcg because it's a special sort of direct reclaim.
> >
> > > I would suggest moving this back to shrink_lruvec and update the test as
> > > follows:
> >
> > I also noticed that we check whether the scan counts need to be
> > normalised more than once
>
> I didn't mind this because it "disqualified" at least one LRU every
> round which sounds reasonable to me because all LRUs would be scanned
> proportionally.

Once the scan count for one LRU is 0 then min will always be 0 and no
further adjustment is made. It's just redundant to check again.
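
A tiny sketch with invented counts: once any list has been zeroed, the
minimum is 0 and the subtraction is a no-op, which is what the
scan_adjusted flag avoids recomputing.

#include <stdio.h>

int main(void)
{
	/* one LRU already zeroed by an earlier adjustment; counts invented */
	unsigned long nr[4] = { 0, 30, 370, 70 };
	unsigned long min = (unsigned long)-1;
	int l;

	for (l = 0; l < 4; l++)
		if (nr[l] < min)
			min = nr[l];

	/* min is 0, so nr[l] -= min would leave every list unchanged */
	printf("min = %lu\n", min);
	return 0;
}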

> E.g. if swappiness is 0 then nr[anon] would be 0 and
> then the active/inactive aging would break? Or am I missing something?
>

If swappiness is 0 and nr[anon] is zero then the number of pages to scan
from every other LRU will never be adjusted. I do not see how this would
affect active/inactive scanning but maybe I'm misunderstanding you.

--
Mel Gorman
SUSE Labs

2013-03-21 15:39:04

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm: vmscan: Flatten kswapd priority loop

On Thu 21-03-13 15:26:02, Mel Gorman wrote:
> On Thu, Mar 21, 2013 at 03:54:58PM +0100, Michal Hocko wrote:
> > > Signed-off-by: Mel Gorman <[email protected]>
> > > ---
> > > mm/vmscan.c | 86 ++++++++++++++++++++++++++++++-------------------------------
> > > 1 file changed, 42 insertions(+), 44 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 182ff15..279d0c2 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2625,8 +2625,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> > > /*
> > > * kswapd shrinks the zone by the number of pages required to reach
> > > * the high watermark.
> > > + *
> > > + * Returns true if kswapd scanned at least the requested number of
> > > + * pages to reclaim.
> >
> > Maybe move the comment about not raising priority in such a case here to
> > make clear what the return value means. Without that, the return value
> > could be misinterpreted as meaning that kswapd_shrink_zone succeeded in
> > shrinking, which might not be true.
>
> I moved the comment.

Thanks

> > Or maybe even better, leave the void there and add bool *raise_priority
> > argument here so the decision and raise_priority are at the same place.
> >
>
> The priority is raised if kswapd failed to reclaim from any of the
> unbalanced zones. If raise_priority is moved inside kswapd_shrink_zone
> then it can only take one zone into account.

Right you are. I am blind.

--
Michal Hocko
SUSE Labs

2013-03-21 15:47:35

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/10] mm: vmscan: Decide whether to compact the pgdat based on reclaim progress

On Thu, Mar 21, 2013 at 04:32:31PM +0100, Michal Hocko wrote:
> On Sun 17-03-13 13:04:10, Mel Gorman wrote:
> > In the past, kswapd makes a decision on whether to compact memory after the
> > pgdat was considered balanced. This more or less worked but it is late to
> > make such a decision and does not fit well now that kswapd makes a decision
> > whether to exit the zone scanning loop depending on reclaim progress.
> >
> > This patch will compact a pgdat if at least the requested number of pages
> > were reclaimed from unbalanced zones for a given priority. If any zone is
> > currently balanced, kswapd will not call compaction as it is expected the
> > necessary pages are already available.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/vmscan.c | 52 +++++++++++++++++++++-------------------------------
> > 1 file changed, 21 insertions(+), 31 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 279d0c2..7513bd1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2694,8 +2694,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> >
> > do {
> > unsigned long lru_pages = 0;
> > + unsigned long nr_to_reclaim = 0;
> > unsigned long nr_reclaimed = sc.nr_reclaimed;
> > + unsigned long this_reclaimed;
> > bool raise_priority = true;
> > + bool pgdat_needs_compaction = true;
>
> I am confused. We don't want to compact for order == 0, do we?
>

No, but an order check is made later, which I felt was clearer. You are
the second person to bring it up so I'll base the initialisation on order.

--
Mel Gorman
SUSE Labs

2013-03-21 15:48:12

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 05/10] mm: vmscan: Do not allow kswapd to scan at maximum priority

On Sun 17-03-13 13:04:11, Mel Gorman wrote:
> Page reclaim at priority 0 will scan the entire LRU as priority 0 is
> considered to be a near OOM condition. Kswapd can reach priority 0 quite
> easily if it is encountering a large number of pages it cannot reclaim
> such as pages under writeback. When this happens, kswapd reclaims very
> aggressively even though there may be no real risk of allocation failure
> or OOM.
>
> This patch prevents kswapd reaching priority 0 and trying to reclaim
> the world. Direct reclaimers will still reach priority 0 in the event
> of an OOM situation.

OK, it should work. raise_priority should prevent pointless lowering of
the priority, and if there is really nothing to reclaim then relying on
direct reclaim is probably a better idea.

> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

> ---
> mm/vmscan.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7513bd1..af3bb6f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2891,7 +2891,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> */
> if (raise_priority || !this_reclaimed)
> sc.priority--;
> - } while (sc.priority >= 0 &&
> + } while (sc.priority >= 1 &&
> !pgdat_balanced(pgdat, order, *classzone_idx));
>
> out:
> --
> 1.8.1.4
>

--
Michal Hocko
SUSE Labs

2013-03-21 15:50:49

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 04/10] mm: vmscan: Decide whether to compact the pgdat based on reclaim progress

On Thu 21-03-13 15:47:31, Mel Gorman wrote:
> On Thu, Mar 21, 2013 at 04:32:31PM +0100, Michal Hocko wrote:
> > On Sun 17-03-13 13:04:10, Mel Gorman wrote:
> > > In the past, kswapd makes a decision on whether to compact memory after the
> > > pgdat was considered balanced. This more or less worked but it is late to
> > > make such a decision and does not fit well now that kswapd makes a decision
> > > whether to exit the zone scanning loop depending on reclaim progress.
> > >
> > > This patch will compact a pgdat if at least the requested number of pages
> > > were reclaimed from unbalanced zones for a given priority. If any zone is
> > > currently balanced, kswapd will not call compaction as it is expected the
> > > necessary pages are already available.
> > >
> > > Signed-off-by: Mel Gorman <[email protected]>
> > > ---
> > > mm/vmscan.c | 52 +++++++++++++++++++++-------------------------------
> > > 1 file changed, 21 insertions(+), 31 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 279d0c2..7513bd1 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2694,8 +2694,11 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> > >
> > > do {
> > > unsigned long lru_pages = 0;
> > > + unsigned long nr_to_reclaim = 0;
> > > unsigned long nr_reclaimed = sc.nr_reclaimed;
> > > + unsigned long this_reclaimed;
> > > bool raise_priority = true;
> > > + bool pgdat_needs_compaction = true;
> >
> > I am confused. We don't want to compact for order == 0, do we?
> >
>
> No, but an order check is made later which I felt was clearer. You are
> the second person to bring it up so I'll base the initialisation on order.

Dohh. Yes compact_pgdat is called only if order != 0. I was so focused
on pgdat_needs_compaction that I missed it. Both checks use (order &&
pgdat_needs_compaction) so initialization based on order would probably
be better for readability.
--
Michal Hocko
SUSE Labs

2013-03-21 15:58:05

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On Sun, Mar 17, 2013 at 01:04:07PM +0000, Mel Gorman wrote:
> The number of pages kswapd can reclaim is bound by the number of pages it
> scans which is related to the size of the zone and the scanning priority. In
> many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
> reclaimed pages but in the event kswapd scans a large number of pages it
> cannot reclaim, it will raise the priority and potentially discard a large
> percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
> effect is a reclaim "spike" where a large percentage of memory is suddenly
> freed. It would be bad enough if this was just unused memory but because
> of how anon/file pages are balanced it is possible that applications get
> pushed to swap unnecessarily.
>
> This patch limits the number of pages kswapd will reclaim to the high
> watermark. Reclaim will will overshoot due to it not being a hard limit as

will -> still?

> shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
> prevents kswapd reclaiming the world at higher priorities. The number of
> pages it reclaims is not adjusted for high-order allocations as kswapd will
> reclaim excessively if it is to balance zones for high-order allocations.

I don't really understand this last sentence. Is the excessive
reclaim a result of the patch, a description of what's happening
now...?

> Signed-off-by: Mel Gorman <[email protected]>

Nice, thank you. Using the high watermark for larger zones is more
reasonable than my hack that just always went with SWAP_CLUSTER_MAX,
what with inter-zone LRU cycle time balancing and all.

Acked-by: Johannes Weiner <[email protected]>

2013-03-21 16:26:17

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On Sun, Mar 17, 2013 at 01:04:08PM +0000, Mel Gorman wrote:
> Simplistically, the anon and file LRU lists are scanned proportionally
> depending on the value of vm.swappiness although there are other factors
> taken into account by get_scan_count(). The patch "mm: vmscan: Limit
> the number of pages kswapd reclaims" limits the number of pages kswapd
> reclaims but it breaks this proportional scanning and may evenly shrink
> anon/file LRUs regardless of vm.swappiness.
>
> This patch preserves the proportional scanning and reclaim. It does mean
> that kswapd will reclaim more than requested but the number of pages will
> be related to the high watermark.

Swappiness is about page types, but this implementation compares all
LRUs against each other, and I'm not convinced that this makes sense
as there is no guaranteed balance between the inactive and active
lists. For example, the active file LRU could get knocked out when
it's almost empty while the inactive file LRU has more easy cache than
the anon lists combined.

Would it be better to compare the sum of file pages with the sum of
anon pages and then knock out the smaller pair?

2013-03-21 16:32:32

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 07/10 -v2r1] mm: vmscan: Block kswapd if it is encountering pages under writeback

Here is what you have in your mm-vmscan-limit-reclaim-v2r1 branch:
> commit 0dae7d4be56e6a7fe3f128284679f5efc0cc2383
> Author: Mel Gorman <[email protected]>
> Date: Tue Mar 12 10:33:31 2013 +0000
>
> mm: vmscan: Block kswapd if it is encountering pages under writeback
>
> Historically, kswapd used to congestion_wait() at higher priorities if it
> was not making forward progress. This made no sense as the failure to make
> progress could be completely independent of IO. It was later replaced by
> wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't
> wait on congested zones in balance_pgdat()) as it was duplicating logic
> in shrink_inactive_list().
>
> This is problematic. If kswapd encounters many pages under writeback and
> it continues to scan until it reaches the high watermark then it will
> quickly skip over the pages under writeback and reclaim clean young
> pages or push applications out to swap.
>
> The use of wait_iff_congested() is not suited to kswapd as it will only
> stall if the underlying BDI is really congested or a direct reclaimer was
> unable to write to the underlying BDI. kswapd bypasses the BDI congestion
> as it sets PF_SWAPWRITE but even if this was taken into account then it
> would cause direct reclaimers to stall on writeback which is not desirable.
>
> This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
> encountering too many pages under writeback. If this flag is set and
> kswapd encounters a PageReclaim page under writeback then it'll assume
> that the LRU lists are being recycled too quickly before IO can complete
> and block waiting for some IO to complete.
>
> Signed-off-by: Mel Gorman <[email protected]>

Looks reasonable to me.
Reviewed-by: Michal Hocko <[email protected]>

>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index afedd1d..dd0d266 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -499,6 +499,9 @@ typedef enum {
> * many dirty file pages at the tail
> * of the LRU.
> */
> + ZONE_WRITEBACK, /* reclaim scanning has recently found
> + * many pages under writeback
> + */
> } zone_flags_t;
>
> static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
> @@ -526,6 +529,11 @@ static inline int zone_is_reclaim_dirty(const struct zone *zone)
> return test_bit(ZONE_TAIL_LRU_DIRTY, &zone->flags);
> }
>
> +static inline int zone_is_reclaim_writeback(const struct zone *zone)
> +{
> + return test_bit(ZONE_WRITEBACK, &zone->flags);
> +}
> +
> static inline int zone_is_reclaim_locked(const struct zone *zone)
> {
> return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a8b94fa..e87de90 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -723,25 +723,51 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
>
> + /*
> + * If a page at the tail of the LRU is under writeback, there
> + * are three cases to consider.
> + *
> + * 1) If reclaim is encountering an excessive number of pages
> + * under writeback and this page is both under writeback and
> + * PageReclaim then it indicates that pages are being queued
> + * for IO but are being recycled through the LRU before the
> + * IO can complete. In this case, wait on the IO to complete
> + * and then clear the ZONE_WRITEBACK flag to recheck if the
> + * condition exists.
> + *
> + * 2) Global reclaim encounters a page, memcg encounters a
> + * page that is not marked for immediate reclaim or
> + * the caller does not have __GFP_IO. In this case mark
> + * the page for immediate reclaim and continue scanning.
> + *
> + * __GFP_IO is checked because a loop driver thread might
> + * enter reclaim, and deadlock if it waits on a page for
> + * which it is needed to do the write (loop masks off
> + * __GFP_IO|__GFP_FS for this reason); but more thought
> + * would probably show more reasons.
> + *
> + * Don't require __GFP_FS, since we're not going into the
> + * FS, just waiting on its writeback completion. Worryingly,
> + * ext4 gfs2 and xfs allocate pages with
> + * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing
> + * may_enter_fs here is liable to OOM on them.
> + *
> + * 3) memcg encounters a page that is not already marked
> + * PageReclaim. memcg does not have any dirty pages
> + * throttling so we could easily OOM just because too many
> + * pages are in writeback and there is nothing else to
> + * reclaim. Wait for the writeback to complete.
> + */
> if (PageWriteback(page)) {
> - /*
> - * memcg doesn't have any dirty pages throttling so we
> - * could easily OOM just because too many pages are in
> - * writeback and there is nothing else to reclaim.
> - *
> - * Check __GFP_IO, certainly because a loop driver
> - * thread might enter reclaim, and deadlock if it waits
> - * on a page for which it is needed to do the write
> - * (loop masks off __GFP_IO|__GFP_FS for this reason);
> - * but more thought would probably show more reasons.
> - *
> - * Don't require __GFP_FS, since we're not going into
> - * the FS, just waiting on its writeback completion.
> - * Worryingly, ext4 gfs2 and xfs allocate pages with
> - * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so
> - * testing may_enter_fs here is liable to OOM on them.
> - */
> - if (global_reclaim(sc) ||
> + /* Case 1 above */
> + if (current_is_kswapd() &&
> + PageReclaim(page) &&
> + zone_is_reclaim_writeback(zone)) {
> + wait_on_page_writeback(page);
> + zone_clear_flag(zone, ZONE_WRITEBACK);
> +
> + /* Case 2 above */
> + } else if (global_reclaim(sc) ||
> !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
> /*
> * This is slightly racy - end_page_writeback()
> @@ -756,9 +782,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> */
> SetPageReclaim(page);
> nr_writeback++;
> +
> goto keep_locked;
> +
> + /* Case 3 above */
> + } else {
> + wait_on_page_writeback(page);
> }
> - wait_on_page_writeback(page);
> }
>
> if (!force_reclaim)
> @@ -1373,8 +1403,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> * isolated page is PageWriteback
> */
> if (nr_writeback && nr_writeback >=
> - (nr_taken >> (DEF_PRIORITY - sc->priority)))
> + (nr_taken >> (DEF_PRIORITY - sc->priority))) {
> wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> + zone_set_flag(zone, ZONE_WRITEBACK);
> + }
>
> /*
> * Similarly, if many dirty pages are encountered that are not
> @@ -2639,8 +2671,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> * kswapd shrinks the zone by the number of pages required to reach
> * the high watermark.
> *
> - * Returns true if kswapd scanned at least the requested number of
> - * pages to reclaim.
> + * Returns true if kswapd scanned at least the requested number of pages to
> + * reclaim or if the lack of progress was due to pages under writeback.
> */
> static bool kswapd_shrink_zone(struct zone *zone,
> struct scan_control *sc,
> @@ -2663,6 +2695,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
> if (nr_slab == 0 && !zone_reclaimable(zone))
> zone->all_unreclaimable = 1;
>
> + zone_clear_flag(zone, ZONE_WRITEBACK);
> +
> return sc->nr_scanned >= sc->nr_to_reclaim;
> }

--
Michal Hocko
SUSE Labs

2013-03-21 16:47:42

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On Thu, Mar 21, 2013 at 11:57:05AM -0400, Johannes Weiner wrote:
> On Sun, Mar 17, 2013 at 01:04:07PM +0000, Mel Gorman wrote:
> > The number of pages kswapd can reclaim is bound by the number of pages it
> > scans which is related to the size of the zone and the scanning priority. In
> > many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
> > reclaimed pages but in the event kswapd scans a large number of pages it
> > cannot reclaim, it will raise the priority and potentially discard a large
> > percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
> > effect is a reclaim "spike" where a large percentage of memory is suddenly
> > freed. It would be bad enough if this was just unused memory but because
> > of how anon/file pages are balanced it is possible that applications get
> > pushed to swap unnecessarily.
> >
> > This patch limits the number of pages kswapd will reclaim to the high
> > watermark. Reclaim will will overshoot due to it not being a hard limit as
>
> will -> still?
>
> > shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
> > prevents kswapd reclaiming the world at higher priorities. The number of
> > pages it reclaims is not adjusted for high-order allocations as kswapd will
> > reclaim excessively if it is to balance zones for high-order allocations.
>
> I don't really understand this last sentence. Is the excessive
> reclaim a result of the patch, a description of what's happening
> now...?
>

It's a very basic description of what happens now and with the patch
applied. Until patch 5 is applied, kswapd can still reclaim the world if
it reaches priority 0.

> > Signed-off-by: Mel Gorman <[email protected]>
>
> Nice, thank you. Using the high watermark for larger zones is more
> reasonable than my hack that just always went with SWAP_CLUSTER_MAX,
> what with inter-zone LRU cycle time balancing and all.
>
> Acked-by: Johannes Weiner <[email protected]>

Thanks.

--
Mel Gorman
SUSE Labs

2013-03-21 16:47:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

On Sun 17-03-13 13:04:14, Mel Gorman wrote:
> If kswapd fails to make progress but continues to shrink slab then it'll
> either discard all of slab or consume CPU uselessly scanning shrinkers.
> This patch causes kswapd to only call the shrinkers once per priority.
>
> Signed-off-by: Mel Gorman <[email protected]>

OK, looks good.
Reviewed-by: Michal Hocko <[email protected]>

> ---
> mm/vmscan.c | 28 +++++++++++++++++++++-------
> 1 file changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7d5a932..84375b2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2661,9 +2661,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> */
> static bool kswapd_shrink_zone(struct zone *zone,
> struct scan_control *sc,
> - unsigned long lru_pages)
> + unsigned long lru_pages,
> + bool shrinking_slab)
> {
> - unsigned long nr_slab;
> + unsigned long nr_slab = 0;
> struct reclaim_state *reclaim_state = current->reclaim_state;
> struct shrink_control shrink = {
> .gfp_mask = sc->gfp_mask,
> @@ -2673,9 +2674,15 @@ static bool kswapd_shrink_zone(struct zone *zone,
> sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> shrink_zone(zone, sc);
>
> - reclaim_state->reclaimed_slab = 0;
> - nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> - sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> + /*
> + * Slabs are shrunk for each zone once per priority or if the zone
> + * being balanced is otherwise unreclaimable
> + */
> + if (shrinking_slab || !zone_reclaimable(zone)) {
> + reclaim_state->reclaimed_slab = 0;
> + nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> + sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> + }
>
> if (nr_slab == 0 && !zone_reclaimable(zone))
> zone->all_unreclaimable = 1;
> @@ -2713,6 +2720,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
> unsigned long nr_soft_reclaimed;
> unsigned long nr_soft_scanned;
> + bool shrinking_slab = true;
> struct scan_control sc = {
> .gfp_mask = GFP_KERNEL,
> .priority = DEF_PRIORITY,
> @@ -2861,7 +2869,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> * already being scanned that that high
> * watermark would be met at 100% efficiency.
> */
> - if (kswapd_shrink_zone(zone, &sc, lru_pages))
> + if (kswapd_shrink_zone(zone, &sc,
> + lru_pages, shrinking_slab))
> raise_priority = false;
>
> nr_to_reclaim += sc.nr_to_reclaim;
> @@ -2900,6 +2909,9 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> pfmemalloc_watermark_ok(pgdat))
> wake_up(&pgdat->pfmemalloc_wait);
>
> + /* Only shrink slab once per priority */
> + shrinking_slab = false;
> +
> /*
> * Fragmentation may mean that the system cannot be rebalanced
> * for high-order allocations in all zones. If twice the
> @@ -2925,8 +2937,10 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> * Raise priority if scanning rate is too low or there was no
> * progress in reclaiming pages
> */
> - if (raise_priority || !this_reclaimed)
> + if (raise_priority || !this_reclaimed) {
> sc.priority--;
> + shrinking_slab = true;
> + }
> } while (sc.priority >= 1 &&
> !pgdat_balanced(pgdat, order, *classzone_idx));
>
> --
> 1.8.1.4
>

--
Michal Hocko
SUSE Labs

2013-03-21 16:58:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 09/10] mm: vmscan: Check if kswapd should writepage once per priority

On Sun 17-03-13 13:04:15, Mel Gorman wrote:
> Currently kswapd checks if it should start writepage as it shrinks
> each zone without taking into consideration if the zone is balanced or
> not. This is not wrong as such but it does not make much sense either.
> This patch checks once per priority if kswapd should be writing pages.

Except it is not once per priority strictly speaking... It doesn't make
any difference though.

> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

> ---
> mm/vmscan.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 84375b2..8c66e5a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2804,6 +2804,13 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> }
>
> /*
> + * If we're getting trouble reclaiming, start doing writepage
> + * even in laptop mode.
> + */
> + if (sc.priority < DEF_PRIORITY - 2)
> + sc.may_writepage = 1;
> +
> + /*
> * Now scan the zone in the dma->highmem direction, stopping
> * at the last zone which needs scanning.
> *
> @@ -2876,13 +2883,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> nr_to_reclaim += sc.nr_to_reclaim;
> }
>
> - /*
> - * If we're getting trouble reclaiming, start doing
> - * writepage even in laptop mode.
> - */
> - if (sc.priority < DEF_PRIORITY - 2)
> - sc.may_writepage = 1;
> -
> if (zone->all_unreclaimable) {
> if (end_zone && end_zone == i)
> end_zone--;
> --
> 1.8.1.4
>

--
Michal Hocko
SUSE Labs

2013-03-21 17:18:09

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 10/10] mm: vmscan: Move logic from balance_pgdat() to kswapd_shrink_zone()

On Sun 17-03-13 13:04:16, Mel Gorman wrote:
> balance_pgdat() is very long and some of the logic can and should
> be internal to kswapd_shrink_zone(). Move it so the flow of
> balance_pgdat() is marginally easier to follow.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 104 +++++++++++++++++++++++++++++-------------------------------
> 1 file changed, 51 insertions(+), 53 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8c66e5a..d7cf384 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2660,18 +2660,53 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> * reclaim or if the lack of progress was due to pages under writeback.
> */
> static bool kswapd_shrink_zone(struct zone *zone,
> + int classzone_idx,
> struct scan_control *sc,
> unsigned long lru_pages,
> bool shrinking_slab)
> {
> + int testorder = sc->order;
> unsigned long nr_slab = 0;
> + unsigned long balance_gap;
> struct reclaim_state *reclaim_state = current->reclaim_state;
> struct shrink_control shrink = {
> .gfp_mask = sc->gfp_mask,
> };
> + bool lowmem_pressure;
>
> /* Reclaim above the high watermark. */
> sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> +
> + /*
> + * Kswapd reclaims only single pages with compaction enabled. Trying
> + * too hard to reclaim until contiguous free pages have become
> + * available can hurt performance by evicting too much useful data
> + * from memory. Do not reclaim more than needed for compaction.
> + */
> + if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
> + compaction_suitable(zone, sc->order) !=
> + COMPACT_SKIPPED)
> + testorder = 0;
> +
> + /*
> + * We put equal pressure on every zone, unless one zone has way too
> + * many pages free already. The "too many pages" is defined as the
> + * high wmark plus a "gap" where the gap is either the low
> + * watermark or 1% of the zone, whichever is smaller.
> + */
> + balance_gap = min(low_wmark_pages(zone),
> + (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
> + KSWAPD_ZONE_BALANCE_GAP_RATIO);
> +
> + /*
> + * If there is no low memory pressure or the zone is balanced then no
> + * reclaim is necessary
> + */
> + lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
> + if (!(lowmem_pressure || !zone_balanced(zone, testorder,
> + balance_gap, classzone_idx)))

if (!lowmem_pressure && zone_balanced) would be less cryptic I guess

> + return true;
> +
> shrink_zone(zone, sc);
>
> /*
> @@ -2689,6 +2724,16 @@ static bool kswapd_shrink_zone(struct zone *zone,
>
> zone_clear_flag(zone, ZONE_WRITEBACK);
>
> + /*
> + * If a zone reaches its high watermark, consider it to be no longer
> + * congested. It's possible there are dirty pages backed by congested
> + * BDIs but as pressure is relieved, speculatively avoid congestion
> + * waits.
> + */
> + if (!zone->all_unreclaimable &&
> + zone_balanced(zone, testorder, 0, classzone_idx))
> + zone_clear_flag(zone, ZONE_CONGESTED);
> +
> return sc->nr_scanned >= sc->nr_to_reclaim;
> }
>
> @@ -2821,8 +2866,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> */
> for (i = 0; i <= end_zone; i++) {
> struct zone *zone = pgdat->node_zones + i;
> - int testorder;
> - unsigned long balance_gap;
>
> if (!populated_zone(zone))
> continue;
> @@ -2843,61 +2886,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> sc.nr_reclaimed += nr_soft_reclaimed;
>
> /*
> - * We put equal pressure on every zone, unless
> - * one zone has way too many pages free
> - * already. The "too many pages" is defined
> - * as the high wmark plus a "gap" where the
> - * gap is either the low watermark or 1%
> - * of the zone, whichever is smaller.
> - */
> - balance_gap = min(low_wmark_pages(zone),
> - (zone->managed_pages +
> - KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
> - KSWAPD_ZONE_BALANCE_GAP_RATIO);
> - /*
> - * Kswapd reclaims only single pages with compaction
> - * enabled. Trying too hard to reclaim until contiguous
> - * free pages have become available can hurt performance
> - * by evicting too much useful data from memory.
> - * Do not reclaim more than needed for compaction.
> + * There should be no need to raise the scanning
> + * priority if enough pages are already being scanned
> + * that that high watermark would be met at 100%
> + * efficiency.
> */
> - testorder = order;
> - if (IS_ENABLED(CONFIG_COMPACTION) && order &&
> - compaction_suitable(zone, order) !=
> - COMPACT_SKIPPED)
> - testorder = 0;
> -
> - if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
> - !zone_balanced(zone, testorder,
> - balance_gap, end_zone)) {
> - /*
> - * There should be no need to raise the
> - * scanning priority if enough pages are
> - * already being scanned that that high
> - * watermark would be met at 100% efficiency.
> - */
> - if (kswapd_shrink_zone(zone, &sc,
> + if (kswapd_shrink_zone(zone, end_zone, &sc,
> lru_pages, shrinking_slab))
> raise_priority = false;
>
> - nr_to_reclaim += sc.nr_to_reclaim;
> - }
> -
> - if (zone->all_unreclaimable) {
> - if (end_zone && end_zone == i)
> - end_zone--;
> - continue;
> - }
> -
> - if (zone_balanced(zone, testorder, 0, end_zone))
> - /*
> - * If a zone reaches its high watermark,
> - * consider it to be no longer congested. It's
> - * possible there are dirty pages backed by
> - * congested BDIs but as pressure is relieved,
> - * speculatively avoid congestion waits
> - */
> - zone_clear_flag(zone, ZONE_CONGESTED);
> + nr_to_reclaim += sc.nr_to_reclaim;
> }

nr_to_reclaim is updated if the zone is balanced and no reclaim is done,
which breaks the compaction condition AFAICS.

--
Michal Hocko
SUSE Labs

2013-03-21 17:56:37

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 06/10] mm: vmscan: Have kswapd writeback pages based on dirty pages encountered, not priority

On 03/17/2013 11:11 AM, Mel Gorman wrote:
> On Sun, Mar 17, 2013 at 07:42:39AM -0700, Andi Kleen wrote:
>> Mel Gorman <[email protected]> writes:
>>
>>> @@ -495,6 +495,9 @@ typedef enum {
>>> ZONE_CONGESTED, /* zone has many dirty pages backed by
>>> * a congested BDI
>>> */
>>> + ZONE_DIRTY, /* reclaim scanning has recently found
>>> + * many dirty file pages
>>> + */
>>
>> Needs a better name. ZONE_DIRTY_CONGESTED ?
>>
>
> That might be confusing. The underlying BDI is not necessarily
> congested. I accept your point though and will try thinking of a better
> name.

ZONE_LOTS_DIRTY ?

>>> + * currently being written then flag that kswapd should start
>>> + * writing back pages.
>>> + */
>>> + if (global_reclaim(sc) && nr_dirty &&
>>> + nr_dirty >= (nr_taken >> (DEF_PRIORITY - sc->priority)))
>>> + zone_set_flag(zone, ZONE_DIRTY);
>>> +
>>> trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
>>
>> I suppose you want to trace the dirty case here too.
>>
>
> I guess it wouldn't hurt to have a new tracepoint for when the flag gets
> set. A vmstat might be helpful as well.
>

2013-03-21 18:02:45

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On Thu, Mar 21, 2013 at 12:25:18PM -0400, Johannes Weiner wrote:
> On Sun, Mar 17, 2013 at 01:04:08PM +0000, Mel Gorman wrote:
> > Simplistically, the anon and file LRU lists are scanned proportionally
> > depending on the value of vm.swappiness although there are other factors
> > taken into account by get_scan_count(). The patch "mm: vmscan: Limit
> > the number of pages kswapd reclaims" limits the number of pages kswapd
> > reclaims but it breaks this proportional scanning and may evenly shrink
> > anon/file LRUs regardless of vm.swappiness.
> >
> > This patch preserves the proportional scanning and reclaim. It does mean
> > that kswapd will reclaim more than requested but the number of pages will
> > be related to the high watermark.
>
> Swappiness is about page types, but this implementation compares all
> LRUs against each other, and I'm not convinced that this makes sense
> as there is no guaranteed balance between the inactive and active
> lists. For example, the active file LRU could get knocked out when
> it's almost empty while the inactive file LRU has more easy cache than
> the anon lists combined.
>

Ok, I see your point. I think Michal was making the same point but I
failed to understand it the first time around.

> Would it be better to compare the sum of file pages with the sum of
> anon pages and then knock out the smaller pair?

Yes, it makes more sense but the issue then becomes how we can do that
sensibly. The following is straightforward and roughly in line with your
suggestion but it does not preserve the scanning ratio between the active
and inactive lists of the remaining LRU.

/*
 * For kswapd and memcg, reclaim at least the number of pages
 * requested. Ensure that the anon and file LRUs shrink
 * proportionally what was requested by get_scan_count(). We
 * stop reclaiming one LRU and reduce the amount scanning
 * required on the other.
 */
nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];

if (nr_file > nr_anon) {
        nr[LRU_INACTIVE_FILE] -= min(nr_anon, nr[LRU_INACTIVE_FILE]);
        nr[LRU_ACTIVE_FILE] -= min(nr_anon, nr[LRU_ACTIVE_FILE]);
        nr[LRU_INACTIVE_ANON] = nr[LRU_ACTIVE_ANON] = 0;
} else {
        nr[LRU_INACTIVE_ANON] -= min(nr_file, nr[LRU_INACTIVE_ANON]);
        nr[LRU_ACTIVE_ANON] -= min(nr_file, nr[LRU_ACTIVE_ANON]);
        nr[LRU_INACTIVE_FILE] = nr[LRU_ACTIVE_FILE] = 0;
}
scan_adjusted = true;

Preserving the ratio gets complicated and to avoid excessive branching,
it ends up looking like the following untested code.

/*
 * For kswapd and memcg, reclaim at least the number of pages
 * requested. Ensure that the anon and file LRUs shrink
 * proportionally what was requested by get_scan_count(). We
 * stop reclaiming one LRU and reduce the amount scanning
 * required on the other preserving the ratio between the
 * active/inactive lists.
 *
 * Start by preparing to shrink the larger of the LRUs by
 * the size of the smaller list.
 */
nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
nr_shrink = (nr_file > nr_anon) ? nr_anon : nr_file;
lru = (nr_file > nr_anon) ? LRU_FILE : 0;

/* Work out the ratio of the inactive/active list */
top = min(nr[LRU_ACTIVE + lru], nr[lru]);
bottom = max(nr[LRU_ACTIVE + lru], nr[lru]);
percentage = top * 100 / bottom;
nr_fraction = nr_shrink * percentage / 100;
nr_remaining = nr_anon - nr_fraction;

/* Reduce the remaining pages to scan proportionally */
if (nr[LRU_ACTIVE + lru] > nr[lru]) {
        nr[LRU_ACTIVE + lru] -= min(nr_remaining, nr[LRU_ACTIVE + lru]);
        nr[lru] -= min(nr_fraction, nr[lru]);
} else {
        nr[LRU_ACTIVE + lru] -= min(nr_fraction, nr[LRU_ACTIVE + lru]);
        nr[lru] -= min(nr_remaining, nr[lru]);
}

/* Stop scanning the smaller LRU */
lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
nr[LRU_ACTIVE + lru] = 0;
nr[lru] = 0;

Is this what you had in mind or did you have something simpler in mind?

--
Mel Gorman
SUSE Labs

2013-03-21 18:07:40

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 09/10] mm: vmscan: Check if kswapd should writepage once per priority

On Thu, Mar 21, 2013 at 05:58:37PM +0100, Michal Hocko wrote:
> On Sun 17-03-13 13:04:15, Mel Gorman wrote:
> > Currently kswapd checks if it should start writepage as it shrinks
> > each zone without taking into consideration if the zone is balanced or
> > not. This is not wrong as such but it does not make much sense either.
> > This patch checks once per priority if kswapd should be writing pages.
>
> Except it is not once per priority strictly speaking... It doesn't make
> any difference though.
>

Whoops, at one point during development it really was once per priority
which was always raised. I reworded it to "once per pgdat scan".

Thanks.

--
Mel Gorman
SUSE Labs

2013-03-21 18:14:03

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 10/10] mm: vmscan: Move logic from balance_pgdat() to kswapd_shrink_zone()

On Thu, Mar 21, 2013 at 06:18:04PM +0100, Michal Hocko wrote:
> On Sun 17-03-13 13:04:16, Mel Gorman wrote:
> > +
> > + /*
> > + * Kswapd reclaims only single pages with compaction enabled. Trying
> > + * too hard to reclaim until contiguous free pages have become
> > + * available can hurt performance by evicting too much useful data
> > + * from memory. Do not reclaim more than needed for compaction.
> > + */
> > + if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
> > + compaction_suitable(zone, sc->order) !=
> > + COMPACT_SKIPPED)
> > + testorder = 0;
> > +
> > + /*
> > + * We put equal pressure on every zone, unless one zone has way too
> > + * many pages free already. The "too many pages" is defined as the
> > + * high wmark plus a "gap" where the gap is either the low
> > + * watermark or 1% of the zone, whichever is smaller.
> > + */
> > + balance_gap = min(low_wmark_pages(zone),
> > + (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
> > + KSWAPD_ZONE_BALANCE_GAP_RATIO);
> > +
> > + /*
> > + * If there is no low memory pressure or the zone is balanced then no
> > + * reclaim is necessary
> > + */
> > + lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
> > + if (!(lowmem_pressure || !zone_balanced(zone, testorder,
> > + balance_gap, classzone_idx)))
>
> if (!lowmem_pressure && zone_balanced) would be less cryptic I guess
>

It would.

> > + return true;
> > +
> > shrink_zone(zone, sc);
> >
> > /*
> > @@ -2689,6 +2724,16 @@ static bool kswapd_shrink_zone(struct zone *zone,
> >
> > zone_clear_flag(zone, ZONE_WRITEBACK);
> >
> > + /*
> > + * If a zone reaches its high watermark, consider it to be no longer
> > + * congested. It's possible there are dirty pages backed by congested
> > + * BDIs but as pressure is relieved, speculatively avoid congestion
> > + * waits.
> > + */
> > + if (!zone->all_unreclaimable &&
> > + zone_balanced(zone, testorder, 0, classzone_idx))
> > + zone_clear_flag(zone, ZONE_CONGESTED);
> > +
> > return sc->nr_scanned >= sc->nr_to_reclaim;
> > }
> >
> > @@ -2821,8 +2866,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> > */
> > for (i = 0; i <= end_zone; i++) {
> > struct zone *zone = pgdat->node_zones + i;
> > - int testorder;
> > - unsigned long balance_gap;
> >
> > if (!populated_zone(zone))
> > continue;
> > @@ -2843,61 +2886,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> > sc.nr_reclaimed += nr_soft_reclaimed;
> >
> > /*
> > - * We put equal pressure on every zone, unless
> > - * one zone has way too many pages free
> > - * already. The "too many pages" is defined
> > - * as the high wmark plus a "gap" where the
> > - * gap is either the low watermark or 1%
> > - * of the zone, whichever is smaller.
> > - */
> > - balance_gap = min(low_wmark_pages(zone),
> > - (zone->managed_pages +
> > - KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
> > - KSWAPD_ZONE_BALANCE_GAP_RATIO);
> > - /*
> > - * Kswapd reclaims only single pages with compaction
> > - * enabled. Trying too hard to reclaim until contiguous
> > - * free pages have become available can hurt performance
> > - * by evicting too much useful data from memory.
> > - * Do not reclaim more than needed for compaction.
> > + * There should be no need to raise the scanning
> > + * priority if enough pages are already being scanned
> > + * that that high watermark would be met at 100%
> > + * efficiency.
> > */
> > - testorder = order;
> > - if (IS_ENABLED(CONFIG_COMPACTION) && order &&
> > - compaction_suitable(zone, order) !=
> > - COMPACT_SKIPPED)
> > - testorder = 0;
> > -
> > - if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
> > - !zone_balanced(zone, testorder,
> > - balance_gap, end_zone)) {
> > - /*
> > - * There should be no need to raise the
> > - * scanning priority if enough pages are
> > - * already being scanned that that high
> > - * watermark would be met at 100% efficiency.
> > - */
> > - if (kswapd_shrink_zone(zone, &sc,
> > + if (kswapd_shrink_zone(zone, end_zone, &sc,
> > lru_pages, shrinking_slab))
> > raise_priority = false;
> >
> > - nr_to_reclaim += sc.nr_to_reclaim;
> > - }
> > -
> > - if (zone->all_unreclaimable) {
> > - if (end_zone && end_zone == i)
> > - end_zone--;
> > - continue;
> > - }
> > -
> > - if (zone_balanced(zone, testorder, 0, end_zone))
> > - /*
> > - * If a zone reaches its high watermark,
> > - * consider it to be no longer congested. It's
> > - * possible there are dirty pages backed by
> > - * congested BDIs but as pressure is relieved,
> > - * speculatively avoid congestion waits
> > - */
> > - zone_clear_flag(zone, ZONE_CONGESTED);
> > + nr_to_reclaim += sc.nr_to_reclaim;
> > }
>
> nr_to_reclaim is updated if the zone is balanced and no reclaim is done,
> which breaks the compaction condition AFAICS.
>

True, it only makes sense to account for the pages it actually attempted
to reclaim. Thanks.

--
Mel Gorman
SUSE Labs

2013-03-21 18:15:38

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 06/10] mm: vmscan: Have kswapd writeback pages based on dirty pages encountered, not priority

On Thu, Mar 21, 2013 at 01:53:41PM -0400, Rik van Riel wrote:
> On 03/17/2013 11:11 AM, Mel Gorman wrote:
> >On Sun, Mar 17, 2013 at 07:42:39AM -0700, Andi Kleen wrote:
> >>Mel Gorman <[email protected]> writes:
> >>
> >>>@@ -495,6 +495,9 @@ typedef enum {
> >>> ZONE_CONGESTED, /* zone has many dirty pages backed by
> >>> * a congested BDI
> >>> */
> >>>+ ZONE_DIRTY, /* reclaim scanning has recently found
> >>>+ * many dirty file pages
> >>>+ */
> >>
> >>Needs a better name. ZONE_DIRTY_CONGESTED ?
> >>
> >
> >That might be confusing. The underlying BDI is not necessarily
> >congested. I accept your point though and will try thinking of a better
> >name.
>
> ZONE_LOTS_DIRTY ?
>

I had changed it to

ZONE_TAIL_LRU_DIRTY,    /* reclaim scanning has recently found
                         * many dirty file pages at the tail
                         * of the LRU.
                         */

Is that reasonable?

--
Mel Gorman
SUSE Labs

2013-03-21 18:24:08

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 06/10] mm: vmscan: Have kswapd writeback pages based on dirty pages encountered, not priority

On 03/21/2013 02:15 PM, Mel Gorman wrote:
> On Thu, Mar 21, 2013 at 01:53:41PM -0400, Rik van Riel wrote:
>> On 03/17/2013 11:11 AM, Mel Gorman wrote:
>>> On Sun, Mar 17, 2013 at 07:42:39AM -0700, Andi Kleen wrote:
>>>> Mel Gorman <[email protected]> writes:
>>>>
>>>>> @@ -495,6 +495,9 @@ typedef enum {
>>>>> ZONE_CONGESTED, /* zone has many dirty pages backed by
>>>>> * a congested BDI
>>>>> */
>>>>> + ZONE_DIRTY, /* reclaim scanning has recently found
>>>>> + * many dirty file pages
>>>>> + */
>>>>
>>>> Needs a better name. ZONE_DIRTY_CONGESTED ?
>>>>
>>>
>>> That might be confusing. The underlying BDI is not necessarily
>>> congested. I accept your point though and will try thinking of a better
>>> name.
>>
>> ZONE_LOTS_DIRTY ?
>>
>
> I had changed it to
>
> ZONE_TAIL_LRU_DIRTY, /* reclaim scanning has recently found
> * many dirty file pages at the tail
> * of the LRU.
> */
>
> Is that reasonable?

Works for me.

2013-03-21 18:44:32

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 07/10] mm: vmscan: Block kswapd if it is encountering pages under writeback

On 03/17/2013 09:04 AM, Mel Gorman wrote:
> Historically, kswapd used to congestion_wait() at higher priorities if it
> was not making forward progress. This made no sense as the failure to make
> progress could be completely independent of IO. It was later replaced by
> wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't
> wait on congested zones in balance_pgdat()) as it was duplicating logic
> in shrink_inactive_list().
>
> This is problematic. If kswapd encounters many pages under writeback and
> it continues to scan until it reaches the high watermark then it will
> quickly skip over the pages under writeback and reclaim clean young
> pages or push applications out to swap.
>
> The use of wait_iff_congested() is not suited to kswapd as it will only
> stall if the underlying BDI is really congested or a direct reclaimer was
> unable to write to the underlying BDI. kswapd bypasses the BDI congestion
> as it sets PF_SWAPWRITE but even if this was taken into account then it
> would cause direct reclaimers to stall on writeback which is not desirable.
>
> This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
> encountering too many pages under writeback. If this flag is set and
> kswapd encounters a PageReclaim page under writeback then it'll assume
> that the LRU lists are being recycled too quickly before IO can complete
> and block waiting for some IO to complete.

I really like the concept of this patch.

> @@ -756,9 +769,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> */
> SetPageReclaim(page);
> nr_writeback++;
> +
> goto keep_locked;
> + } else {
> + wait_on_page_writeback(page);
> }
> - wait_on_page_writeback(page);
> }
>
> if (!force_reclaim)

This looks like an area for future improvement.

We do not need to wait for this specific page to finish writeback,
we only have to wait for any (bunch of) page(s) to finish writeback,
since we do not particularly care which of the pages from near the
end of the LRU get reclaimed first.

I wonder if this is one of the causes for the high latencies that
are sometimes observed in direct reclaim...

2013-03-21 19:50:06

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

On 03/17/2013 09:04 AM, Mel Gorman wrote:
> If kswapd fails to make progress but continues to shrink slab then it'll
> either discard all of slab or consume CPU uselessly scanning shrinkers.
> This patch causes kswapd to only call the shrinkers once per priority.
>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Rik van Riel <[email protected]>

2013-03-21 19:55:17

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 09/10] mm: vmscan: Check if kswapd should writepage once per priority

On 03/17/2013 09:04 AM, Mel Gorman wrote:
> Currently kswapd checks if it should start writepage as it shrinks
> each zone without taking into consideration if the zone is balanced or
> not. This is not wrong as such but it does not make much sense either.
> This patch checks once per priority if kswapd should be writing pages.
>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Rik van Riel <[email protected]>

2013-03-22 00:05:41

by Will Huck

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

Hi Johannes,
On 03/21/2013 11:57 PM, Johannes Weiner wrote:
> On Sun, Mar 17, 2013 at 01:04:07PM +0000, Mel Gorman wrote:
>> The number of pages kswapd can reclaim is bound by the number of pages it
>> scans which is related to the size of the zone and the scanning priority. In
>> many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
>> reclaimed pages but in the event kswapd scans a large number of pages it
>> cannot reclaim, it will raise the priority and potentially discard a large
>> percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
>> effect is a reclaim "spike" where a large percentage of memory is suddenly
>> freed. It would be bad enough if this was just unused memory but because
>> of how anon/file pages are balanced it is possible that applications get
>> pushed to swap unnecessarily.
>>
>> This patch limits the number of pages kswapd will reclaim to the high
>> watermark. Reclaim will will overshoot due to it not being a hard limit as
> will -> still?
>
>> shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
>> prevents kswapd reclaiming the world at higher priorities. The number of
>> pages it reclaims is not adjusted for high-order allocations as kswapd will
>> reclaim excessively if it is to balance zones for high-order allocations.
> I don't really understand this last sentence. Is the excessive
> reclaim a result of the patch, a description of what's happening
> now...?
>
>> Signed-off-by: Mel Gorman <[email protected]>
> Nice, thank you. Using the high watermark for larger zones is more
> reasonable than my hack that just always went with SWAP_CLUSTER_MAX,
> what with inter-zone LRU cycle time balancing and all.
>
> Acked-by: Johannes Weiner <[email protected]>

One offline question, how to understand this in function balance_pgdat:
/*
* Do some background aging of the anon list, to give
* pages a chance to be referenced before reclaiming.
*/
age_active_anon(zone, &sc);

2013-03-22 00:09:05

by Will Huck

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

Hi Rik,
On 03/21/2013 08:52 AM, Rik van Riel wrote:
> On 03/20/2013 12:18 PM, Michal Hocko wrote:
>> On Sun 17-03-13 13:04:07, Mel Gorman wrote:
>> [...]
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 88c5fed..4835a7a 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -2593,6 +2593,32 @@ static bool prepare_kswapd_sleep(pg_data_t
>>> *pgdat, int order, long remaining,
>>> }
>>>
>>> /*
>>> + * kswapd shrinks the zone by the number of pages required to reach
>>> + * the high watermark.
>>> + */
>>> +static void kswapd_shrink_zone(struct zone *zone,
>>> + struct scan_control *sc,
>>> + unsigned long lru_pages)
>>> +{
>>> + unsigned long nr_slab;
>>> + struct reclaim_state *reclaim_state = current->reclaim_state;
>>> + struct shrink_control shrink = {
>>> + .gfp_mask = sc->gfp_mask,
>>> + };
>>> +
>>> + /* Reclaim above the high watermark. */
>>> + sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
>>
>> OK, so the cap is at high watermark which sounds OK to me, although I
>> would expect balance_gap being considered here. Is it not used
>> intentionally or you just wanted to have a reasonable upper bound?
>>
>> I am not objecting to that it just hit my eyes.
>
> This is the maximum number of pages to reclaim, not the point
> at which to stop reclaiming.

What's the difference between the maximum number of pages to reclaim and
the point at which to stop reclaiming?

>
> I assume Mel chose this value because it guarantees that enough
> pages will have been freed, while also making sure that the value
> is scaled according to zone size (keeping pressure between zones
> roughly equal).
>
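
Purely as an illustration of that cap (a stand-alone toy with made-up
watermark values, not kernel code), the per-zone reclaim target scales
with the zone's high watermark but never drops below one cluster:

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

/* mirrors sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone)) */
static unsigned long zone_reclaim_target(unsigned long high_wmark_pages)
{
        return high_wmark_pages > SWAP_CLUSTER_MAX ?
                        high_wmark_pages : SWAP_CLUSTER_MAX;
}

int main(void)
{
        /* hypothetical high watermarks for a tiny and a large zone */
        printf("tiny zone:  %lu pages\n", zone_reclaim_target(16));     /* 32 */
        printf("large zone: %lu pages\n", zone_reclaim_target(12288));  /* 12288 */
        return 0;
}

So a small zone is still asked for at least SWAP_CLUSTER_MAX pages while
a large zone's target tracks its (larger) high watermark, which is what
keeps the pressure between zones roughly in line with their size.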

2013-03-22 03:54:04

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On 03/21/2013 08:05 PM, Will Huck wrote:

> One offline question, how to understand this in function balance_pgdat:
> /*
> * Do some background aging of the anon list, to give
> * pages a chance to be referenced before reclaiming.
> */
> age_active_anon(zone, &sc);

The anon lrus use a two-handed clock algorithm. New anonymous pages
start off on the active anon list. Older anonymous pages get moved
to the inactive anon list.

If they get referenced before they reach the end of the inactive anon
list, they get moved back to the active list.

If we need to swap something out and find a non-referenced page at the
end of the inactive anon list, we will swap it out.

In order to make good pageout decisions, pages need to stay on the
inactive anon list for a longer time, so they have plenty of time to
get referenced, before the reclaim code looks at them.

To achieve that, we will move some active anon pages to the inactive
anon list even when we do not want to swap anything out - as long as
the inactive anon list is below its target size.

Does that make sense?
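
To make the aging step concrete, here is a stand-alone sketch (toy list
sizes and a simplified 1:1 inactive target; the kernel's
inactive_anon_is_low() derives the target ratio from the zone size):

#include <stdio.h>

/* made-up list sizes, not kernel values */
static unsigned long nr_active = 900, nr_inactive = 100;

/* toy target: keep the inactive list at least as large as the active one */
static int inactive_anon_is_low(void)
{
        return nr_inactive < nr_active;
}

/* background aging: rotate some active pages onto the inactive list */
static void age_active_anon(unsigned long nr_to_move)
{
        if (!inactive_anon_is_low())
                return;
        if (nr_to_move > nr_active)
                nr_to_move = nr_active;
        nr_active -= nr_to_move;
        nr_inactive += nr_to_move;
}

int main(void)
{
        int pass;

        /* a few aging passes, as balance_pgdat() would do on each wakeup */
        for (pass = 0; pass < 5; pass++) {
                age_active_anon(128);
                printf("pass %d: active=%lu inactive=%lu\n",
                       pass, nr_active, nr_inactive);
        }
        return 0;
}

The point is only that pages drift from the active to the inactive list
over time without anything being reclaimed, so by the time reclaim looks
at the tail of the inactive list each page has had a chance to be
referenced and rescued back to the active list.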

--
All rights reversed

2013-03-22 05:00:13

by Will Huck

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

Hi Rik,
On 03/22/2013 11:56 AM, Will Huck wrote:
> Hi Rik,
> On 03/22/2013 11:52 AM, Rik van Riel wrote:
>> On 03/21/2013 08:05 PM, Will Huck wrote:
>>
>>> One offline question, how to understand this in function balance_pgdat:
>>> /*
>>> * Do some background aging of the anon list, to give
>>> * pages a chance to be referenced before reclaiming.
>>> */
>>> age_active_anon(zone, &sc);
>>
>> The anon lrus use a two-handed clock algorithm. New anonymous pages
>> start off on the active anon list. Older anonymous pages get moved
>> to the inactive anon list.
>
> The file lrus also use the two-handed clock algorithm, correct?

After reinvestigating the code, the answer is no. But why the
difference? I think you are the expert on this question and I look
forward to your explanation. :-)

>
>>
>> If they get referenced before they reach the end of the inactive anon
>> list, they get moved back to the active list.
>>
>> If we need to swap something out and find a non-referenced page at the
>> end of the inactive anon list, we will swap it out.
>>
>> In order to make good pageout decisions, pages need to stay on the
>> inactive anon list for a longer time, so they have plenty of time to
>> get referenced, before the reclaim code looks at them.
>>
>> To achieve that, we will move some active anon pages to the inactive
>> anon list even when we do not want to swap anything out - as long as
>> the inactive anon list is below its target size.
>>
>> Does that make sense?
>
> Makes sense, thanks.
>

2013-03-22 05:03:10

by Will Huck

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

Hi Rik,
On 03/22/2013 11:52 AM, Rik van Riel wrote:
> On 03/21/2013 08:05 PM, Will Huck wrote:
>
>> One offline question, how to understand this in function balance_pgdat:
>> /*
>> * Do some background aging of the anon list, to give
>> * pages a chance to be referenced before reclaiming.
>> */
>> age_active_anon(zone, &sc);
>
> The anon lrus use a two-handed clock algorithm. New anonymous pages
> start off on the active anon list. Older anonymous pages get moved
> to the inactive anon list.

The file lrus also use the two-handed clock algorithm, correct?

>
> If they get referenced before they reach the end of the inactive anon
> list, they get moved back to the active list.
>
> If we need to swap something out and find a non-referenced page at the
> end of the inactive anon list, we will swap it out.
>
> In order to make good pageout decisions, pages need to stay on the
> inactive anon list for a longer time, so they have plenty of time to
> get referenced, before the reclaim code looks at them.
>
> To achieve that, we will move some active anon pages to the inactive
> anon list even when we do not want to swap anything out - as long as
> the inactive anon list is below its target size.
>
> Does that make sense?

Makes sense, thanks.

2013-03-22 07:54:31

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On Thu 21-03-13 15:34:42, Mel Gorman wrote:
> On Thu, Mar 21, 2013 at 04:07:55PM +0100, Michal Hocko wrote:
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index 4835a7a..182ff15 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -1815,6 +1815,45 @@ out:
> > > > > }
> > > > > }
> > > > >
> > > > > +static void recalculate_scan_count(unsigned long nr_reclaimed,
> > > > > + unsigned long nr_to_reclaim,
> > > > > + unsigned long nr[NR_LRU_LISTS])
> > > > > +{
> > > > > + enum lru_list l;
> > > > > +
> > > > > + /*
> > > > > + * For direct reclaim, reclaim the number of pages requested. Less
> > > > > + * care is taken to ensure that scanning for each LRU is properly
> > > > > + * proportional. This is unfortunate and is improper aging but
> > > > > + * minimises the amount of time a process is stalled.
> > > > > + */
> > > > > + if (!current_is_kswapd()) {
> > > > > + if (nr_reclaimed >= nr_to_reclaim) {
> > > > > + for_each_evictable_lru(l)
> > > > > + nr[l] = 0;
> > > > > + }
> > > > > + return;
> > > >
> > > > Heh, this is nicely cryptically said what could be done in shrink_lruvec
> > > > as
> > > > if (!current_is_kswapd()) {
> > > > if (nr_reclaimed >= nr_to_reclaim)
> > > > break;
> > > > }
> > > >
> > >
> > > Pretty much. At one point during development, this function was more
> > > complex and it evolved into this without me rechecking if splitting it
> > > out still made sense.
> > >
> > > > Besides that this is not memcg aware which I think it would break
> > > > targeted reclaim which is kind of direct reclaim but it still would be
> > > > good to stay proportional because it starts with DEF_PRIORITY.
> > > >
> > >
> > > This does break memcg because it's a special sort of direct reclaim.
> > >
> > > > I would suggest moving this back to shrink_lruvec and update the test as
> > > > follows:
> > >
> > > I also noticed that we check whether the scan counts need to be
> > > normalised more than once
> >
> > I didn't mind this because it "disqualified" at least one LRU every
> > round which sounds reasonable to me because all LRUs would be scanned
> > proportionally.
>
> Once the scan count for one LRU is 0 then min will always be 0 and no
> further adjustment is made. It's just redundant to check again.

Hmm, I was almost sure I wrote that min should be adjusted only if it is >0
in the first loop but it is not there...

So for real this time.
for_each_evictable_lru(l)
        if (nr[l] && nr[l] < min)
                min = nr[l];

This should work, no? Every time you shrink all the LRUs and you have
already reclaimed enough, you get the smallest LRU out of the game. This
should keep the proportions even.
--
Michal Hocko
SUSE Labs

2013-03-22 08:27:26

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 07/10] mm: vmscan: Block kswapd if it is encountering pages under writeback

On Thu, Mar 21, 2013 at 02:42:26PM -0400, Rik van Riel wrote:
> On 03/17/2013 09:04 AM, Mel Gorman wrote:
> >Historically, kswapd used to congestion_wait() at higher priorities if it
> >was not making forward progress. This made no sense as the failure to make
> >progress could be completely independent of IO. It was later replaced by
> >wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't
> >wait on congested zones in balance_pgdat()) as it was duplicating logic
> >in shrink_inactive_list().
> >
> >This is problematic. If kswapd encounters many pages under writeback and
> >it continues to scan until it reaches the high watermark then it will
> >quickly skip over the pages under writeback and reclaim clean young
> >pages or push applications out to swap.
> >
> >The use of wait_iff_congested() is not suited to kswapd as it will only
> >stall if the underlying BDI is really congested or a direct reclaimer was
> >unable to write to the underlying BDI. kswapd bypasses the BDI congestion
> >as it sets PF_SWAPWRITE but even if this was taken into account then it
> >would cause direct reclaimers to stall on writeback which is not desirable.
> >
> >This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
> >encountering too many pages under writeback. If this flag is set and
> >kswapd encounters a PageReclaim page under writeback then it'll assume
> >that the LRU lists are being recycled too quickly before IO can complete
> >and block waiting for some IO to complete.
>
> I really like the concept of this patch.
>

Thanks.

> >@@ -756,9 +769,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > */
> > SetPageReclaim(page);
> > nr_writeback++;
> >+
> > goto keep_locked;
> >+ } else {
> >+ wait_on_page_writeback(page);
> > }
> >- wait_on_page_writeback(page);
> > }
> >
> > if (!force_reclaim)
>
> This looks like an area for future improvement.
>
> We do not need to wait for this specific page to finish writeback,
> we only have to wait for any (bunch of) page(s) to finish writeback,
> since we do not particularly care which of the pages from near the
> end of the LRU get reclaimed first.
>

We do not have a good interface for waiting on IO to complete on any of a
list of pages. It could be polled but that feels unsatisfactory. Calling
congestion_wait() would sort of work and it's sort of what we used to do
in the past based on scanning priority but it only works if we happen to
wait on the correct async/sync queue and there is no guarantee that it'll
wake when IO on a relevant page completes.

> I wonder if this is one of the causes for the high latencies that
> are sometimes observed in direct reclaim...
>

I'm skeptical.

In the far past, direct reclaim would only indirectly stall on page writeback
using congestion_wait or a similar interface. Later it was possible for
direct reclaim to stall on wait_on_page_writeback() during lumpy reclaim
and that might be what you're thinking of?

It could be an indirect cause of direct reclaim stalls. If kswapd is
blocked on page writeback then it does mean that a process may stall in
direct reclaim because kswapd is not making forward progress but it
should not be the cause of high latencies.

Under what circumstances are you seeing high latencies in
direct reclaim? We should be able to measure the stalls using the
trace_mm_vmscan_direct_reclaim_begin and trace_mm_vmscan_direct_reclaim_end
tracepoints and pin down the cause.

--
Mel Gorman
SUSE Labs

2013-03-22 08:37:11

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On Fri, Mar 22, 2013 at 08:54:27AM +0100, Michal Hocko wrote:
> On Thu 21-03-13 15:34:42, Mel Gorman wrote:
> > On Thu, Mar 21, 2013 at 04:07:55PM +0100, Michal Hocko wrote:
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index 4835a7a..182ff15 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > > > @@ -1815,6 +1815,45 @@ out:
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > +static void recalculate_scan_count(unsigned long nr_reclaimed,
> > > > > > + unsigned long nr_to_reclaim,
> > > > > > + unsigned long nr[NR_LRU_LISTS])
> > > > > > +{
> > > > > > + enum lru_list l;
> > > > > > +
> > > > > > + /*
> > > > > > + * For direct reclaim, reclaim the number of pages requested. Less
> > > > > > + * care is taken to ensure that scanning for each LRU is properly
> > > > > > + * proportional. This is unfortunate and is improper aging but
> > > > > > + * minimises the amount of time a process is stalled.
> > > > > > + */
> > > > > > + if (!current_is_kswapd()) {
> > > > > > + if (nr_reclaimed >= nr_to_reclaim) {
> > > > > > + for_each_evictable_lru(l)
> > > > > > + nr[l] = 0;
> > > > > > + }
> > > > > > + return;
> > > > >
> > > > > Heh, this is nicely cryptically said what could be done in shrink_lruvec
> > > > > as
> > > > > if (!current_is_kswapd()) {
> > > > > if (nr_reclaimed >= nr_to_reclaim)
> > > > > break;
> > > > > }
> > > > >
> > > >
> > > > Pretty much. At one point during development, this function was more
> > > > complex and it evolved into this without me rechecking if splitting it
> > > > out still made sense.
> > > >
> > > > > Besides that this is not memcg aware which I think it would break
> > > > > targeted reclaim which is kind of direct reclaim but it still would be
> > > > > good to stay proportional because it starts with DEF_PRIORITY.
> > > > >
> > > >
> > > > This does break memcg because it's a special sort of direct reclaim.
> > > >
> > > > > I would suggest moving this back to shrink_lruvec and update the test as
> > > > > follows:
> > > >
> > > > I also noticed that we check whether the scan counts need to be
> > > > normalised more than once
> > >
> > > I didn't mind this because it "disqualified" at least one LRU every
> > > round which sounds reasonable to me because all LRUs would be scanned
> > > proportionally.
> >
> > Once the scan count for one LRU is 0 then min will always be 0 and no
> > further adjustment is made. It's just redundant to check again.
>
> Hmm, I was almost sure I wrote that min should be adjusted only if it is >0
> in the first loop but it is not there...
>
> So for real this time.
> for_each_evictable_lru(l)
> if (nr[l] && nr[l] < min)
> min = nr[l];
>
> This should work, no? Every time you shrink all LRUs and you have already
> reclaimed enough, you take the smallest LRU out of the game. This
> should keep the proportions even.

Lets say we started like this

LRU_INACTIVE_ANON 60
LRU_ACTIVE_FILE 1000
LRU_INACTIVE_FILE 3000

and we've reclaimed nr_to_reclaim pages then we recalculate the number
of pages to scan from each list as;

LRU_INACTIVE_ANON 0
LRU_ACTIVE_FILE 940
LRU_INACTIVE_FILE 2940

We then shrink SWAP_CLUSTER_MAX from each LRU giving us this.

LRU_INACTIVE_ANON 0
LRU_ACTIVE_FILE 908
LRU_INACTIVE_FILE 2908

Then under your suggestion this would be recalculated as

LRU_INACTIVE_ANON 0
LRU_ACTIVE_FILE 0
LRU_INACTIVE_FILE 2000

Another SWAP_CLUSTER_MAX is reclaimed and then we stop reclaiming. I
might still be missing the point of your suggestion but I do not think it
would preserve the proportion of pages we reclaim from the anon or file LRUs.
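
For reference, the rounds above can be reproduced with a trivial simulation of
the suggested "subtract the smallest non-zero nr[]" rule; the program below is
illustrative only and assumes SWAP_CLUSTER_MAX is 32.

/*
 * Simulates the example above: nr[] = {60, 1000, 3000}, reclaim in
 * SWAP_CLUSTER_MAX batches, and whenever "enough" has been reclaimed
 * subtract the smallest remaining nr[] from every list.  Illustrative only.
 */
#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL
#define NR_LISTS 3

static const char *name[NR_LISTS] = {
	"LRU_INACTIVE_ANON", "LRU_ACTIVE_FILE", "LRU_INACTIVE_FILE"
};

static void adjust_by_min(unsigned long nr[NR_LISTS])
{
	unsigned long min = ~0UL;
	int i;

	for (i = 0; i < NR_LISTS; i++)
		if (nr[i] && nr[i] < min)
			min = nr[i];
	for (i = 0; i < NR_LISTS; i++)
		nr[i] = nr[i] > min ? nr[i] - min : 0;
}

static void show(const char *what, unsigned long nr[NR_LISTS])
{
	int i;

	printf("%s\n", what);
	for (i = 0; i < NR_LISTS; i++)
		printf("  %-18s %lu\n", name[i], nr[i]);
}

int main(void)
{
	unsigned long nr[NR_LISTS] = { 60, 1000, 3000 };
	int i;

	adjust_by_min(nr);		/* first "reclaimed enough" point */
	show("after first adjustment", nr);

	for (i = 0; i < NR_LISTS; i++)	/* one SWAP_CLUSTER_MAX round */
		if (nr[i])
			nr[i] -= SWAP_CLUSTER_MAX;
	show("after one reclaim round", nr);

	adjust_by_min(nr);
	show("after second adjustment", nr);
	return 0;
}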

--
Mel Gorman
SUSE Labs

2013-03-22 10:04:53

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On Fri 22-03-13 08:37:04, Mel Gorman wrote:
> On Fri, Mar 22, 2013 at 08:54:27AM +0100, Michal Hocko wrote:
> > On Thu 21-03-13 15:34:42, Mel Gorman wrote:
> > > On Thu, Mar 21, 2013 at 04:07:55PM +0100, Michal Hocko wrote:
> > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > > index 4835a7a..182ff15 100644
> > > > > > > --- a/mm/vmscan.c
> > > > > > > +++ b/mm/vmscan.c
> > > > > > > @@ -1815,6 +1815,45 @@ out:
> > > > > > > }
> > > > > > > }
> > > > > > >
> > > > > > > +static void recalculate_scan_count(unsigned long nr_reclaimed,
> > > > > > > + unsigned long nr_to_reclaim,
> > > > > > > + unsigned long nr[NR_LRU_LISTS])
> > > > > > > +{
> > > > > > > + enum lru_list l;
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * For direct reclaim, reclaim the number of pages requested. Less
> > > > > > > + * care is taken to ensure that scanning for each LRU is properly
> > > > > > > + * proportional. This is unfortunate and is improper aging but
> > > > > > > + * minimises the amount of time a process is stalled.
> > > > > > > + */
> > > > > > > + if (!current_is_kswapd()) {
> > > > > > > + if (nr_reclaimed >= nr_to_reclaim) {
> > > > > > > + for_each_evictable_lru(l)
> > > > > > > + nr[l] = 0;
> > > > > > > + }
> > > > > > > + return;
> > > > > >
> > > > > > Heh, this is nicely cryptically said what could be done in shrink_lruvec
> > > > > > as
> > > > > > if (!current_is_kswapd()) {
> > > > > > if (nr_reclaimed >= nr_to_reclaim)
> > > > > > break;
> > > > > > }
> > > > > >
> > > > >
> > > > > Pretty much. At one point during development, this function was more
> > > > > complex and it evolved into this without me rechecking if splitting it
> > > > > out still made sense.
> > > > >
> > > > > > Besides that this is not memcg aware which I think it would break
> > > > > > targeted reclaim which is kind of direct reclaim but it still would be
> > > > > > good to stay proportional because it starts with DEF_PRIORITY.
> > > > > >
> > > > >
> > > > > This does break memcg because it's a special sort of direct reclaim.
> > > > >
> > > > > > I would suggest moving this back to shrink_lruvec and update the test as
> > > > > > follows:
> > > > >
> > > > > I also noticed that we check whether the scan counts need to be
> > > > > normalised more than once
> > > >
> > > > I didn't mind this because it "disqualified" at least one LRU every
> > > > round which sounds reasonable to me because all LRUs would be scanned
> > > > proportionally.
> > >
> > > Once the scan count for one LRU is 0 then min will always be 0 and no
> > > further adjustment is made. It's just redundant to check again.
> >
> > Hmm, I was almost sure I wrote that min should be adjusted only if it is >0
> > in the first loop but it is not there...
> >
> > So for real this time.
> > for_each_evictable_lru(l)
> > if (nr[l] && nr[l] < min)
> > min = nr[l];
> >
> > This should work, no? Every time you shrink all LRUs and you have already
> > reclaimed enough, you take the smallest LRU out of the game. This
> > should keep the proportions even.
>
> Lets say we started like this
>
> LRU_INACTIVE_ANON 60
> LRU_ACTIVE_FILE 1000
> LRU_INACTIVE_FILE 3000
>
> and we've reclaimed nr_to_reclaim pages then we recalculate the number
> of pages to scan from each list as;
>
> LRU_INACTIVE_ANON 0
> LRU_ACTIVE_FILE 940
> LRU_INACTIVE_FILE 2940
>
> We then shrink SWAP_CLUSTER_MAX from each LRU giving us this.
>
> LRU_INACTIVE_ANON 0
> LRU_ACTIVE_FILE 908
> LRU_INACTIVE_FILE 2908
>
> Then under your suggestion this would be recalculated as
>
> LRU_INACTIVE_ANON 0
> LRU_ACTIVE_FILE 0
> LRU_INACTIVE_FILE 2000
>
> Another SWAP_CLUSTER_MAX is reclaimed and then we stop reclaiming. I
> might still be missing the point of your suggestion but I do not think it
> would preserve the proportion of pages we reclaim from the anon or file LRUs.

It wouldn't preserve the proportions precisely because each reclaim round is
done in SWAP_CLUSTER_MAX units, but it would reclaim more from bigger lists than
from smaller ones, which I thought was the whole point. So yes, using the word
"proportionally" is unfortunate but I didn't find a better one.
--
Michal Hocko
SUSE Labs

2013-03-22 10:47:42

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On Fri 22-03-13 11:04:49, Michal Hocko wrote:
> On Fri 22-03-13 08:37:04, Mel Gorman wrote:
> > On Fri, Mar 22, 2013 at 08:54:27AM +0100, Michal Hocko wrote:
> > > On Thu 21-03-13 15:34:42, Mel Gorman wrote:
> > > > On Thu, Mar 21, 2013 at 04:07:55PM +0100, Michal Hocko wrote:
> > > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > > > index 4835a7a..182ff15 100644
> > > > > > > > --- a/mm/vmscan.c
> > > > > > > > +++ b/mm/vmscan.c
> > > > > > > > @@ -1815,6 +1815,45 @@ out:
> > > > > > > > }
> > > > > > > > }
> > > > > > > >
> > > > > > > > +static void recalculate_scan_count(unsigned long nr_reclaimed,
> > > > > > > > + unsigned long nr_to_reclaim,
> > > > > > > > + unsigned long nr[NR_LRU_LISTS])
> > > > > > > > +{
> > > > > > > > + enum lru_list l;
> > > > > > > > +
> > > > > > > > + /*
> > > > > > > > + * For direct reclaim, reclaim the number of pages requested. Less
> > > > > > > > + * care is taken to ensure that scanning for each LRU is properly
> > > > > > > > + * proportional. This is unfortunate and is improper aging but
> > > > > > > > + * minimises the amount of time a process is stalled.
> > > > > > > > + */
> > > > > > > > + if (!current_is_kswapd()) {
> > > > > > > > + if (nr_reclaimed >= nr_to_reclaim) {
> > > > > > > > + for_each_evictable_lru(l)
> > > > > > > > + nr[l] = 0;
> > > > > > > > + }
> > > > > > > > + return;
> > > > > > >
> > > > > > > Heh, this is nicely cryptically said what could be done in shrink_lruvec
> > > > > > > as
> > > > > > > if (!current_is_kswapd()) {
> > > > > > > if (nr_reclaimed >= nr_to_reclaim)
> > > > > > > break;
> > > > > > > }
> > > > > > >
> > > > > >
> > > > > > Pretty much. At one point during development, this function was more
> > > > > > complex and it evolved into this without me rechecking if splitting it
> > > > > > out still made sense.
> > > > > >
> > > > > > > Besides that this is not memcg aware which I think it would break
> > > > > > > targeted reclaim which is kind of direct reclaim but it still would be
> > > > > > > good to stay proportional because it starts with DEF_PRIORITY.
> > > > > > >
> > > > > >
> > > > > > This does break memcg because it's a special sort of direct reclaim.
> > > > > >
> > > > > > > I would suggest moving this back to shrink_lruvec and update the test as
> > > > > > > follows:
> > > > > >
> > > > > > I also noticed that we check whether the scan counts need to be
> > > > > > normalised more than once
> > > > >
> > > > > I didn't mind this because it "disqualified" at least one LRU every
> > > > > round which sounds reasonable to me because all LRUs would be scanned
> > > > > proportionally.
> > > >
> > > > Once the scan count for one LRU is 0 then min will always be 0 and no
> > > > further adjustment is made. It's just redundant to check again.
> > >
> > > Hmm, I was almost sure I wrote that min should be adjusted only if it is >0
> > > in the first loop but it is not there...
> > >
> > > So for real this time.
> > > for_each_evictable_lru(l)
> > > if (nr[l] && nr[l] < min)
> > > min = nr[l];
> > >
> > > This should work, no? Every time you shrink all LRUs and you have already
> > > reclaimed enough, you take the smallest LRU out of the game. This
> > > should keep the proportions even.
> >
> > Lets say we started like this
> >
> > LRU_INACTIVE_ANON 60
> > LRU_ACTIVE_FILE 1000
> > LRU_INACTIVE_FILE 3000
> >
> > and we've reclaimed nr_to_reclaim pages then we recalculate the number
> > of pages to scan from each list as;
> >
> > LRU_INACTIVE_ANON 0
> > LRU_ACTIVE_FILE 940
> > LRU_INACTIVE_FILE 2940
> >
> > We then shrink SWAP_CLUSTER_MAX from each LRU giving us this.
> >
> > LRU_INACTIVE_ANON 0
> > LRU_ACTIVE_FILE 908
> > LRU_INACTIVE_FILE 2908
> >
> > Then under your suggestion this would be recalculated as
> >
> > LRU_INACTIVE_ANON 0
> > LRU_ACTIVE_FILE 0
> > LRU_INACTIVE_FILE 2000
> >
> > Another SWAP_CLUSTER_MAX is reclaimed and then we stop reclaiming. I
> > might still be missing the point of your suggestion but I do not think it
> > would preserve the proportion of pages we reclaim from the anon or file LRUs.
>
> It wouldn't preserve the proportions precisely because each reclaim round is
> done in SWAP_CLUSTER_MAX units, but it would reclaim more from bigger lists than
> from smaller ones, which I thought was the whole point. So yes, using the word
> "proportionally" is unfortunate but I didn't find a better one.

OK, I have obviously missed that you are not breaking out of the loop if
scan_adjusted. Now that I am looking at the updated patch again you just
do
	if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
		continue;

So I thought you would just do one round of reclaim after
nr_reclaimed >= nr_to_reclaim which didn't feel right to me.

Sorry about the confusion!
--
Michal Hocko
SUSE Labs

2013-03-22 13:02:55

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On 03/22/2013 12:59 AM, Will Huck wrote:
> Hi Rik,
> On 03/22/2013 11:56 AM, Will Huck wrote:
>> Hi Rik,
>> On 03/22/2013 11:52 AM, Rik van Riel wrote:
>>> On 03/21/2013 08:05 PM, Will Huck wrote:
>>>
>>>> One offline question, how to understand this in function balance_pgdat:
>>>> /*
>>>> * Do some background aging of the anon list, to give
>>>> * pages a chance to be referenced before reclaiming.
>>>> */
>>>> age_acitve_anon(zone, &sc);
>>>
>>> The anon lrus use a two-handed clock algorithm. New anonymous pages
>>> start off on the active anon list. Older anonymous pages get moved
>>> to the inactive anon list.
>>
>> The file lrus also use the two-handed clock algorithm, correct?
>
> After reinvestigate the codes, the answer is no. But why have this
> difference? I think you are the expert for this question, expect your
> explanation. :-)

Anonymous memory has a smaller amount of memory (on the order
of system memory), most of which is or has been in a working
set at some point.

File system cache tends to have two distinct sets. One part
are the frequently accessed files, another part are the files
that are accessed just once or twice.

The file working set needs to be protected from streaming
IO. We do this by having new file pages start out on the
inactive file list, and only promoted to the active file
list if they get accessed twice.
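
A toy model of that promotion rule, purely for illustration (nothing below
resembles the real LRU implementation): pages sit in the inactive state until
a second access promotes them, so a streaming pass that touches every page
once never displaces the active set.

/*
 * Toy model of "promote file pages to the active list only on a second
 * access".  Purely illustrative; not the kernel's implementation.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_PAGES 8

struct page {
	bool referenced;	/* accessed since entering the inactive list */
	bool active;
};

static struct page cache[NR_PAGES];

static void touch(int idx)
{
	struct page *p = &cache[idx];

	if (p->active)
		return;
	if (p->referenced)
		p->active = true;	/* second access: promote */
	else
		p->referenced = true;	/* first access: stay inactive */
}

int main(void)
{
	int i;

	/* Streaming IO: every page touched exactly once */
	for (i = 0; i < NR_PAGES; i++)
		touch(i);

	/* A small working set: pages 0 and 1 touched again */
	touch(0);
	touch(1);

	for (i = 0; i < NR_PAGES; i++)
		printf("page %d: %s\n", i,
		       cache[i].active ? "active" : "inactive");
	return 0;
}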


--
All rights reversed

2013-03-22 14:37:35

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Reduce system disruption due to kswapd

On Sun, Mar 17, 2013 at 01:04:06PM +0000, Mel Gorman wrote:
> Kswapd and page reclaim behaviour has been screwy in one way or the other
> for a long time. Very broadly speaking it worked in the far past because
> machines were limited in memory so it did not have that many pages to scan
> and it stalled congestion_wait() frequently to prevent it going completely
> nuts. In recent times it has behaved very unsatisfactorily with some of
> the problems compounded by the removal of stall logic and the introduction
> of transparent hugepage support with high-order reclaims.
>

With the current set of feedback the series as it currently stands for
me is located here

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-vmscan-limit-reclaim-v2r7

I haven't tested this version myself yet but others might be interested.

--
Mel Gorman
SUSE Labs

2013-03-22 16:54:55

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On Thu, Mar 21, 2013 at 06:02:38PM +0000, Mel Gorman wrote:
> On Thu, Mar 21, 2013 at 12:25:18PM -0400, Johannes Weiner wrote:
> > On Sun, Mar 17, 2013 at 01:04:08PM +0000, Mel Gorman wrote:
> > > Simplistically, the anon and file LRU lists are scanned proportionally
> > > depending on the value of vm.swappiness although there are other factors
> > > taken into account by get_scan_count(). The patch "mm: vmscan: Limit
> > > the number of pages kswapd reclaims" limits the number of pages kswapd
> > > reclaims but it breaks this proportional scanning and may evenly shrink
> > > anon/file LRUs regardless of vm.swappiness.
> > >
> > > This patch preserves the proportional scanning and reclaim. It does mean
> > > that kswapd will reclaim more than requested but the number of pages will
> > > be related to the high watermark.
> >
> > Swappiness is about page types, but this implementation compares all
> > LRUs against each other, and I'm not convinced that this makes sense
> > as there is no guaranteed balance between the inactive and active
> > lists. For example, the active file LRU could get knocked out when
> > it's almost empty while the inactive file LRU has more easy cache than
> > the anon lists combined.
> >
>
> Ok, I see your point. I think Michal was making the same point but I
> failed to understand it the first time around.
>
> > Would it be better to compare the sum of file pages with the sum of
> > anon pages and then knock out the smaller pair?
>
> Yes, it makes more sense but the issue then becomes how can we do that
> sensibly. The following is straight-forward and roughly in line with your
> suggestion but it does not preserve the scanning ratio between active and
> inactive of the remaining LRU lists.

After thinking more about it, I wonder if subtracting absolute values
of one LRU goal from the other is right to begin with, because the
anon/file balance percentage is applied to individual LRU sizes, and
these sizes are not necessarily comparable.

Consider an unbalanced case of 64 file and 32768 anon pages targeted.
If the balance is 70% file and 30% anon, we will scan 70% of those 64
file pages and 30% of the 32768 anon pages.

Say we decide to bail after one iteration of 32 file pages reclaimed.
We would have scanned only 50% of the targeted file pages, but
subtracting those remaining 32 leaves us with 99% of the targeted
anon pages.

So would it make sense to determine the percentage scanned of the type
that we stop scanning, then scale the original goal of the remaining
LRUs to that percentage, and scan the remainder?

In the above example, we'd determine we scanned 50% of the targeted
file pages, so we reduce the anon inactive and active goals to 50% of
their original values, then scan the difference between those reduced
goals and the pages already scanned.
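
Using the numbers above, the scaling works out as in the sketch below
(illustrative only; the inactive/active split within each type is ignored and
the amount already scanned from anon is made up).

/*
 * Illustrates the scaling proposed above with the example numbers:
 * stop one type early, work out what fraction of its scan target was
 * met, and scale the other type's remaining goals to the same fraction.
 * All values are illustrative only.
 */
#include <stdio.h>

int main(void)
{
	unsigned long file_target = 64;		/* targeted file pages */
	unsigned long anon_target = 32768;	/* targeted anon pages */
	unsigned long file_scanned = 32;	/* bailed after one batch */
	unsigned long anon_scanned = 32;	/* one batch from anon too */

	/* Fraction of the file target that was actually scanned */
	unsigned long percentage = file_scanned * 100 / file_target;

	/* Scale the anon goal to the same fraction ... */
	unsigned long anon_goal = anon_target * percentage / 100;
	/* ... and scan only the difference from what was already done */
	unsigned long anon_remaining = anon_goal - anon_scanned;

	printf("scanned %lu%% of the file target\n", percentage);
	printf("reduced anon goal: %lu, still to scan: %lu\n",
	       anon_goal, anon_remaining);
	return 0;
}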

2013-03-22 18:26:06

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On Fri, Mar 22, 2013 at 12:53:49PM -0400, Johannes Weiner wrote:
> On Thu, Mar 21, 2013 at 06:02:38PM +0000, Mel Gorman wrote:
> > On Thu, Mar 21, 2013 at 12:25:18PM -0400, Johannes Weiner wrote:
> > > On Sun, Mar 17, 2013 at 01:04:08PM +0000, Mel Gorman wrote:
> > > > Simplistically, the anon and file LRU lists are scanned proportionally
> > > > depending on the value of vm.swappiness although there are other factors
> > > > taken into account by get_scan_count(). The patch "mm: vmscan: Limit
> > > > the number of pages kswapd reclaims" limits the number of pages kswapd
> > > > reclaims but it breaks this proportional scanning and may evenly shrink
> > > > anon/file LRUs regardless of vm.swappiness.
> > > >
> > > > This patch preserves the proportional scanning and reclaim. It does mean
> > > > that kswapd will reclaim more than requested but the number of pages will
> > > > be related to the high watermark.
> > >
> > > Swappiness is about page types, but this implementation compares all
> > > LRUs against each other, and I'm not convinced that this makes sense
> > > as there is no guaranteed balance between the inactive and active
> > > lists. For example, the active file LRU could get knocked out when
> > > it's almost empty while the inactive file LRU has more easy cache than
> > > the anon lists combined.
> > >
> >
> > Ok, I see your point. I think Michal was making the same point but I
> > failed to understand it the first time around.
> >
> > > Would it be better to compare the sum of file pages with the sum of
> > > anon pages and then knock out the smaller pair?
> >
> > Yes, it makes more sense but the issue then becomes how can we do that
> > sensibly. The following is straight-forward and roughly in line with your
> > suggestion but it does not preserve the scanning ratio between active and
> > inactive of the remaining LRU lists.
>
> After thinking more about it, I wonder if subtracting absolute values
> of one LRU goal from the other is right to begin with, because the
> anon/file balance percentage is applied to individual LRU sizes, and
> these sizes are not necessarily comparable.
>

Good point and in itself it's not 100% clear that it's a good idea. If
swappiness reflected the ratio of anon/file pages that get reclaimed
then it would be very easy to reason about. By our current definition, the
rate at which anon or file pages get reclaimed adjusts as reclaim progresses.

> <Snipped the example>
>

I agree and I see your point.

> So would it make sense to determine the percentage scanned of the type
> that we stop scanning, then scale the original goal of the remaining
> LRUs to that percentage, and scan the remainder?
>

To preserve existing behaviour, that makes sense. I'm not convinced that
it's necessarily the best idea but altering it would be beyond the scope
of this series and bite off more than I'm willing to chew. This actually
simplifies things a bit and shrink_lruvec turns into the (untested) code
below. It does not do exact proportional scanning but I do not think it's
necessary to either and is a useful enough approximation. It still could
end up reclaiming much more than sc->nr_to_reclaim unfortunately but fixing
it requires reworking how kswapd scans at different priorities.

Is this closer to what you had in mind?

static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
	unsigned long nr[NR_LRU_LISTS];
	unsigned long nr_to_scan;
	enum lru_list lru;
	unsigned long nr_reclaimed = 0;
	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
	unsigned long nr_anon_scantarget, nr_file_scantarget;
	struct blk_plug plug;
	bool scan_adjusted = false;

	get_scan_count(lruvec, sc, nr);

	/* Record the original scan target for proportional adjustments later */
	nr_file_scantarget = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE] + 1;
	nr_anon_scantarget = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON] + 1;

	blk_start_plug(&plug);
	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
					nr[LRU_INACTIVE_FILE]) {
		unsigned long nr_anon, nr_file, percentage;

		for_each_evictable_lru(lru) {
			if (nr[lru]) {
				nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
				nr[lru] -= nr_to_scan;

				nr_reclaimed += shrink_list(lru, nr_to_scan,
							    lruvec, sc);
			}
		}

		if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
			continue;

		/*
		 * For global direct reclaim, reclaim only the number of pages
		 * requested. Less care is taken to scan proportionally as it
		 * is more important to minimise direct reclaim stall latency
		 * than it is to properly age the LRU lists.
		 */
		if (global_reclaim(sc) && !current_is_kswapd())
			break;

		/*
		 * For kswapd and memcg, reclaim at least the number of pages
		 * requested. Ensure that the anon and file LRUs shrink
		 * proportionally to what was requested by get_scan_count(). We
		 * stop reclaiming one LRU and reduce the amount of scanning
		 * proportional to the original scan target.
		 */
		nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
		nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];

		if (nr_file > nr_anon) {
			lru = LRU_BASE;
			percentage = nr_anon * 100 / nr_anon_scantarget;
		} else {
			lru = LRU_FILE;
			percentage = nr_file * 100 / nr_file_scantarget;
		}

		/* Stop scanning the smaller of the LRU */
		nr[lru] = 0;
		nr[lru + LRU_ACTIVE] = 0;

		/* Reduce scanning of the other LRU proportionally */
		lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
		nr[lru] = nr[lru] * percentage / 100;
		nr[lru + LRU_ACTIVE] = nr[lru + LRU_ACTIVE] * percentage / 100;

		scan_adjusted = true;
	}
	blk_finish_plug(&plug);
	sc->nr_reclaimed += nr_reclaimed;

	/*
	 * Even if we did not try to evict anon pages at all, we want to
	 * rebalance the anon lru active/inactive ratio.
	 */
	if (inactive_anon_is_low(lruvec))
		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
				   sc, LRU_ACTIVE_ANON);

	throttle_vm_writeout(sc->gfp_mask);
}


--
Mel Gorman
SUSE Labs

2013-03-22 19:10:14

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On Fri, Mar 22, 2013 at 06:25:56PM +0000, Mel Gorman wrote:
> On Fri, Mar 22, 2013 at 12:53:49PM -0400, Johannes Weiner wrote:
> > So would it make sense to determine the percentage scanned of the type
> > that we stop scanning, then scale the original goal of the remaining
> > LRUs to that percentage, and scan the remainder?
>
> To preserve existing behaviour, that makes sense. I'm not convinced that
> it's necessarily the best idea but altering it would be beyond the scope
> of this series and bite off more than I'm willing to chew. This actually
> simplifies things a bit and shrink_lruvec turns into the (untested) code
> below. It does not do exact proportional scanning but I do not think it's
> necessary to either and is a useful enough approximation. It still could
> end up reclaiming much more than sc->nr_to_reclaim unfortunately but fixing
> it requires reworking how kswapd scans at different priorities.

In which way does it not do exact proportional scanning? I commented
on one issue below, but maybe you were referring to something else.

Yes, it's a little unfortunate that we escalate to a gigantic scan
window first, and then have to contort ourselves in the process of
backing off gracefully after we reclaimed a few pages...

> Is this closer to what you had in mind?
>
> static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> {
> unsigned long nr[NR_LRU_LISTS];
> unsigned long nr_to_scan;
> enum lru_list lru;
> unsigned long nr_reclaimed = 0;
> unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> unsigned long nr_anon_scantarget, nr_file_scantarget;
> struct blk_plug plug;
> bool scan_adjusted = false;
>
> get_scan_count(lruvec, sc, nr);
>
> /* Record the original scan target for proportional adjustments later */
> nr_file_scantarget = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE] + 1;
> nr_anon_scantarget = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON] + 1;
>
> blk_start_plug(&plug);
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> nr[LRU_INACTIVE_FILE]) {
> unsigned long nr_anon, nr_file, percentage;
>
> for_each_evictable_lru(lru) {
> if (nr[lru]) {
> nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
> nr[lru] -= nr_to_scan;
>
> nr_reclaimed += shrink_list(lru, nr_to_scan,
> lruvec, sc);
> }
> }
>
> if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
> continue;
>
> /*
> * For global direct reclaim, reclaim only the number of pages
> * requested. Less care is taken to scan proportionally as it
> * is more important to minimise direct reclaim stall latency
> * than it is to properly age the LRU lists.
> */
> if (global_reclaim(sc) && !current_is_kswapd())
> break;
>
> /*
> * For kswapd and memcg, reclaim at least the number of pages
> * requested. Ensure that the anon and file LRUs shrink
> * proportionally what was requested by get_scan_count(). We
> * stop reclaiming one LRU and reduce the amount scanning
> * proportional to the original scan target.
> */
> nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
> nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
>
> if (nr_file > nr_anon) {
> lru = LRU_BASE;
> percentage = nr_anon * 100 / nr_anon_scantarget;
> } else {
> lru = LRU_FILE;
> percentage = nr_file * 100 / nr_file_scantarget;
> }
>
> /* Stop scanning the smaller of the LRU */
> nr[lru] = 0;
> nr[lru + LRU_ACTIVE] = 0;
>
> /* Reduce scanning of the other LRU proportionally */
> lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
> nr[lru] = nr[lru] * percentage / 100;
> nr[lru + LRU_ACTIVE] = nr[lru + LRU_ACTIVE] * percentage / 100;

The percentage is taken from the original goal but then applied to the
remainder of scan goal for the LRUs we continue scanning. The more
pages that have already been scanned, the more inaccurate this gets.
Is that what you had in mind with useful enough approximation?

2013-03-22 19:46:14

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/10] mm: vmscan: Obey proportional scanning requirements for kswapd

On Fri, Mar 22, 2013 at 03:09:02PM -0400, Johannes Weiner wrote:
> > To preserve existing behaviour, that makes sense. I'm not convinced that
> > it's necessarily the best idea but altering it would be beyond the scope
> > of this series and bite off more than I'm willing to chew. This actually
> > simplifies things a bit and shrink_lruvec turns into the (untested) code
> > below. It does not do exact proportional scanning but I do not think it's
> > necessary to either and is a useful enough approximation. It still could
> > end up reclaiming much more than sc->nr_to_reclaim unfortunately but fixing
> > it requires reworking how kswapd scans at different priorities.
>
> In which way does it not do exact proportional scanning? I commented
> on one issue below, but maybe you were referring to something else.
>

You guessed what I was referring to correctly.

> Yes, it's a little unfortunate that we escalate to a gigantic scan
> window first, and then have to contort ourselves in the process of
> backing off gracefully after we reclaimed a few pages...
>

The next patch "mm: vmscan: Flatten kswapd priority loop" mitigates the
problem slightly by improving how kswapd controls when priority gets raised.
It's not perfect though, lots of pages under writeback at the tail of
the LRU will still raise the priority quickly.

> > Is this closer to what you had in mind?
> >
> > static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > {
> > unsigned long nr[NR_LRU_LISTS];
> > unsigned long nr_to_scan;
> > enum lru_list lru;
> > unsigned long nr_reclaimed = 0;
> > unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> > unsigned long nr_anon_scantarget, nr_file_scantarget;
> > struct blk_plug plug;
> > bool scan_adjusted = false;
> >
> > get_scan_count(lruvec, sc, nr);
> >
> > /* Record the original scan target for proportional adjustments later */
> > nr_file_scantarget = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE] + 1;
> > nr_anon_scantarget = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON] + 1;
> >
> > blk_start_plug(&plug);
> > while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> > nr[LRU_INACTIVE_FILE]) {
> > unsigned long nr_anon, nr_file, percentage;
> >
> > for_each_evictable_lru(lru) {
> > if (nr[lru]) {
> > nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
> > nr[lru] -= nr_to_scan;
> >
> > nr_reclaimed += shrink_list(lru, nr_to_scan,
> > lruvec, sc);
> > }
> > }
> >
> > if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
> > continue;
> >
> > /*
> > * For global direct reclaim, reclaim only the number of pages
> > * requested. Less care is taken to scan proportionally as it
> > * is more important to minimise direct reclaim stall latency
> > * than it is to properly age the LRU lists.
> > */
> > if (global_reclaim(sc) && !current_is_kswapd())
> > break;
> >
> > /*
> > * For kswapd and memcg, reclaim at least the number of pages
> > * requested. Ensure that the anon and file LRUs shrink
> > * proportionally what was requested by get_scan_count(). We
> > * stop reclaiming one LRU and reduce the amount scanning
> > * proportional to the original scan target.
> > */
> > nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
> > nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
> >
> > if (nr_file > nr_anon) {
> > lru = LRU_BASE;
> > percentage = nr_anon * 100 / nr_anon_scantarget;
> > } else {
> > lru = LRU_FILE;
> > percentage = nr_file * 100 / nr_file_scantarget;
> > }
> >
> > /* Stop scanning the smaller of the LRU */
> > nr[lru] = 0;
> > nr[lru + LRU_ACTIVE] = 0;
> >
> > /* Reduce scanning of the other LRU proportionally */
> > lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
> > nr[lru] = nr[lru] * percentage / 100;
> > nr[lru + LRU_ACTIVE] = nr[lru + LRU_ACTIVE] * percentage / 100;
>
> The percentage is taken from the original goal but then applied to the
> remainder of scan goal for the LRUs we continue scanning. The more
> pages that have already been scanned, the more inaccurate this gets.
> Is that what you had in mind with useful enough approximation?

Yes. I could record the original scan rates, recalculate as a percentage
and then do something like

nr[lru] = min(nr[lru], origin_nr[lru] * percentage / 100)

but it was not obvious that the result would be any better.
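
The drift is easy to see numerically. The sketch below (made-up numbers, not
kernel code) compares applying the percentage to the remainder, as the code
above does, with the min()-of-the-original-target alternative; the two agree
when little has been scanned and diverge as more of the target is consumed.

/*
 * Made-up numbers comparing the two back-off calculations discussed here:
 *   (a) remaining * percentage / 100             (what the code above does)
 *   (b) min(remaining, original * percentage / 100)
 * They agree when nothing has been scanned yet and drift apart as more
 * of the original target is consumed.  Illustrative only.
 */
#include <stdio.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned long original = 1000;	/* original scan target for this LRU */
	unsigned long percentage = 50;	/* whatever the adjustment computed */
	unsigned long scanned;

	for (scanned = 0; scanned <= 800; scanned += 200) {
		unsigned long remaining = original - scanned;
		unsigned long a = remaining * percentage / 100;
		unsigned long b = min_ul(remaining,
					 original * percentage / 100);

		printf("scanned %4lu: remainder-based %4lu, original-based %4lu\n",
		       scanned, a, b);
	}
	return 0;
}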


--
Mel Gorman
SUSE Labs

2013-03-24 19:00:14

by Jiri Slaby

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Reduce system disruption due to kswapd

On 03/17/2013 02:04 PM, Mel Gorman wrote:
> Kswapd and page reclaim behaviour has been screwy in one way or the other
> for a long time. Very broadly speaking it worked in the far past because
> machines were limited in memory so it did not have that many pages to scan
> and it stalled congestion_wait() frequently to prevent it going completely
> nuts. In recent times it has behaved very unsatisfactorily with some of
> the problems compounded by the removal of stall logic and the introduction
> of transparent hugepage support with high-order reclaims.
>
> There are many variations of bugs that are rooted in this area. One example
> is reports of a large copy operations or backup causing the machine to
> grind to a halt or applications pushed to swap. Sometimes in low memory
> situations a large percentage of memory suddenly gets reclaimed. In other
> cases an application starts and kswapd hits 100% CPU usage for prolonged
> periods of time and so on. There is now talk of introducing features like
> an extra free kbytes tunable to work around aspects of the problem instead
> of trying to deal with it. It's compounded by the problem that it can be
> very workload and machine specific.
>
> This RFC is aimed at investigating if kswapd can be address these various
> problems in a relatively straight-forward fashion without a fundamental
> rewrite.
>
> Patches 1-2 limits the number of pages kswapd reclaims while still obeying
> the anon/file proportion of the LRUs it should be scanning.

Hi,

patch 1 does not apply (on top of -next), so I can't test this :(.

thanks,
--
js
suse labs

2013-03-25 08:17:28

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] Reduce system disruption due to kswapd

On Sun 24-03-13 20:00:07, Jiri Slaby wrote:
[...]
> Hi,
>
> patch 1 does not apply (on top of -next), so I can't test this :(.

It conflicts with (mm/vmscan.c: minor cleanup for kswapd). The one below
should apply
---
From 027ce7ca785ecde184f858aa234bdc9461f1e3aa Mon Sep 17 00:00:00 2001
From: Mel Gorman <[email protected]>
Date: Mon, 11 Mar 2013 15:50:56 +0000
Subject: [PATCH] mm: vmscan: Limit the number of pages kswapd reclaims at
each priority

The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority. In
many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
reclaimed pages but in the event kswapd scans a large number of pages it
cannot reclaim, it will raise the priority and potentially discard a large
percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
effect is a reclaim "spike" where a large percentage of memory is suddenly
freed. It would be bad enough if this was just unused memory but because
of how anon/file pages are balanced it is possible that applications get
pushed to swap unnecessarily.

This patch limits the number of pages kswapd will reclaim to the high
watermark. Reclaim will still overshoot due to it not being a hard limit as
shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities. The number of
pages it reclaims is not adjusted for high-order allocations as kswapd will
reclaim excessively if it is to balance zones for high-order allocations.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 49 +++++++++++++++++++++++++++++--------------------
1 file changed, 29 insertions(+), 20 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index df78d17..4835a7a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2593,6 +2593,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
}

/*
+ * kswapd shrinks the zone by the number of pages required to reach
+ * the high watermark.
+ */
+static void kswapd_shrink_zone(struct zone *zone,
+ struct scan_control *sc,
+ unsigned long lru_pages)
+{
+ unsigned long nr_slab;
+ struct reclaim_state *reclaim_state = current->reclaim_state;
+ struct shrink_control shrink = {
+ .gfp_mask = sc->gfp_mask,
+ };
+
+ /* Reclaim above the high watermark. */
+ sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
+ shrink_zone(zone, sc);
+
+ reclaim_state->reclaimed_slab = 0;
+ nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
+ sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+
+ if (nr_slab == 0 && !zone_reclaimable(zone))
+ zone->all_unreclaimable = 1;
+}
+
+/*
* For kswapd, balance_pgdat() will work across all this node's zones until
* they are all at high_wmark_pages(zone).
*
@@ -2619,24 +2645,15 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
bool pgdat_is_balanced = false;
int i;
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
- struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.may_unmap = 1,
.may_swap = 1,
- /*
- * kswapd doesn't want to be bailed out while reclaim. because
- * we want to put equal scanning pressure on each zone.
- */
- .nr_to_reclaim = ULONG_MAX,
.order = order,
.target_mem_cgroup = NULL,
};
- struct shrink_control shrink = {
- .gfp_mask = sc.gfp_mask,
- };
loop_again:
sc.priority = DEF_PRIORITY;
sc.nr_reclaimed = 0;
@@ -2708,7 +2725,7 @@ loop_again:
*/
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
- int nr_slab, testorder;
+ int testorder;
unsigned long balance_gap;

if (!populated_zone(zone))
@@ -2756,16 +2773,8 @@ loop_again:

if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
!zone_balanced(zone, testorder,
- balance_gap, end_zone)) {
- shrink_zone(zone, &sc);
-
- reclaim_state->reclaimed_slab = 0;
- nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
- sc.nr_reclaimed += reclaim_state->reclaimed_slab;
-
- if (nr_slab == 0 && !zone_reclaimable(zone))
- zone->all_unreclaimable = 1;
- }
+ balance_gap, end_zone))
+ kswapd_shrink_zone(zone, &sc, lru_pages);

/*
* If we're getting trouble reclaiming, start doing
--
1.7.10.4

--
Michal Hocko
SUSE Labs

2013-03-25 09:08:04

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On Sun 17-03-13 13:04:07, Mel Gorman wrote:
> The number of pages kswapd can reclaim is bound by the number of pages it
> scans which is related to the size of the zone and the scanning priority. In
> many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
> reclaimed pages but in the event kswapd scans a large number of pages it
> cannot reclaim, it will raise the priority and potentially discard a large
> percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
> effect is a reclaim "spike" where a large percentage of memory is suddenly
> freed. It would be bad enough if this was just unused memory but because
> of how anon/file pages are balanced it is possible that applications get
> pushed to swap unnecessarily.
>
> This patch limits the number of pages kswapd will reclaim to the high
> watermark. Reclaim will still overshoot due to it not being a hard limit as
> shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
> prevents kswapd reclaiming the world at higher priorities. The number of
> pages it reclaims is not adjusted for high-order allocations as kswapd will
> reclaim excessively if it is to balance zones for high-order allocations.
>
> Signed-off-by: Mel Gorman <[email protected]>

It seems I forgot to add
Reviewed-by: Michal Hocko <[email protected]>

> ---
> mm/vmscan.c | 53 +++++++++++++++++++++++++++++------------------------
> 1 file changed, 29 insertions(+), 24 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 88c5fed..4835a7a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2593,6 +2593,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> }
>
> /*
> + * kswapd shrinks the zone by the number of pages required to reach
> + * the high watermark.
> + */
> +static void kswapd_shrink_zone(struct zone *zone,
> + struct scan_control *sc,
> + unsigned long lru_pages)
> +{
> + unsigned long nr_slab;
> + struct reclaim_state *reclaim_state = current->reclaim_state;
> + struct shrink_control shrink = {
> + .gfp_mask = sc->gfp_mask,
> + };
> +
> + /* Reclaim above the high watermark. */
> + sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> + shrink_zone(zone, sc);
> +
> + reclaim_state->reclaimed_slab = 0;
> + nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> + sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> +
> + if (nr_slab == 0 && !zone_reclaimable(zone))
> + zone->all_unreclaimable = 1;
> +}
> +
> +/*
> * For kswapd, balance_pgdat() will work across all this node's zones until
> * they are all at high_wmark_pages(zone).
> *
> @@ -2619,27 +2645,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> bool pgdat_is_balanced = false;
> int i;
> int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
> - unsigned long total_scanned;
> - struct reclaim_state *reclaim_state = current->reclaim_state;
> unsigned long nr_soft_reclaimed;
> unsigned long nr_soft_scanned;
> struct scan_control sc = {
> .gfp_mask = GFP_KERNEL,
> .may_unmap = 1,
> .may_swap = 1,
> - /*
> - * kswapd doesn't want to be bailed out while reclaim. because
> - * we want to put equal scanning pressure on each zone.
> - */
> - .nr_to_reclaim = ULONG_MAX,
> .order = order,
> .target_mem_cgroup = NULL,
> };
> - struct shrink_control shrink = {
> - .gfp_mask = sc.gfp_mask,
> - };
> loop_again:
> - total_scanned = 0;
> sc.priority = DEF_PRIORITY;
> sc.nr_reclaimed = 0;
> sc.may_writepage = !laptop_mode;
> @@ -2710,7 +2725,7 @@ loop_again:
> */
> for (i = 0; i <= end_zone; i++) {
> struct zone *zone = pgdat->node_zones + i;
> - int nr_slab, testorder;
> + int testorder;
> unsigned long balance_gap;
>
> if (!populated_zone(zone))
> @@ -2730,7 +2745,6 @@ loop_again:
> order, sc.gfp_mask,
> &nr_soft_scanned);
> sc.nr_reclaimed += nr_soft_reclaimed;
> - total_scanned += nr_soft_scanned;
>
> /*
> * We put equal pressure on every zone, unless
> @@ -2759,17 +2773,8 @@ loop_again:
>
> if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
> !zone_balanced(zone, testorder,
> - balance_gap, end_zone)) {
> - shrink_zone(zone, &sc);
> -
> - reclaim_state->reclaimed_slab = 0;
> - nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> - sc.nr_reclaimed += reclaim_state->reclaimed_slab;
> - total_scanned += sc.nr_scanned;
> -
> - if (nr_slab == 0 && !zone_reclaimable(zone))
> - zone->all_unreclaimable = 1;
> - }
> + balance_gap, end_zone))
> + kswapd_shrink_zone(zone, &sc, lru_pages);
>
> /*
> * If we're getting trouble reclaiming, start doing
> --
> 1.8.1.4
>

--
Michal Hocko
SUSE Labs

2013-03-25 09:13:48

by Jiri Slaby

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On 03/25/2013 10:07 AM, Michal Hocko wrote:
> On Sun 17-03-13 13:04:07, Mel Gorman wrote:
>> The number of pages kswapd can reclaim is bound by the number of pages it
>> scans which is related to the size of the zone and the scanning priority. In
>> many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
>> reclaimed pages but in the event kswapd scans a large number of pages it
>> cannot reclaim, it will raise the priority and potentially discard a large
>> percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
>> effect is a reclaim "spike" where a large percentage of memory is suddenly
>> freed. It would be bad enough if this was just unused memory but because
>> of how anon/file pages are balanced it is possible that applications get
>> pushed to swap unnecessarily.
>>
>> This patch limits the number of pages kswapd will reclaim to the high
>> watermark. Reclaim will still overshoot due to it not being a hard limit as
>> shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
>> prevents kswapd reclaiming the world at higher priorities. The number of
>> pages it reclaims is not adjusted for high-order allocations as kswapd will
>> reclaim excessively if it is to balance zones for high-order allocations.
>>
>> Signed-off-by: Mel Gorman <[email protected]>
>
> It seems I forgot to add
> Reviewed-by: Michal Hocko <[email protected]>

Thanks, now I applied all ten.

BTW I very pray this will fix also the issue I have when I run ltp tests
(highly I/O intensive, esp. `growfiles') in a VM while playing a movie
on the host resulting in a stuttered playback ;).

--
js
suse labs

2013-03-28 22:31:23

by Jiri Slaby

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On 03/25/2013 10:13 AM, Jiri Slaby wrote:
> BTW I very pray this will fix also the issue I have when I run ltp tests
> (highly I/O intensive, esp. `growfiles') in a VM while playing a movie
> on the host resulting in a stuttered playback ;).

No, this is still terrible. I was now updating a kernel in a VM and had
problems to even move with cursor. There was still 1.2G used by I/O cache.

thanks,
--
js
suse labs

2013-03-29 08:23:02

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On Thu 28-03-13 23:31:18, Jiri Slaby wrote:
> On 03/25/2013 10:13 AM, Jiri Slaby wrote:
> > BTW I very pray this will fix also the issue I have when I run ltp tests
> > (highly I/O intensive, esp. `growfiles') in a VM while playing a movie
> > on the host resulting in a stuttered playback ;).
>
> No, this is still terrible. I was now updating a kernel in a VM and had
> problems to even move with cursor.

:/

> There was still 1.2G used by I/O cache.

Could you collect /proc/zoneinfo and /proc/vmstat (say in 1 or 2s
intervals)?
--
Michal Hocko
SUSE Labs

2013-03-30 22:07:10

by Jiri Slaby

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On 03/29/2013 09:22 AM, Michal Hocko wrote:
> On Thu 28-03-13 23:31:18, Jiri Slaby wrote:
>> On 03/25/2013 10:13 AM, Jiri Slaby wrote:
>>> BTW I very pray this will fix also the issue I have when I run ltp tests
>>> (highly I/O intensive, esp. `growfiles') in a VM while playing a movie
>>> on the host resulting in a stuttered playback ;).
>>
>> No, this is still terrible. I was now updating a kernel in a VM and had
>> problems to even move with cursor.
>
> :/
>
>> There was still 1.2G used by I/O cache.
>
> Could you collect /proc/zoneinfo and /proc/vmstat (say in 1 or 2s
> intervals)?

Sure:
http://www.fi.muni.cz/~xslaby/sklad/zoneinfos.tar.xz

I ran the update like 10 s after I started taking snapshots. Mplayer
immediately started complaining:

************************************************
**** Your system is too SLOW to play this! ****
************************************************
etc.

thanks,
--
js
suse labs

2013-04-02 11:15:12

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On Sat, Mar 30, 2013 at 11:07:03PM +0100, Jiri Slaby wrote:
> On 03/29/2013 09:22 AM, Michal Hocko wrote:
> > On Thu 28-03-13 23:31:18, Jiri Slaby wrote:
> >> On 03/25/2013 10:13 AM, Jiri Slaby wrote:
> >>> BTW I very pray this will fix also the issue I have when I run ltp tests
> >>> (highly I/O intensive, esp. `growfiles') in a VM while playing a movie
> >>> on the host resulting in a stuttered playback ;).
> >>
> >> No, this is still terrible. I was now updating a kernel in a VM and had
> >> problems to even move with cursor.
> >
> > :/
> >
> >> There was still 1.2G used by I/O cache.
> >
> > Could you collect /proc/zoneinfo and /proc/vmstat (say in 1 or 2s
> > intervals)?
>
> Sure:
> http://www.fi.muni.cz/~xslaby/sklad/zoneinfos.tar.xz
>

There are no vmstat snapshots so we cannot see reclaim activity. However,
based on the zoneinfo I suspect there is little. The anon and file pages
are growing and there is no nr_vmscan_write or nr_vmscan_immediate_reclaim
activity. nr_isolated_* occasionally has a few entries so there is some
reclaim activity but I'm not sure there is enough for this series to
make a difference.

nr_writeback is high during the window you recorded so there is IO
activity, but I wonder if the source of the stalls in this case is an
IO change in the 3.9-rc window or a scheduler change.

There still is a reclaim-related problem but in this particular case I
think you might be triggering a different problem, one that the series
is not going to address.

Can you check vmstat and make sure reclaim is actually active when
mplayer performance goes to hell please?
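
If it helps, something like the snippet below can be left running while
reproducing the stall; it snapshots the reclaim-related /proc/vmstat counters
every two seconds and matches on counter prefixes rather than exact names,
since those vary between kernel versions (an illustrative helper, not part of
any existing tool).

/*
 * Periodically dump reclaim-related counters from /proc/vmstat so the
 * values can be lined up against the mplayer stalls.  Field names differ
 * between kernel versions, so match on prefixes instead of exact names.
 */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static int interesting(const char *line)
{
	return !strncmp(line, "pgscan", 6) ||
	       !strncmp(line, "pgsteal", 7) ||
	       !strncmp(line, "nr_vmscan", 9) ||
	       !strncmp(line, "nr_writeback", 12) ||
	       !strncmp(line, "nr_isolated", 11);
}

int main(void)
{
	char line[256];

	for (;;) {
		FILE *fp = fopen("/proc/vmstat", "r");

		if (!fp) {
			perror("/proc/vmstat");
			return 1;
		}

		printf("--- %ld\n", (long)time(NULL));
		while (fgets(line, sizeof(line), fp))
			if (interesting(line))
				fputs(line, stdout);
		fclose(fp);
		fflush(stdout);
		sleep(2);
	}
	return 0;
}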

--
Mel Gorman
SUSE Labs

2013-04-05 00:05:37

by Will Huck

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

Hi Rik,
On 03/22/2013 09:01 PM, Rik van Riel wrote:
> On 03/22/2013 12:59 AM, Will Huck wrote:
>> Hi Rik,
>> On 03/22/2013 11:56 AM, Will Huck wrote:
>>> Hi Rik,
>>> On 03/22/2013 11:52 AM, Rik van Riel wrote:
>>>> On 03/21/2013 08:05 PM, Will Huck wrote:
>>>>
>>>>> One offline question, how to understand this in function
>>>>> balance_pgdat:
>>>>> /*
>>>>> * Do some background aging of the anon list, to give
>>>>> * pages a chance to be referenced before reclaiming.
>>>>> */
>>>>> age_acitve_anon(zone, &sc);
>>>>
>>>> The anon lrus use a two-handed clock algorithm. New anonymous pages
>>>> start off on the active anon list. Older anonymous pages get moved
>>>> to the inactive anon list.
>>>
>>> The file lrus also use the two-handed clock algorithm, correct?
>>
>> After reinvestigate the codes, the answer is no. But why have this
>> difference? I think you are the expert for this question, expect your
>> explanation. :-)
>
> Anonymous memory has a smaller amount of memory (on the order
> of system memory), most of which is or has been in a working
> set at some point.
>
> File system cache tends to have two distinct sets. One part
> are the frequently accessed files, another part are the files
> that are accessed just once or twice.
>
> The file working set needs to be protected from streaming
> IO. We do this by having new file pages start out on the

Is there streaming IO workload or benchmark?

> inactive file list, and only promoted to the active file
> list if they get accessed twice.
>
>

2013-04-07 07:32:24

by Will Huck

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

Ping Rik.
On 04/05/2013 08:05 AM, Will Huck wrote:
> Hi Rik,
> On 03/22/2013 09:01 PM, Rik van Riel wrote:
>> On 03/22/2013 12:59 AM, Will Huck wrote:
>>> Hi Rik,
>>> On 03/22/2013 11:56 AM, Will Huck wrote:
>>>> Hi Rik,
>>>> On 03/22/2013 11:52 AM, Rik van Riel wrote:
>>>>> On 03/21/2013 08:05 PM, Will Huck wrote:
>>>>>
>>>>>> One offline question, how to understand this in function
>>>>>> balance_pgdat:
>>>>>> /*
>>>>>> * Do some background aging of the anon list, to give
>>>>>> * pages a chance to be referenced before reclaiming.
>>>>>> */
>>>>>> age_acitve_anon(zone, &sc);
>>>>>
>>>>> The anon lrus use a two-handed clock algorithm. New anonymous pages
>>>>> start off on the active anon list. Older anonymous pages get moved
>>>>> to the inactive anon list.
>>>>
>>>> The file lrus also use the two-handed clock algorithm, correct?
>>>
>>> After reinvestigate the codes, the answer is no. But why have this
>>> difference? I think you are the expert for this question, expect your
>>> explanation. :-)
>>
>> Anonymous memory has a smaller amount of memory (on the order
>> of system memory), most of which is or has been in a working
>> set at some point.
>>
>> File system cache tends to have two distinct sets. One part
>> are the frequently accessed files, another part are the files
>> that are accessed just once or twice.
>>
>> The file working set needs to be protected from streaming
>> IO. We do this by having new file pages start out on the
>
> Is there streaming IO workload or benchmark?
>
>> inactive file list, and only promoted to the active file
>> list if they get accessed twice.
>>
>>
>

2013-04-07 07:35:46

by Will Huck

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

cc Fengguang,
On 04/05/2013 08:05 AM, Will Huck wrote:
> Hi Rik,
> On 03/22/2013 09:01 PM, Rik van Riel wrote:
>> On 03/22/2013 12:59 AM, Will Huck wrote:
>>> Hi Rik,
>>> On 03/22/2013 11:56 AM, Will Huck wrote:
>>>> Hi Rik,
>>>> On 03/22/2013 11:52 AM, Rik van Riel wrote:
>>>>> On 03/21/2013 08:05 PM, Will Huck wrote:
>>>>>
>>>>>> One offline question, how to understand this in function
>>>>>> balance_pgdat:
>>>>>> /*
>>>>>> * Do some background aging of the anon list, to give
>>>>>> * pages a chance to be referenced before reclaiming.
>>>>>> */
>>>>>> age_acitve_anon(zone, &sc);
>>>>>
>>>>> The anon lrus use a two-handed clock algorithm. New anonymous pages
>>>>> start off on the active anon list. Older anonymous pages get moved
>>>>> to the inactive anon list.
>>>>
>>>> The file lrus also use the two-handed clock algorithm, correct?
>>>
>>> After reinvestigate the codes, the answer is no. But why have this
>>> difference? I think you are the expert for this question, expect your
>>> explanation. :-)
>>
>> Anonymous memory has a smaller amount of memory (on the order
>> of system memory), most of which is or has been in a working
>> set at some point.
>>
>> File system cache tends to have two distinct sets. One part
>> are the frequently accessed files, another part are the files
>> that are accessed just once or twice.
>>
>> The file working set needs to be protected from streaming
>> IO. We do this by having new file pages start out on the
>
> Is there streaming IO workload or benchmark?
>
>> inactive file list, and only promoted to the active file
>> list if they get accessed twice.
>>
>>
>

2013-04-09 06:52:49

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

Hello, Mel.
Sorry for too late question.

On Sun, Mar 17, 2013 at 01:04:14PM +0000, Mel Gorman wrote:
> If kswapd fails to make progress but continues to shrink slab then it'll
> either discard all of slab or consume CPU uselessly scanning shrinkers.
> This patch causes kswapd to only call the shrinkers once per priority.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 28 +++++++++++++++++++++-------
> 1 file changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7d5a932..84375b2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2661,9 +2661,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> */
> static bool kswapd_shrink_zone(struct zone *zone,
> struct scan_control *sc,
> - unsigned long lru_pages)
> + unsigned long lru_pages,
> + bool shrinking_slab)
> {
> - unsigned long nr_slab;
> + unsigned long nr_slab = 0;
> struct reclaim_state *reclaim_state = current->reclaim_state;
> struct shrink_control shrink = {
> .gfp_mask = sc->gfp_mask,
> @@ -2673,9 +2674,15 @@ static bool kswapd_shrink_zone(struct zone *zone,
> sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> shrink_zone(zone, sc);
>
> - reclaim_state->reclaimed_slab = 0;
> - nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> - sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> + /*
> + * Slabs are shrunk for each zone once per priority or if the zone
> + * being balanced is otherwise unreclaimable
> + */
> + if (shrinking_slab || !zone_reclaimable(zone)) {
> + reclaim_state->reclaimed_slab = 0;
> + nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> + sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> + }
>
> if (nr_slab == 0 && !zone_reclaimable(zone))
> zone->all_unreclaimable = 1;

Why is shrink_slab() called here?
I think that outside of the zone loop is a better place to run shrink_slab(),
because shrink_slab() is not directly related to a specific zone.

And this is a question not related to this patch.
Why is nr_slab used here to decide zone->all_unreclaimable?
nr_slab is not directly related to whether a specific zone is reclaimable
or not, and, moreover, nr_slab is not directly related to the number of
reclaimed pages. It just says that some objects in the system were freed.

This question comes from my ignorance, so please enlighten me.

Thanks.

> @@ -2713,6 +2720,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
> unsigned long nr_soft_reclaimed;
> unsigned long nr_soft_scanned;
> + bool shrinking_slab = true;
> struct scan_control sc = {
> .gfp_mask = GFP_KERNEL,
> .priority = DEF_PRIORITY,
> @@ -2861,7 +2869,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> * already being scanned that that high
> * watermark would be met at 100% efficiency.
> */
> - if (kswapd_shrink_zone(zone, &sc, lru_pages))
> + if (kswapd_shrink_zone(zone, &sc,
> + lru_pages, shrinking_slab))
> raise_priority = false;
>
> nr_to_reclaim += sc.nr_to_reclaim;
> @@ -2900,6 +2909,9 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> pfmemalloc_watermark_ok(pgdat))
> wake_up(&pgdat->pfmemalloc_wait);
>
> + /* Only shrink slab once per priority */
> + shrinking_slab = false;
> +
> /*
> * Fragmentation may mean that the system cannot be rebalanced
> * for high-order allocations in all zones. If twice the
> @@ -2925,8 +2937,10 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> * Raise priority if scanning rate is too low or there was no
> * progress in reclaiming pages
> */
> - if (raise_priority || !this_reclaimed)
> + if (raise_priority || !this_reclaimed) {
> sc.priority--;
> + shrinking_slab = true;
> + }
> } while (sc.priority >= 1 &&
> !pgdat_balanced(pgdat, order, *classzone_idx));
>
> --
> 1.8.1.4
>

2013-04-09 08:41:52

by Simon Jeons

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

Hi Joonsoo,
On 04/09/2013 02:53 PM, Joonsoo Kim wrote:
> Hello, Mel.
> Sorry for too late question.
>
> On Sun, Mar 17, 2013 at 01:04:14PM +0000, Mel Gorman wrote:
>> If kswapd fails to make progress but continues to shrink slab then it'll
>> either discard all of slab or consume CPU uselessly scanning shrinkers.
>> This patch causes kswapd to only call the shrinkers once per priority.
>>
>> Signed-off-by: Mel Gorman <[email protected]>
>> ---
>> mm/vmscan.c | 28 +++++++++++++++++++++-------
>> 1 file changed, 21 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 7d5a932..84375b2 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2661,9 +2661,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>> */
>> static bool kswapd_shrink_zone(struct zone *zone,
>> struct scan_control *sc,
>> - unsigned long lru_pages)
>> + unsigned long lru_pages,
>> + bool shrinking_slab)
>> {
>> - unsigned long nr_slab;
>> + unsigned long nr_slab = 0;
>> struct reclaim_state *reclaim_state = current->reclaim_state;
>> struct shrink_control shrink = {
>> .gfp_mask = sc->gfp_mask,
>> @@ -2673,9 +2674,15 @@ static bool kswapd_shrink_zone(struct zone *zone,
>> sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
>> shrink_zone(zone, sc);
>>
>> - reclaim_state->reclaimed_slab = 0;
>> - nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
>> - sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>> + /*
>> + * Slabs are shrunk for each zone once per priority or if the zone
>> + * being balanced is otherwise unreclaimable
>> + */
>> + if (shrinking_slab || !zone_reclaimable(zone)) {
>> + reclaim_state->reclaimed_slab = 0;
>> + nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
>> + sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>> + }
>>
>> if (nr_slab == 0 && !zone_reclaimable(zone))
>> zone->all_unreclaimable = 1;
> Why shrink_slab() is called here?
> I think that outside of zone loop is better place to run shrink_slab(),
> because shrink_slab() is not directly related to a specific zone.

True.

>
> And this is a question not related to this patch.
> Why nr_slab is used here to decide zone->all_unreclaimable?
> nr_slab is not directly related whether a specific zone is reclaimable
> or not, and, moreover, nr_slab is not directly related to number of
> reclaimed pages. It just say some objects in the system are freed.
>
> This question comes from my ignorance, so please enlighten me.

Good question, I also want to know. ;-)

>
> Thanks.
>
>> @@ -2713,6 +2720,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>> int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
>> unsigned long nr_soft_reclaimed;
>> unsigned long nr_soft_scanned;
>> + bool shrinking_slab = true;
>> struct scan_control sc = {
>> .gfp_mask = GFP_KERNEL,
>> .priority = DEF_PRIORITY,
>> @@ -2861,7 +2869,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>> * already being scanned that that high
>> * watermark would be met at 100% efficiency.
>> */
>> - if (kswapd_shrink_zone(zone, &sc, lru_pages))
>> + if (kswapd_shrink_zone(zone, &sc,
>> + lru_pages, shrinking_slab))
>> raise_priority = false;
>>
>> nr_to_reclaim += sc.nr_to_reclaim;
>> @@ -2900,6 +2909,9 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>> pfmemalloc_watermark_ok(pgdat))
>> wake_up(&pgdat->pfmemalloc_wait);
>>
>> + /* Only shrink slab once per priority */
>> + shrinking_slab = false;
>> +
>> /*
>> * Fragmentation may mean that the system cannot be rebalanced
>> * for high-order allocations in all zones. If twice the
>> @@ -2925,8 +2937,10 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>> * Raise priority if scanning rate is too low or there was no
>> * progress in reclaiming pages
>> */
>> - if (raise_priority || !this_reclaimed)
>> + if (raise_priority || !this_reclaimed) {
>> sc.priority--;
>> + shrinking_slab = true;
>> + }
>> } while (sc.priority >= 1 &&
>> !pgdat_balanced(pgdat, order, *classzone_idx));
>>
>> --
>> 1.8.1.4
>>

2013-04-09 11:24:56

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

On Tue, Apr 09, 2013 at 03:53:25PM +0900, Joonsoo Kim wrote:
> Hello, Mel.
> Sorry for too late question.
>

No need to apologise at all.

> On Sun, Mar 17, 2013 at 01:04:14PM +0000, Mel Gorman wrote:
> > If kswapd fails to make progress but continues to shrink slab then it'll
> > either discard all of slab or consume CPU uselessly scanning shrinkers.
> > This patch causes kswapd to only call the shrinkers once per priority.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/vmscan.c | 28 +++++++++++++++++++++-------
> > 1 file changed, 21 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 7d5a932..84375b2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2661,9 +2661,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> > */
> > static bool kswapd_shrink_zone(struct zone *zone,
> > struct scan_control *sc,
> > - unsigned long lru_pages)
> > + unsigned long lru_pages,
> > + bool shrinking_slab)
> > {
> > - unsigned long nr_slab;
> > + unsigned long nr_slab = 0;
> > struct reclaim_state *reclaim_state = current->reclaim_state;
> > struct shrink_control shrink = {
> > .gfp_mask = sc->gfp_mask,
> > @@ -2673,9 +2674,15 @@ static bool kswapd_shrink_zone(struct zone *zone,
> > sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> > shrink_zone(zone, sc);
> >
> > - reclaim_state->reclaimed_slab = 0;
> > - nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> > - sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > + /*
> > + * Slabs are shrunk for each zone once per priority or if the zone
> > + * being balanced is otherwise unreclaimable
> > + */
> > + if (shrinking_slab || !zone_reclaimable(zone)) {
> > + reclaim_state->reclaimed_slab = 0;
> > + nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> > + sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > + }
> >
> > if (nr_slab == 0 && !zone_reclaimable(zone))
> > zone->all_unreclaimable = 1;
>
> Why shrink_slab() is called here?

Preserves existing behaviour.

> I think that outside of zone loop is better place to run shrink_slab(),
> because shrink_slab() is not directly related to a specific zone.
>

This is true and has been the case for a long time. The slab shrinkers
are not zone aware and it is complicated by the fact that slab usage can
indirectly pin memory on other zones. Consider for example a slab object
such as an inode that was allocated from the Normal zone on a 32-bit
machine. Reclaiming it may free memory from the Highmem zone.

It's less obvious a problem on 64-bit machines but freeing slab objects
from a zone like DMA32 can indirectly free memory from the Normal zone or
even another node entirely.
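
To make that limitation concrete, this is roughly the shape of a shrinker
as it is registered in this era (a sketch only; my_cache_evict() and
my_cache_count() are made-up helpers standing in for a subsystem's own
object management). Note that the shrink_control passed in carries only a
gfp_mask and nr_to_scan - there is no zone or node to aim the scan at:

static int my_cache_shrink(struct shrinker *s, struct shrink_control *sc)
{
	if (sc->nr_to_scan)
		my_cache_evict(sc->nr_to_scan);	/* frees from a global list of
						 * objects, touching whatever
						 * zones they happen to pin */

	return my_cache_count();		/* objects still cached */
}

static struct shrinker my_cache_shrinker = {
	.shrink	= my_cache_shrink,
	.seeks	= DEFAULT_SEEKS,
};

/* registered once with register_shrinker(&my_cache_shrinker) */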

> And this is a question not related to this patch.
> Why nr_slab is used here to decide zone->all_unreclaimable?

Slab is not directly associated with a zone, but as reclaiming slab objects
can free memory from unpredictable zones we do not consider a zone to be
fully unreclaimable until we cannot shrink slab any more.

You may be thinking that this is extremely heavy handed and you're
right, it is.
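
For reference, the zone_reclaimable() check used above is only a coarse
scan-effort heuristic. As best I recall, in this kernel it reads roughly
as follows (comments mine):

static bool zone_reclaimable(struct zone *zone)
{
	/*
	 * Treat the zone as still reclaimable until it has been scanned
	 * six times over relative to the pages it could plausibly free.
	 */
	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}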

> nr_slab is not directly related whether a specific zone is reclaimable
> or not, and, moreover, nr_slab is not directly related to number of
> reclaimed pages. It just say some objects in the system are freed.
>

All true, it's the indirect relation between slab objects and the memory
that is freed when slab objects are reclaimed that has to be taken into
account.

> This question comes from my ignorance, so please enlighten me.
>

I hope this clarifies matters.

--
Mel Gorman
SUSE Labs

2013-04-10 01:07:40

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

On Tue, Apr 09, 2013 at 12:13:59PM +0100, Mel Gorman wrote:
> On Tue, Apr 09, 2013 at 03:53:25PM +0900, Joonsoo Kim wrote:
>
> > I think that outside of zone loop is better place to run shrink_slab(),
> > because shrink_slab() is not directly related to a specific zone.
> >
>
> This is true and has been the case for a long time. The slab shrinkers
> are not zone aware and it is complicated by the fact that slab usage can
> indirectly pin memory on other zones.
......
> > And this is a question not related to this patch.
> > Why nr_slab is used here to decide zone->all_unreclaimable?
>
> Slab is not directly associated with a slab but as reclaiming slab can
> free memory from unpredictable zones we do not consider a zone to be
> fully unreclaimable until we cannot shrink slab any more.

This is something the numa aware shrinkers will greatly help with -
instead of being a global shrink it becomes a
node-the-zone-belongs-to shrink, and so....

> You may be thinking that this is extremely heavy handed and you're
> right, it is.

... it is much less heavy handed than the current code...

> > nr_slab is not directly related whether a specific zone is reclaimable
> > or not, and, moreover, nr_slab is not directly related to number of
> > reclaimed pages. It just say some objects in the system are freed.
> >
>
> All true, it's the indirect relation between slab objects and the memory
> that is freed when slab objects are reclaimed that has to be taken into
> account.

Node awareness within the shrinker infrastructure and LRUs make the
relationship much more direct ;)
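
To give a rough idea of where that is heading (a sketch only; the exact
names may well change before it is merged): the shrink_control grows a node
id and the single callback is split into a count/scan pair, so a shrinker
can be asked to reclaim only from the node the zone belongs to. The
my_cache_* helpers below are made up:

static unsigned long my_cache_count(struct shrinker *s,
				    struct shrink_control *sc)
{
	/* made-up per-node helper */
	return my_cache_objects_on_node(sc->nid);
}

static unsigned long my_cache_scan(struct shrinker *s,
				   struct shrink_control *sc)
{
	/* free up to nr_to_scan objects from the LRU of node sc->nid,
	 * returning the number actually freed */
	return my_cache_evict_node(sc->nid, sc->nr_to_scan);
}

static struct shrinker my_cache_shrinker = {
	.count_objects	= my_cache_count,
	.scan_objects	= my_cache_scan,
	.seeks		= DEFAULT_SEEKS,
	.flags		= SHRINKER_NUMA_AWARE,
};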

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-04-10 05:21:03

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

Hello, Mel.

On Tue, Apr 09, 2013 at 12:13:59PM +0100, Mel Gorman wrote:
> On Tue, Apr 09, 2013 at 03:53:25PM +0900, Joonsoo Kim wrote:
> > Hello, Mel.
> > Sorry for too late question.
> >
>
> No need to apologise at all.
>
> > On Sun, Mar 17, 2013 at 01:04:14PM +0000, Mel Gorman wrote:
> > > If kswapd fails to make progress but continues to shrink slab then it'll
> > > either discard all of slab or consume CPU uselessly scanning shrinkers.
> > > This patch causes kswapd to only call the shrinkers once per priority.
> > >
> > > Signed-off-by: Mel Gorman <[email protected]>
> > > ---
> > > mm/vmscan.c | 28 +++++++++++++++++++++-------
> > > 1 file changed, 21 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 7d5a932..84375b2 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2661,9 +2661,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> > > */
> > > static bool kswapd_shrink_zone(struct zone *zone,
> > > struct scan_control *sc,
> > > - unsigned long lru_pages)
> > > + unsigned long lru_pages,
> > > + bool shrinking_slab)
> > > {
> > > - unsigned long nr_slab;
> > > + unsigned long nr_slab = 0;
> > > struct reclaim_state *reclaim_state = current->reclaim_state;
> > > struct shrink_control shrink = {
> > > .gfp_mask = sc->gfp_mask,
> > > @@ -2673,9 +2674,15 @@ static bool kswapd_shrink_zone(struct zone *zone,
> > > sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> > > shrink_zone(zone, sc);
> > >
> > > - reclaim_state->reclaimed_slab = 0;
> > > - nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> > > - sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > > + /*
> > > + * Slabs are shrunk for each zone once per priority or if the zone
> > > + * being balanced is otherwise unreclaimable
> > > + */
> > > + if (shrinking_slab || !zone_reclaimable(zone)) {
> > > + reclaim_state->reclaimed_slab = 0;
> > > + nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> > > + sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > > + }
> > >
> > > if (nr_slab == 0 && !zone_reclaimable(zone))
> > > zone->all_unreclaimable = 1;
> >
> > Why shrink_slab() is called here?
>
> Preserves existing behaviour.

Yes, but with this patch the existing behaviour is changed; that is, we call
shrink_slab() once per priority. Now there is no particular reason for this
function to be called here. How about separating it out and executing it
outside of the zone loop?

We can do it with another zone loop in order to decide
zone->all_unreclaimable. Below is pseudo code from a quick sketch:


for each zone
	shrink_zone()
end

nr_slab = shrink_slab()

if (nr_slab == 0) {
	for each zone
		if (!zone_reclaimable(zone))
			zone->all_unreclaimable = 1
	end
}

>
> > I think that outside of zone loop is better place to run shrink_slab(),
> > because shrink_slab() is not directly related to a specific zone.
> >
>
> This is true and has been the case for a long time. The slab shrinkers
> are not zone aware and it is complicated by the fact that slab usage can
> indirectly pin memory on other zones. Consider for example a slab object
> that is an inode entry that is allocated from the Normal zone on a
> 32-bit machine. Reclaiming may free memory from the Highmem zone.
>
> It's less obvious a problem on 64-bit machines but freeing slab objects
> from a zone like DMA32 can indirectly free memory from the Normal zone or
> even another node entirely.
>
> > And this is a question not related to this patch.
> > Why nr_slab is used here to decide zone->all_unreclaimable?
>
> Slab is not directly associated with a slab but as reclaiming slab can
> free memory from unpredictable zones we do not consider a zone to be
> fully unreclaimable until we cannot shrink slab any more.
>
> You may be thinking that this is extremely heavy handed and you're
> right, it is.
>
> > nr_slab is not directly related whether a specific zone is reclaimable
> > or not, and, moreover, nr_slab is not directly related to number of
> > reclaimed pages. It just say some objects in the system are freed.
> >
>
> All true, it's the indirect relation between slab objects and the memory
> that is freed when slab objects are reclaimed that has to be taken into
> account.
>
> > This question comes from my ignorance, so please enlighten me.
> >
>
> I hope this clarifies matters.

Very helpful :)

Thanks.

>
> --
> Mel Gorman
> SUSE Labs
>

2013-04-10 05:22:44

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

Hello, Dave.

On Wed, Apr 10, 2013 at 11:07:34AM +1000, Dave Chinner wrote:
> On Tue, Apr 09, 2013 at 12:13:59PM +0100, Mel Gorman wrote:
> > On Tue, Apr 09, 2013 at 03:53:25PM +0900, Joonsoo Kim wrote:
> >
> > > I think that outside of zone loop is better place to run shrink_slab(),
> > > because shrink_slab() is not directly related to a specific zone.
> > >
> >
> > This is true and has been the case for a long time. The slab shrinkers
> > are not zone aware and it is complicated by the fact that slab usage can
> > indirectly pin memory on other zones.
> ......
> > > And this is a question not related to this patch.
> > > Why nr_slab is used here to decide zone->all_unreclaimable?
> >
> > Slab is not directly associated with a slab but as reclaiming slab can
> > free memory from unpredictable zones we do not consider a zone to be
> > fully unreclaimable until we cannot shrink slab any more.
>
> This is something the numa aware shrinkers will greatly help with -
> instead of being a global shrink it becomes a
> node-the-zone-belongs-to shrink, and so....
>
> > You may be thinking that this is extremely heavy handed and you're
> > right, it is.
>
> ... it is much less heavy handed than the current code...
>
> > > nr_slab is not directly related whether a specific zone is reclaimable
> > > or not, and, moreover, nr_slab is not directly related to number of
> > > reclaimed pages. It just say some objects in the system are freed.
> > >
> >
> > All true, it's the indirect relation between slab objects and the memory
> > that is freed when slab objects are reclaimed that has to be taken into
> > account.
>
> Node awareness within the shrinker infrastructure and LRUs make the
> relationship much more direct ;)

Yes, I think so ;)

Thanks.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>

2013-04-11 05:54:34

by Will Huck

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

Hi Rik,
On 03/22/2013 11:52 AM, Rik van Riel wrote:
> On 03/21/2013 08:05 PM, Will Huck wrote:
>
>> One offline question, how to understand this in function balance_pgdat:
>> /*
>> * Do some background aging of the anon list, to give
>> * pages a chance to be referenced before reclaiming.
>> */
>> age_active_anon(zone, &sc);
>
> The anon lrus use a two-handed clock algorithm. New anonymous pages

Why is this algorithm described as a two-handed clock?

> start off on the active anon list. Older anonymous pages get moved
> to the inactive anon list.
>
> If they get referenced before they reach the end of the inactive anon
> list, they get moved back to the active list.
>
> If we need to swap something out and find a non-referenced page at the
> end of the inactive anon list, we will swap it out.
>
> In order to make good pageout decisions, pages need to stay on the
> inactive anon list for a longer time, so they have plenty of time to
> get referenced, before the reclaim code looks at them.
>
> To achieve that, we will move some active anon pages to the inactive
> anon list even when we do not want to swap anything out - as long as
> the inactive anon list is below its target size.
>
> Does that make sense?
>
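
For context, the function being asked about is age_active_anon() in
balance_pgdat(). As best I can tell from the v3.9 source it looks roughly
like this (comments mine, details possibly inexact):

static void age_active_anon(struct zone *zone, struct scan_control *sc)
{
	struct mem_cgroup *memcg;

	if (!total_swap_pages)
		return;		/* no swap, so aging anon pages is pointless */

	memcg = mem_cgroup_iter(NULL, NULL, NULL);
	do {
		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);

		/*
		 * If the inactive anon list is below its target size,
		 * deactivate a batch of active anon pages so they have
		 * time to be referenced before reclaim reaches them.
		 */
		if (inactive_anon_is_low(lruvec))
			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
					   sc, LRU_ACTIVE_ANON);

		memcg = mem_cgroup_iter(NULL, memcg, NULL);
	} while (memcg);
}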

2013-04-11 05:58:46

by Will Huck

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

Hi Rik,
On 03/22/2013 11:52 AM, Rik van Riel wrote:
> On 03/21/2013 08:05 PM, Will Huck wrote:
>
>> One offline question, how to understand this in function balance_pgdat:
>> /*
>> * Do some background aging of the anon list, to give
>> * pages a chance to be referenced before reclaiming.
>> */
>> age_active_anon(zone, &sc);
>
> The anon lrus use a two-handed clock algorithm. New anonymous pages
> start off on the active anon list. Older anonymous pages get moved
> to the inactive anon list.

The downside of the page cache use-once replacement algorithm is
inter-reference distance, correct? Does it have any other downsides?
What's the downside of the two-handed clock algorithm against anonymous pages?

>
> If they get referenced before they reach the end of the inactive anon
> list, they get moved back to the active list.
>
> If we need to swap something out and find a non-referenced page at the
> end of the inactive anon list, we will swap it out.
>
> In order to make good pageout decisions, pages need to stay on the
> inactive anon list for a longer time, so they have plenty of time to
> get referenced, before the reclaim code looks at them.
>
> To achieve that, we will move some active anon pages to the inactive
> anon list even when we do not want to swap anything out - as long as
> the inactive anon list is below its target size.
>
> Does that make sense?
>

2013-04-11 09:53:33

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

On Wed, Apr 10, 2013 at 11:07:34AM +1000, Dave Chinner wrote:
> On Tue, Apr 09, 2013 at 12:13:59PM +0100, Mel Gorman wrote:
> > On Tue, Apr 09, 2013 at 03:53:25PM +0900, Joonsoo Kim wrote:
> >
> > > I think that outside of zone loop is better place to run shrink_slab(),
> > > because shrink_slab() is not directly related to a specific zone.
> > >
> >
> > This is true and has been the case for a long time. The slab shrinkers
> > are not zone aware and it is complicated by the fact that slab usage can
> > indirectly pin memory on other zones.
> ......
> > > And this is a question not related to this patch.
> > > Why nr_slab is used here to decide zone->all_unreclaimable?
> >
> > Slab is not directly associated with a slab but as reclaiming slab can
> > free memory from unpredictable zones we do not consider a zone to be
> > fully unreclaimable until we cannot shrink slab any more.
>
> This is something the numa aware shrinkers will greatly help with -
> instead of being a global shrink it becomes a
> node-the-zone-belongs-to shrink, and so....
>

Yes, 100% agreed.

--
Mel Gorman
SUSE Labs

2013-04-11 10:01:20

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

On Wed, Apr 10, 2013 at 02:21:42PM +0900, Joonsoo Kim wrote:
> > > > @@ -2673,9 +2674,15 @@ static bool kswapd_shrink_zone(struct zone *zone,
> > > > sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> > > > shrink_zone(zone, sc);
> > > >
> > > > - reclaim_state->reclaimed_slab = 0;
> > > > - nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> > > > - sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > > > + /*
> > > > + * Slabs are shrunk for each zone once per priority or if the zone
> > > > + * being balanced is otherwise unreclaimable
> > > > + */
> > > > + if (shrinking_slab || !zone_reclaimable(zone)) {
> > > > + reclaim_state->reclaimed_slab = 0;
> > > > + nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> > > > + sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > > > + }
> > > >
> > > > if (nr_slab == 0 && !zone_reclaimable(zone))
> > > > zone->all_unreclaimable = 1;
> > >
> > > Why shrink_slab() is called here?
> >
> > Preserves existing behaviour.
>
> Yes, but, with this patch, existing behaviour is changed, that is, we call
> shrink_slab() once per priority. For now, there is no reason this function
> is called here. How about separating it and executing it outside of
> zone loop?
>

We are calling it fewer times but it is still receiving the same information
from sc->nr_scanned that it received before. With the change you are suggesting
it would be necessary to accumulate sc->nr_scanned for each zone shrunk
and then pass the sum to shrink_slab() once per priority. While this is not
necessarily wrong, there is little or no motivation to alter the shrinkers
in this manner in this series.
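
Concretely, the restructuring being suggested would look something like the
sketch below, where for_each_zone_to_balance() is just a stand-in for the
existing zone loop in balance_pgdat() (a hypothetical sketch, not a patch
in this series):

	unsigned long nr_scanned_total = 0;

	for_each_zone_to_balance(zone) {
		shrink_zone(zone, &sc);
		nr_scanned_total += sc.nr_scanned;
	}

	/* one slab shrink per priority, fed the accumulated scan count */
	reclaim_state->reclaimed_slab = 0;
	nr_slab = shrink_slab(&shrink, nr_scanned_total, lru_pages);
	sc.nr_reclaimed += reclaim_state->reclaimed_slab;

	if (nr_slab == 0) {
		for_each_zone_to_balance(zone) {
			if (!zone_reclaimable(zone))
				zone->all_unreclaimable = 1;
		}
	}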

--
Mel Gorman
SUSE Labs

2013-04-11 10:29:46

by Ric Mason

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

Hi Mel,
On 04/11/2013 06:01 PM, Mel Gorman wrote:
> On Wed, Apr 10, 2013 at 02:21:42PM +0900, Joonsoo Kim wrote:
>>>>> @@ -2673,9 +2674,15 @@ static bool kswapd_shrink_zone(struct zone *zone,
>>>>> sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
>>>>> shrink_zone(zone, sc);
>>>>>
>>>>> - reclaim_state->reclaimed_slab = 0;
>>>>> - nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
>>>>> - sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>>>>> + /*
>>>>> + * Slabs are shrunk for each zone once per priority or if the zone
>>>>> + * being balanced is otherwise unreclaimable
>>>>> + */
>>>>> + if (shrinking_slab || !zone_reclaimable(zone)) {
>>>>> + reclaim_state->reclaimed_slab = 0;
>>>>> + nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
>>>>> + sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>>>>> + }
>>>>>
>>>>> if (nr_slab == 0 && !zone_reclaimable(zone))
>>>>> zone->all_unreclaimable = 1;
>>>> Why shrink_slab() is called here?
>>> Preserves existing behaviour.
>> Yes, but, with this patch, existing behaviour is changed, that is, we call
>> shrink_slab() once per priority. For now, there is no reason this function
>> is called here. How about separating it and executing it outside of
>> zone loop?
>>
> We are calling it fewer times but it's still receiving the same information
> from sc->nr_scanned it received before. With the change you are suggesting
> it would be necessary to accumulating sc->nr_scanned for each zone shrunk
> and then pass the sum to shrink_slab() once per priority. While this is not
> necessarily wrong, there is little or no motivation to alter the shrinkers
> in this manner in this series.

Why would the result not be the same?

2013-04-12 05:46:30

by Ric Mason

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

Ping Rik, I also want to know the answer. ;-)
On 04/11/2013 01:58 PM, Will Huck wrote:
> Hi Rik,
> On 03/22/2013 11:52 AM, Rik van Riel wrote:
>> On 03/21/2013 08:05 PM, Will Huck wrote:
>>
>>> One offline question, how to understand this in function balance_pgdat:
>>> /*
>>> * Do some background aging of the anon list, to give
>>> * pages a chance to be referenced before reclaiming.
>>> */
>>> age_active_anon(zone, &sc);
>>
>> The anon lrus use a two-handed clock algorithm. New anonymous pages
>> start off on the active anon list. Older anonymous pages get moved
>> to the inactive anon list.
>
> The downside of page cache use-once replacement algorithm is
> inter-reference distance, corret? Does it have any other downside?
> What's the downside of two-handed clock algorithm against anonymous
> pages?
>
>>
>> If they get referenced before they reach the end of the inactive anon
>> list, they get moved back to the active list.
>>
>> If we need to swap something out and find a non-referenced page at the
>> end of the inactive anon list, we will swap it out.
>>
>> In order to make good pageout decisions, pages need to stay on the
>> inactive anon list for a longer time, so they have plenty of time to
>> get referenced, before the reclaim code looks at them.
>>
>> To achieve that, we will move some active anon pages to the inactive
>> anon list even when we do not want to swap anything out - as long as
>> the inactive anon list is below its target size.
>>
>> Does that make sense?
>>
>

2013-04-12 09:34:29

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On Fri, Apr 12, 2013 at 01:46:22PM +0800, Ric Mason wrote:
> Ping Rik, I also want to know the answer. ;-)

This question, like a *lot* of list traffic recently, is a "how long is a
piece of string" question, with hints that it is an important question but
really it is just going to waste a developer's time because the question
lacks any relevant meaning. The Inter-Reference Distance (IRD) is mentioned
as a problem but no context is given as to why it is perceived as a problem.
IRD is the distance in time or events between two references to the same
page and is a function of the workload and an arbitrary page, not of the
page reclaim algorithm. A page reclaim algorithm may take IRD into account
but IRD is not and cannot be a "problem", so the framing of the question is
already confusing.

Furthermore, the upsides and downsides of any given page reclaim algorithm
are complex but in most cases are discussed in the academic papers describing
them. People who are interested need to research and read these papers
and then see how they might apply to the algorithm implemented in Linux, or
alternatively investigate which important workloads Linux treats badly
and address the problem. The result of such research (and patches)
is then a relevant discussion.

This question asks what the "downside" is versus anonymous pages. To me
the question lacks any meaning because what does it mean for a page reclaim
algorithm to work "against" anonymous pages? As the question lacks meaning,
answering it is impossible; it effectively asks a developer to write a small
paper to discover the meaning of the question before then answering it.

I do not speak for Rik but I at least am ignoring most of these questions
because there is not enough time in the day already. Pings are not
likely to change my opinion.

--
Mel Gorman
SUSE Labs

2013-04-12 13:42:19

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

On 04/12/2013 05:34 AM, Mel Gorman wrote:

> I do not speak for Rik but I at least am ignoring most of these questions
> because there is not enough time in the day already. Pings are not
> likely to change my opinion.

Same here. Some questions are just so lacking in context
that trying to answer them is unlikely to help anyone.

--
All rights reversed