2013-05-13 08:12:48

by Mel Gorman

Subject: [PATCH 0/9] Reduce system disruption due to kswapd V4

This series does not fix all the currently known problems with reclaim but
it addresses one important swapping bug when there is background IO.

Changelog since V3
o Drop the slab shrink changes in light of Glauber's series and the
discussions that highlighted a number of potential problems with the
patch. (mel)
o Rebased to 3.10-rc1

Changelog since V2
o Preserve ratio properly for proportional scanning (kamezawa)

Changelog since V1
o Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY (andi)
o Reformat comment in shrink_page_list (andi)
o Clarify some comments (dhillf)
o Rework how the proportional scanning is preserved
o Add PageReclaim check before kswapd starts writeback
o Reset sc.nr_reclaimed on every full zone scan

Kswapd and page reclaim behaviour has been screwy in one way or another
for a long time. Very broadly speaking, it worked in the far past because
machines were limited in memory so it did not have that many pages to scan
and it stalled in congestion_wait() frequently to prevent it going
completely nuts. In recent times it has behaved very unsatisfactorily,
with some of the problems compounded by the removal of the stall logic and
the introduction of transparent hugepage support with its high-order
reclaims.

There are many variations of bugs that are rooted in this area. One example
is reports of large copy operations or backups causing the machine to
grind to a halt or applications being pushed to swap. Sometimes in low
memory situations a large percentage of memory suddenly gets reclaimed. In
other cases an application starts and kswapd hits 100% CPU usage for
prolonged periods of time, and so on. There is now talk of introducing
features like an extra free kbytes tunable to work around aspects of the
problem instead of trying to deal with it. This is compounded by the fact
that the behaviour can be very workload and machine specific.

This series aims at addressing some of the worst of these problems without
attempting to fundamentally alter how page reclaim works.

Patches 1-2 limit the number of pages kswapd reclaims while still obeying
the anon/file proportion of the LRUs it should be scanning.

Patches 3-4 control how and when kswapd raises its scanning priority and
deletes the scanning restart logic which is tricky to follow.

Patch 5 notes that it is too easy for kswapd to reach priority 0 when
scanning and then reclaim the world. Down with that sort of thing.

Patch 6 notes that kswapd starts writeback based on scanning priority,
which is not necessarily related to dirty pages. It will have kswapd
write back pages if a number of unqueued dirty pages have been
recently encountered at the tail of the LRU.

Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
to reduce LRU churn and the likelihood that it'll reclaim young
clean pages or push applications to swap. It will cause kswapd
to block on IO if it detects that pages being reclaimed under
writeback are recycling through the LRU before the IO completes.

Patches 8-9 are cosmetic but balance_pgdat() is easier to follow after they
are applied.

This was tested using memcached+memcachetest while some background IO
was in progress as implemented by the parallel IO tests in MM
Tests. memcachetest benchmarks how many operations/second memcached can
service and it is run multiple times. It starts with no background IO and
then re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress. The expectation is that the IO should
have little or no impact on memcachetest which is running entirely in memory.

3.10.0-rc1 3.10.0-rc1
vanilla lessdisrupt-v4
Ops memcachetest-0M 22155.00 ( 0.00%) 22180.00 ( 0.11%)
Ops memcachetest-715M 22720.00 ( 0.00%) 22355.00 ( -1.61%)
Ops memcachetest-2385M 3939.00 ( 0.00%) 23450.00 (495.33%)
Ops memcachetest-4055M 3628.00 ( 0.00%) 24341.00 (570.92%)
Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops io-duration-715M 12.00 ( 0.00%) 7.00 ( 41.67%)
Ops io-duration-2385M 118.00 ( 0.00%) 21.00 ( 82.20%)
Ops io-duration-4055M 162.00 ( 0.00%) 36.00 ( 77.78%)
Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-715M 140134.00 ( 0.00%) 18.00 ( 99.99%)
Ops swaptotal-2385M 392438.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-4055M 449037.00 ( 0.00%) 27864.00 ( 93.79%)
Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-715M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-2385M 148031.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-4055M 135109.00 ( 0.00%) 0.00 ( 0.00%)
Ops minorfaults-0M 1529984.00 ( 0.00%) 1530235.00 ( -0.02%)
Ops minorfaults-715M 1794168.00 ( 0.00%) 1613750.00 ( 10.06%)
Ops minorfaults-2385M 1739813.00 ( 0.00%) 1609396.00 ( 7.50%)
Ops minorfaults-4055M 1754460.00 ( 0.00%) 1614810.00 ( 7.96%)
Ops majorfaults-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops majorfaults-715M 185.00 ( 0.00%) 180.00 ( 2.70%)
Ops majorfaults-2385M 24472.00 ( 0.00%) 101.00 ( 99.59%)
Ops majorfaults-4055M 22302.00 ( 0.00%) 229.00 ( 98.97%)

Note how the vanilla kernel's performance collapses when there is enough
IO taking place in the background. This drop in performance is part of
what users complain about when they start backups. Note how the swapin and
major fault figures indicate that processes were being pushed to swap
prematurely. With the series applied, there is no noticeable performance
drop and, while there is still some swap activity, it's tiny.

3.10.0-rc1 3.10.0-rc1
vanilla lessdisrupt-v4
Page Ins 1234608 101892
Page Outs 12446272 11810468
Swap Ins 283406 0
Swap Outs 698469 27882
Direct pages scanned 0 136480
Kswapd pages scanned 6266537 5369364
Kswapd pages reclaimed 1088989 930832
Direct pages reclaimed 0 120901
Kswapd efficiency 17% 17%
Kswapd velocity 5398.371 4635.115
Direct efficiency 100% 88%
Direct velocity 0.000 117.817
Percentage direct scans 0% 2%
Page writes by reclaim 1655843 4009929
Page writes file 957374 3982047
Page writes anon 698469 27882
Page reclaim immediate 5245 1745
Page rescued immediate 0 0
Slabs scanned 33664 25216
Direct inode steals 0 0
Kswapd inode steals 19409 778
Kswapd skipped wait 0 0
THP fault alloc 35 30
THP collapse alloc 472 401
THP splits 27 22
THP fault fallback 0 0
THP collapse fail 0 1
Compaction stalls 0 4
Compaction success 0 0
Compaction failures 0 4
Page migrate success 0 0
Page migrate failure 0 0
Compaction pages isolated 0 0
Compaction migrate scanned 0 0
Compaction free scanned 0 0
Compaction cost 0 0
NUMA PTE updates 0 0
NUMA hint faults 0 0
NUMA hint local faults 0 0
NUMA pages migrated 0 0
AutoNUMA cost 0 0

Unfortunately, there is a small amount of direct reclaim due to kswapd
no longer reclaiming the world. ftrace indicates that the direct reclaim
stalls are mostly harmless, with the vast bulk of the stalls incurred
by dd:

23 tclsh-3367
38 memcachetest-13733
49 memcachetest-12443
57 tee-3368
1541 dd-13826
1981 dd-12539

A consequence of the direct reclaim for dd is that the processes for the
IO workload may show higher system CPU usage. There is also a risk that
kswapd not reclaiming the world means it stays awake balancing zones,
does not stall on the appropriate events and continually scans pages it
cannot reclaim, consuming CPU. This will be visible as continued
high CPU usage but in my own tests I only saw a single spike lasting less
than a second and I did not observe any problems related to reclaim while
running the series on my desktop.

include/linux/mmzone.h | 17 ++
mm/vmscan.c | 450 ++++++++++++++++++++++++++++++-------------------
2 files changed, 296 insertions(+), 171 deletions(-)

--
1.8.1.4


2013-05-13 08:12:57

by Mel Gorman

Subject: [PATCH 9/9] mm: vmscan: Move logic from balance_pgdat() to kswapd_shrink_zone()

balance_pgdat() is very long and some of the logic can and should
be internal to kswapd_shrink_zone(). Move it so the flow of
balance_pgdat() is marginally easier to follow.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 110 +++++++++++++++++++++++++++++-------------------------------
1 file changed, 54 insertions(+), 56 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e65fe46..0ba9d3a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2705,18 +2705,53 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
* This is used to determine if the scanning priority needs to be raised.
*/
static bool kswapd_shrink_zone(struct zone *zone,
+ int classzone_idx,
struct scan_control *sc,
unsigned long lru_pages,
unsigned long *nr_attempted)
{
unsigned long nr_slab;
+ int testorder = sc->order;
+ unsigned long balance_gap;
struct reclaim_state *reclaim_state = current->reclaim_state;
struct shrink_control shrink = {
.gfp_mask = sc->gfp_mask,
};
+ bool lowmem_pressure;

/* Reclaim above the high watermark. */
sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
+
+ /*
+ * Kswapd reclaims only single pages with compaction enabled. Trying
+ * too hard to reclaim until contiguous free pages have become
+ * available can hurt performance by evicting too much useful data
+ * from memory. Do not reclaim more than needed for compaction.
+ */
+ if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
+ compaction_suitable(zone, sc->order) !=
+ COMPACT_SKIPPED)
+ testorder = 0;
+
+ /*
+ * We put equal pressure on every zone, unless one zone has way too
+ * many pages free already. The "too many pages" is defined as the
+ * high wmark plus a "gap" where the gap is either the low
+ * watermark or 1% of the zone, whichever is smaller.
+ */
+ balance_gap = min(low_wmark_pages(zone),
+ (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
+ KSWAPD_ZONE_BALANCE_GAP_RATIO);
+
+ /*
+ * If there is no low memory pressure or the zone is balanced then no
+ * reclaim is necessary
+ */
+ lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
+ if (!lowmem_pressure && zone_balanced(zone, testorder,
+ balance_gap, classzone_idx))
+ return true;
+
shrink_zone(zone, sc);

reclaim_state->reclaimed_slab = 0;
@@ -2731,6 +2766,18 @@ static bool kswapd_shrink_zone(struct zone *zone,

zone_clear_flag(zone, ZONE_WRITEBACK);

+ /*
+ * If a zone reaches its high watermark, consider it to be no longer
+ * congested. It's possible there are dirty pages backed by congested
+ * BDIs but as pressure is relieved, speculatively avoid congestion
+ * waits.
+ */
+ if (!zone->all_unreclaimable &&
+ zone_balanced(zone, testorder, 0, classzone_idx)) {
+ zone_clear_flag(zone, ZONE_CONGESTED);
+ zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
+ }
+
return sc->nr_scanned >= sc->nr_to_reclaim;
}

@@ -2866,8 +2913,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
*/
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
- int testorder;
- unsigned long balance_gap;

if (!populated_zone(zone))
continue;
@@ -2888,61 +2933,14 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
sc.nr_reclaimed += nr_soft_reclaimed;

/*
- * We put equal pressure on every zone, unless
- * one zone has way too many pages free
- * already. The "too many pages" is defined
- * as the high wmark plus a "gap" where the
- * gap is either the low watermark or 1%
- * of the zone, whichever is smaller.
- */
- balance_gap = min(low_wmark_pages(zone),
- (zone->managed_pages +
- KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
- KSWAPD_ZONE_BALANCE_GAP_RATIO);
- /*
- * Kswapd reclaims only single pages with compaction
- * enabled. Trying too hard to reclaim until contiguous
- * free pages have become available can hurt performance
- * by evicting too much useful data from memory.
- * Do not reclaim more than needed for compaction.
+ * There should be no need to raise the scanning
+ * priority if enough pages are already being scanned
+ * that that high watermark would be met at 100%
+ * efficiency.
*/
- testorder = order;
- if (IS_ENABLED(CONFIG_COMPACTION) && order &&
- compaction_suitable(zone, order) !=
- COMPACT_SKIPPED)
- testorder = 0;
-
- if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
- !zone_balanced(zone, testorder,
- balance_gap, end_zone)) {
- /*
- * There should be no need to raise the
- * scanning priority if enough pages are
- * already being scanned that high
- * watermark would be met at 100% efficiency.
- */
- if (kswapd_shrink_zone(zone, &sc, lru_pages,
- &nr_attempted))
- raise_priority = false;
- }
-
- if (zone->all_unreclaimable) {
- if (end_zone && end_zone == i)
- end_zone--;
- continue;
- }
-
- if (zone_balanced(zone, testorder, 0, end_zone))
- /*
- * If a zone reaches its high watermark,
- * consider it to be no longer congested. It's
- * possible there are dirty pages backed by
- * congested BDIs but as pressure is relieved,
- * speculatively avoid congestion waits
- * or writing pages from kswapd context.
- */
- zone_clear_flag(zone, ZONE_CONGESTED);
- zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
+ if (kswapd_shrink_zone(zone, end_zone, &sc,
+ lru_pages, &nr_attempted))
+ raise_priority = false;
}

/*
--
1.8.1.4

2013-05-13 08:12:53

by Mel Gorman

Subject: [PATCH 6/9] mm: vmscan: Have kswapd writeback pages based on dirty pages encountered, not priority

Currently kswapd queues dirty pages for writeback if it is scanning at an
elevated priority, but the priority kswapd scans at is not related to the
number of unqueued dirty pages encountered. Since commit "mm: vmscan:
Flatten kswapd priority loop", the priority is related to the size of the
LRU and the zone watermark, which is no indication of whether kswapd
should write pages or not.

This patch tracks whether an excessive number of unqueued dirty pages are
being encountered at the tail of the LRU. If so, it indicates that dirty
pages are being recycled before flusher threads can clean them and flags
the zone so that kswapd will start writing pages until the zone is
balanced.
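
As a rough illustration of what counts as "excessive": the check added to
shrink_inactive_list() below flags the zone when the unqueued dirty pages
in an isolated batch reach nr_taken >> (DEF_PRIORITY - priority). The
following is a minimal userspace sketch of that threshold, not kernel
code; the function name and batch sizes are made up, only DEF_PRIORITY
(12) and the shift match the patch.

#include <stdio.h>
#include <stdbool.h>

#define DEF_PRIORITY 12

/* Invented helper mirroring the threshold used in the patch below */
static bool flag_zone_dirty(unsigned long nr_taken,
			    unsigned long nr_unqueued_dirty, int priority)
{
	return nr_unqueued_dirty &&
	       nr_unqueued_dirty >= (nr_taken >> (DEF_PRIORITY - priority));
}

int main(void)
{
	/* At DEF_PRIORITY, every isolated page must be dirty to trip it */
	printf("%d\n", flag_zone_dirty(32, 31, DEF_PRIORITY));    /* 0 */
	printf("%d\n", flag_zone_dirty(32, 32, DEF_PRIORITY));    /* 1 */
	/* Two priority levels later, a quarter of the batch is enough */
	printf("%d\n", flag_zone_dirty(32, 8, DEF_PRIORITY - 2)); /* 1 */
	return 0;
}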

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
include/linux/mmzone.h | 9 +++++++++
mm/vmscan.c | 31 +++++++++++++++++++++++++------
2 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5c76737..2aaf72f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -495,6 +495,10 @@ typedef enum {
ZONE_CONGESTED, /* zone has many dirty pages backed by
* a congested BDI
*/
+ ZONE_TAIL_LRU_DIRTY, /* reclaim scanning has recently found
+ * many dirty file pages at the tail
+ * of the LRU.
+ */
} zone_flags_t;

static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -517,6 +521,11 @@ static inline int zone_is_reclaim_congested(const struct zone *zone)
return test_bit(ZONE_CONGESTED, &zone->flags);
}

+static inline int zone_is_reclaim_dirty(const struct zone *zone)
+{
+ return test_bit(ZONE_TAIL_LRU_DIRTY, &zone->flags);
+}
+
static inline int zone_is_reclaim_locked(const struct zone *zone)
{
return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1505c57..d6c916d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -676,13 +676,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
struct zone *zone,
struct scan_control *sc,
enum ttu_flags ttu_flags,
- unsigned long *ret_nr_dirty,
+ unsigned long *ret_nr_unqueued_dirty,
unsigned long *ret_nr_writeback,
bool force_reclaim)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
int pgactivate = 0;
+ unsigned long nr_unqueued_dirty = 0;
unsigned long nr_dirty = 0;
unsigned long nr_congested = 0;
unsigned long nr_reclaimed = 0;
@@ -808,14 +809,17 @@ static unsigned long shrink_page_list(struct list_head *page_list,
if (PageDirty(page)) {
nr_dirty++;

+ if (!PageWriteback(page))
+ nr_unqueued_dirty++;
+
/*
* Only kswapd can writeback filesystem pages to
- * avoid risk of stack overflow but do not writeback
- * unless under significant pressure.
+ * avoid risk of stack overflow but only writeback
+ * if many dirty pages have been encountered.
*/
if (page_is_file_cache(page) &&
(!current_is_kswapd() ||
- sc->priority >= DEF_PRIORITY - 2)) {
+ !zone_is_reclaim_dirty(zone))) {
/*
* Immediately reclaim when written back.
* Similar in principal to deactivate_page()
@@ -960,7 +964,7 @@ keep:
list_splice(&ret_pages, page_list);
count_vm_events(PGACTIVATE, pgactivate);
mem_cgroup_uncharge_end();
- *ret_nr_dirty += nr_dirty;
+ *ret_nr_unqueued_dirty += nr_unqueued_dirty;
*ret_nr_writeback += nr_writeback;
return nr_reclaimed;
}
@@ -1373,6 +1377,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
(nr_taken >> (DEF_PRIORITY - sc->priority)))
wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

+ /*
+ * Similarly, if many dirty pages are encountered that are not
+ * currently being written then flag that kswapd should start
+ * writing back pages.
+ */
+ if (global_reclaim(sc) && nr_dirty &&
+ nr_dirty >= (nr_taken >> (DEF_PRIORITY - sc->priority)))
+ zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY);
+
trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
zone_idx(zone),
nr_scanned, nr_reclaimed,
@@ -2769,8 +2782,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
end_zone = i;
break;
} else {
- /* If balanced, clear the congested flag */
+ /*
+ * If balanced, clear the dirty and congested
+ * flags
+ */
zone_clear_flag(zone, ZONE_CONGESTED);
+ zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
}
}

@@ -2888,8 +2905,10 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
* possible there are dirty pages backed by
* congested BDIs but as pressure is relieved,
* speculatively avoid congestion waits
+ * or writing pages from kswapd context.
*/
zone_clear_flag(zone, ZONE_CONGESTED);
+ zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
}

/*
--
1.8.1.4

2013-05-13 08:13:32

by Mel Gorman

Subject: [PATCH 8/9] mm: vmscan: Check if kswapd should writepage once per pgdat scan

Currently kswapd checks if it should start writepage as it shrinks
each zone, without taking into consideration whether the zone is balanced
or not. This is not wrong as such but it does not make much sense either.
This patch checks once per pgdat scan whether kswapd should be writing
pages.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 911c9cd..e65fe46 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2849,6 +2849,13 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
}

/*
+ * If we're getting trouble reclaiming, start doing writepage
+ * even in laptop mode.
+ */
+ if (sc.priority < DEF_PRIORITY - 2)
+ sc.may_writepage = 1;
+
+ /*
* Now scan the zone in the dma->highmem direction, stopping
* at the last zone which needs scanning.
*
@@ -2919,13 +2926,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
raise_priority = false;
}

- /*
- * If we're getting trouble reclaiming, start doing
- * writepage even in laptop mode.
- */
- if (sc.priority < DEF_PRIORITY - 2)
- sc.may_writepage = 1;
-
if (zone->all_unreclaimable) {
if (end_zone && end_zone == i)
end_zone--;
--
1.8.1.4

2013-05-13 08:13:55

by Mel Gorman

Subject: [PATCH 7/9] mm: vmscan: Block kswapd if it is encountering pages under writeback

Historically, kswapd used to congestion_wait() at higher priorities if it
was not making forward progress. This made no sense as the failure to make
progress could be completely independent of IO. It was later replaced by
wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't
wait on congested zones in balance_pgdat()) as it was duplicating logic
in shrink_inactive_list().

This is problematic. If kswapd encounters many pages under writeback and
it continues to scan until it reaches the high watermark then it will
quickly skip over the pages under writeback and reclaim clean young
pages or push applications out to swap.

The use of wait_iff_congested() is not suited to kswapd as it will only
stall if the underlying BDI is really congested or a direct reclaimer was
unable to write to the underlying BDI. kswapd bypasses the BDI congestion
as it sets PF_SWAPWRITE, but even if this were taken into account it would
cause direct reclaimers to stall on writeback, which is not desirable.

This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
encountering too many pages under writeback. If this flag is set and
kswapd encounters a PageReclaim page under writeback then it'll assume
that the LRU lists are being recycled too quickly before IO can complete
and block waiting for some IO to complete.
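
For reference, a condensed sketch of the decision this adds for a page
found under writeback at the tail of the LRU. The enum and function names
are invented for illustration; the real code in shrink_page_list() below
operates on page and zone flags directly.

#include <stdbool.h>

enum wb_action { WB_WAIT_FOR_IO, WB_MARK_RECLAIM_AND_SKIP };

static enum wb_action writeback_action(bool is_kswapd, bool page_reclaim,
				       bool zone_flagged_writeback,
				       bool global_reclaim, bool gfp_io)
{
	/* Case 1: kswapd sees pages recycling before their IO completes */
	if (is_kswapd && page_reclaim && zone_flagged_writeback)
		return WB_WAIT_FOR_IO;

	/* Case 2: mark for immediate reclaim and keep scanning */
	if (global_reclaim || !page_reclaim || !gfp_io)
		return WB_MARK_RECLAIM_AND_SKIP;

	/* Case 3: memcg reclaim with no dirty throttling, wait for the IO */
	return WB_WAIT_FOR_IO;
}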

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
---
include/linux/mmzone.h | 8 ++++++
mm/vmscan.c | 78 ++++++++++++++++++++++++++++++++++++--------------
2 files changed, 64 insertions(+), 22 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2aaf72f..fce64af 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -499,6 +499,9 @@ typedef enum {
* many dirty file pages at the tail
* of the LRU.
*/
+ ZONE_WRITEBACK, /* reclaim scanning has recently found
+ * many pages under writeback
+ */
} zone_flags_t;

static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -526,6 +529,11 @@ static inline int zone_is_reclaim_dirty(const struct zone *zone)
return test_bit(ZONE_TAIL_LRU_DIRTY, &zone->flags);
}

+static inline int zone_is_reclaim_writeback(const struct zone *zone)
+{
+ return test_bit(ZONE_WRITEBACK, &zone->flags);
+}
+
static inline int zone_is_reclaim_locked(const struct zone *zone)
{
return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d6c916d..911c9cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -724,25 +724,51 @@ static unsigned long shrink_page_list(struct list_head *page_list,
may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

+ /*
+ * If a page at the tail of the LRU is under writeback, there
+ * are three cases to consider.
+ *
+ * 1) If reclaim is encountering an excessive number of pages
+ * under writeback and this page is both under writeback and
+ * PageReclaim then it indicates that pages are being queued
+ * for IO but are being recycled through the LRU before the
+ * IO can complete. In this case, wait on the IO to complete
+ * and then clear the ZONE_WRITEBACK flag to recheck if the
+ * condition exists.
+ *
+ * 2) Global reclaim encounters a page, memcg encounters a
+ * page that is not marked for immediate reclaim or
+ * the caller does not have __GFP_IO. In this case mark
+ * the page for immediate reclaim and continue scanning.
+ *
+ * __GFP_IO is checked because a loop driver thread might
+ * enter reclaim, and deadlock if it waits on a page for
+ * which it is needed to do the write (loop masks off
+ * __GFP_IO|__GFP_FS for this reason); but more thought
+ * would probably show more reasons.
+ *
+ * Don't require __GFP_FS, since we're not going into the
+ * FS, just waiting on its writeback completion. Worryingly,
+ * ext4 gfs2 and xfs allocate pages with
+ * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing
+ * may_enter_fs here is liable to OOM on them.
+ *
+ * 3) memcg encounters a page that is not already marked
+ * PageReclaim. memcg does not have any dirty pages
+ * throttling so we could easily OOM just because too many
+ * pages are in writeback and there is nothing else to
+ * reclaim. Wait for the writeback to complete.
+ */
if (PageWriteback(page)) {
- /*
- * memcg doesn't have any dirty pages throttling so we
- * could easily OOM just because too many pages are in
- * writeback and there is nothing else to reclaim.
- *
- * Check __GFP_IO, certainly because a loop driver
- * thread might enter reclaim, and deadlock if it waits
- * on a page for which it is needed to do the write
- * (loop masks off __GFP_IO|__GFP_FS for this reason);
- * but more thought would probably show more reasons.
- *
- * Don't require __GFP_FS, since we're not going into
- * the FS, just waiting on its writeback completion.
- * Worryingly, ext4 gfs2 and xfs allocate pages with
- * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so
- * testing may_enter_fs here is liable to OOM on them.
- */
- if (global_reclaim(sc) ||
+ /* Case 1 above */
+ if (current_is_kswapd() &&
+ PageReclaim(page) &&
+ zone_is_reclaim_writeback(zone)) {
+ wait_on_page_writeback(page);
+ zone_clear_flag(zone, ZONE_WRITEBACK);
+
+ /* Case 2 above */
+ } else if (global_reclaim(sc) ||
!PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
/*
* This is slightly racy - end_page_writeback()
@@ -757,9 +783,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
SetPageReclaim(page);
nr_writeback++;
+
goto keep_locked;
+
+ /* Case 3 above */
+ } else {
+ wait_on_page_writeback(page);
}
- wait_on_page_writeback(page);
}

if (!force_reclaim)
@@ -1374,8 +1404,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* isolated page is PageWriteback
*/
if (nr_writeback && nr_writeback >=
- (nr_taken >> (DEF_PRIORITY - sc->priority)))
+ (nr_taken >> (DEF_PRIORITY - sc->priority))) {
wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+ zone_set_flag(zone, ZONE_WRITEBACK);
+ }

/*
* Similarly, if many dirty pages are encountered that are not
@@ -2669,8 +2701,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
* the high watermark.
*
* Returns true if kswapd scanned at least the requested number of pages to
- * reclaim. This is used to determine if the scanning priority needs to be
- * raised.
+ * reclaim or if the lack of progress was due to pages under writeback.
+ * This is used to determine if the scanning priority needs to be raised.
*/
static bool kswapd_shrink_zone(struct zone *zone,
struct scan_control *sc,
@@ -2697,6 +2729,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
if (nr_slab == 0 && !zone_reclaimable(zone))
zone->all_unreclaimable = 1;

+ zone_clear_flag(zone, ZONE_WRITEBACK);
+
return sc->nr_scanned >= sc->nr_to_reclaim;
}

--
1.8.1.4

2013-05-13 08:12:50

by Mel Gorman

Subject: [PATCH 4/9] mm: vmscan: Decide whether to compact the pgdat based on reclaim progress

In the past, kswapd made the decision on whether to compact memory after
the pgdat was considered balanced. This more or less worked but it is a
late point at which to make such a decision and it does not fit well now
that kswapd decides whether to exit the zone scanning loop based on
reclaim progress.

This patch will compact a pgdat if at least the requested number of pages
were reclaimed from unbalanced zones for a given priority. If any zone is
currently balanced, kswapd will not call compaction as it is expected the
necessary pages are already available.
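
A minimal sketch of that decision with placeholder names (nr_attempted
accumulates the per-zone reclaim targets for the pass); this is an
illustration of the logic, not the kernel code.

#include <stdbool.h>

static bool should_compact_pgdat(int order, bool any_zone_balanced,
				 unsigned long nr_reclaimed,
				 unsigned long nr_attempted)
{
	/* Only high-order wakeups can need compaction at all */
	if (order == 0)
		return false;

	/* A balanced zone should already have the necessary pages */
	if (any_zone_balanced)
		return false;

	/* Compact only if reclaim met the target it set for this pass */
	return nr_reclaimed > nr_attempted;
}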

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 59 ++++++++++++++++++++++++++++++-----------------------------
1 file changed, 30 insertions(+), 29 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1c10ee5..cd09803 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2661,7 +2661,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
*/
static bool kswapd_shrink_zone(struct zone *zone,
struct scan_control *sc,
- unsigned long lru_pages)
+ unsigned long lru_pages,
+ unsigned long *nr_attempted)
{
unsigned long nr_slab;
struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -2677,6 +2678,9 @@ static bool kswapd_shrink_zone(struct zone *zone,
nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
sc->nr_reclaimed += reclaim_state->reclaimed_slab;

+ /* Account for the number of pages attempted to reclaim */
+ *nr_attempted += sc->nr_to_reclaim;
+
if (nr_slab == 0 && !zone_reclaimable(zone))
zone->all_unreclaimable = 1;

@@ -2724,7 +2728,9 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,

do {
unsigned long lru_pages = 0;
+ unsigned long nr_attempted = 0;
bool raise_priority = true;
+ bool pgdat_needs_compaction = (order > 0);

sc.nr_reclaimed = 0;

@@ -2774,7 +2780,21 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;

+ if (!populated_zone(zone))
+ continue;
+
lru_pages += zone_reclaimable_pages(zone);
+
+ /*
+ * If any zone is currently balanced then kswapd will
+ * not call compaction as it is expected that the
+ * necessary pages are already available.
+ */
+ if (pgdat_needs_compaction &&
+ zone_watermark_ok(zone, order,
+ low_wmark_pages(zone),
+ *classzone_idx, 0))
+ pgdat_needs_compaction = false;
}

/*
@@ -2843,7 +2863,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
* already being scanned that high
* watermark would be met at 100% efficiency.
*/
- if (kswapd_shrink_zone(zone, &sc, lru_pages))
+ if (kswapd_shrink_zone(zone, &sc, lru_pages,
+ &nr_attempted))
raise_priority = false;
}

@@ -2896,6 +2917,13 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
break;

/*
+ * Compact if necessary and kswapd is reclaiming at least the
+ * high watermark number of pages as requsted
+ */
+ if (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted)
+ compact_pgdat(pgdat, order);
+
+ /*
* Raise priority if scanning rate is too low or there was no
* progress in reclaiming pages
*/
@@ -2904,33 +2932,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
} while (sc.priority >= 0 &&
!pgdat_balanced(pgdat, order, *classzone_idx));

- /*
- * If kswapd was reclaiming at a higher order, it has the option of
- * sleeping without all zones being balanced. Before it does, it must
- * ensure that the watermarks for order-0 on *all* zones are met and
- * that the congestion flags are cleared. The congestion flag must
- * be cleared as kswapd is the only mechanism that clears the flag
- * and it is potentially going to sleep here.
- */
- if (order) {
- int zones_need_compaction = 1;
-
- for (i = 0; i <= end_zone; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
- if (!populated_zone(zone))
- continue;
-
- /* Check if the memory needs to be defragmented. */
- if (zone_watermark_ok(zone, order,
- low_wmark_pages(zone), *classzone_idx, 0))
- zones_need_compaction = 0;
- }
-
- if (zones_need_compaction)
- compact_pgdat(pgdat, order);
- }
-
out:
/*
* Return the order we were reclaiming at so prepare_kswapd_sleep()
--
1.8.1.4

2013-05-13 08:14:36

by Mel Gorman

Subject: [PATCH 5/9] mm: vmscan: Do not allow kswapd to scan at maximum priority

Page reclaim at priority 0 will scan the entire LRU as priority 0 is
considered to be a near OOM condition. Kswapd can reach priority 0 quite
easily if it is encountering a large number of pages it cannot reclaim
such as pages under writeback. When this happens, kswapd reclaims very
aggressively even though there may be no real risk of allocation failure
or OOM.

This patch prevents kswapd reaching priority 0 and trying to reclaim
the world. Direct reclaimers will still reach priority 0 in the event
of an OOM situation.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cd09803..1505c57 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2929,7 +2929,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
*/
if (raise_priority || !sc.nr_reclaimed)
sc.priority--;
- } while (sc.priority >= 0 &&
+ } while (sc.priority >= 1 &&
!pgdat_balanced(pgdat, order, *classzone_idx));

out:
--
1.8.1.4

2013-05-13 08:15:01

by Mel Gorman

Subject: [PATCH 3/9] mm: vmscan: Flatten kswapd priority loop

kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX
pages have been reclaimed or the pgdat is considered balanced. It then
rechecks if it needs to restart at DEF_PRIORITY and whether high-order
reclaim needs to be reset. This is not wrong per se but it is confusing
to follow and forcing kswapd to stay at DEF_PRIORITY may require several
restarts before it has scanned enough pages to meet the high watermark even
at 100% efficiency. This patch irons out the logic a bit by controlling
when priority is raised and removing the "goto loop_again".

This patch has kswapd raise the scanning priority until it is scanning
enough pages that it could meet the high watermark in one shrink of the
LRU lists if it is able to reclaim at 100% efficiency. It will not raise
the scanning priority higher unless it is failing to reclaim any pages.

To avoid infinite looping on high-order allocation requests, kswapd will
stop reclaiming for high-order allocations once it has reclaimed at least
twice the number of pages requested by the allocation.
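
To illustrate the "one shrink at 100% efficiency" condition: a pass at
priority p scans roughly lru_pages >> p, so the scanning priority only
needs to be raised (the priority number lowered) until that amount covers
the reclaim target. A small userspace sketch with hypothetical numbers;
only DEF_PRIORITY and the shift correspond to the kernel.

#include <stdio.h>

#define DEF_PRIORITY 12

int main(void)
{
	unsigned long lru_pages = 1UL << 20;	/* hypothetical LRU size */
	unsigned long high_wmark = 8192;	/* hypothetical target */
	int priority = DEF_PRIORITY;

	/* Raise scanning priority while one pass cannot meet the target */
	while (priority > 0 && (lru_pages >> priority) < high_wmark) {
		printf("priority %2d scans %7lu pages, below target %lu\n",
		       priority, lru_pages >> priority, high_wmark);
		priority--;
	}
	printf("priority %2d scans %7lu pages, enough for one pass\n",
	       priority, lru_pages >> priority);
	return 0;
}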

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 86 +++++++++++++++++++++++++++++--------------------------------
1 file changed, 41 insertions(+), 45 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 26ad67f..1c10ee5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2654,8 +2654,12 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
/*
* kswapd shrinks the zone by the number of pages required to reach
* the high watermark.
+ *
+ * Returns true if kswapd scanned at least the requested number of pages to
+ * reclaim. This is used to determine if the scanning priority needs to be
+ * raised.
*/
-static void kswapd_shrink_zone(struct zone *zone,
+static bool kswapd_shrink_zone(struct zone *zone,
struct scan_control *sc,
unsigned long lru_pages)
{
@@ -2675,6 +2679,8 @@ static void kswapd_shrink_zone(struct zone *zone,

if (nr_slab == 0 && !zone_reclaimable(zone))
zone->all_unreclaimable = 1;
+
+ return sc->nr_scanned >= sc->nr_to_reclaim;
}

/*
@@ -2701,26 +2707,26 @@ static void kswapd_shrink_zone(struct zone *zone,
static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
int *classzone_idx)
{
- bool pgdat_is_balanced = false;
int i;
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
+ .priority = DEF_PRIORITY,
.may_unmap = 1,
.may_swap = 1,
+ .may_writepage = !laptop_mode,
.order = order,
.target_mem_cgroup = NULL,
};
-loop_again:
- sc.priority = DEF_PRIORITY;
- sc.nr_reclaimed = 0;
- sc.may_writepage = !laptop_mode;
count_vm_event(PAGEOUTRUN);

do {
unsigned long lru_pages = 0;
+ bool raise_priority = true;
+
+ sc.nr_reclaimed = 0;

/*
* Scan in the highmem->dma direction for the highest
@@ -2762,10 +2768,8 @@ loop_again:
}
}

- if (i < 0) {
- pgdat_is_balanced = true;
+ if (i < 0)
goto out;
- }

for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
@@ -2832,8 +2836,16 @@ loop_again:

if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
!zone_balanced(zone, testorder,
- balance_gap, end_zone))
- kswapd_shrink_zone(zone, &sc, lru_pages);
+ balance_gap, end_zone)) {
+ /*
+ * There should be no need to raise the
+ * scanning priority if enough pages are
+ * already being scanned that high
+ * watermark would be met at 100% efficiency.
+ */
+ if (kswapd_shrink_zone(zone, &sc, lru_pages))
+ raise_priority = false;
+ }

/*
* If we're getting trouble reclaiming, start doing
@@ -2868,46 +2880,29 @@ loop_again:
pfmemalloc_watermark_ok(pgdat))
wake_up(&pgdat->pfmemalloc_wait);

- if (pgdat_balanced(pgdat, order, *classzone_idx)) {
- pgdat_is_balanced = true;
- break; /* kswapd: all done */
- }
-
/*
- * We do this so kswapd doesn't build up large priorities for
- * example when it is freeing in parallel with allocators. It
- * matches the direct reclaim path behaviour in terms of impact
- * on zone->*_priority.
+ * Fragmentation may mean that the system cannot be rebalanced
+ * for high-order allocations in all zones. If twice the
+ * allocation size has been reclaimed and the zones are still
+ * not balanced then recheck the watermarks at order-0 to
+ * prevent kswapd reclaiming excessively. Assume that a
+ * process requested a high-order can direct reclaim/compact.
*/
- if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
- break;
- } while (--sc.priority >= 0);
-
-out:
- if (!pgdat_is_balanced) {
- cond_resched();
+ if (order && sc.nr_reclaimed >= 2UL << order)
+ order = sc.order = 0;

- try_to_freeze();
+ /* Check if kswapd should be suspending */
+ if (try_to_freeze() || kthread_should_stop())
+ break;

/*
- * Fragmentation may mean that the system cannot be
- * rebalanced for high-order allocations in all zones.
- * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
- * it means the zones have been fully scanned and are still
- * not balanced. For high-order allocations, there is
- * little point trying all over again as kswapd may
- * infinite loop.
- *
- * Instead, recheck all watermarks at order-0 as they
- * are the most important. If watermarks are ok, kswapd will go
- * back to sleep. High-order users can still perform direct
- * reclaim if they wish.
+ * Raise priority if scanning rate is too low or there was no
+ * progress in reclaiming pages
*/
- if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
- order = sc.order = 0;
-
- goto loop_again;
- }
+ if (raise_priority || !sc.nr_reclaimed)
+ sc.priority--;
+ } while (sc.priority >= 0 &&
+ !pgdat_balanced(pgdat, order, *classzone_idx));

/*
* If kswapd was reclaiming at a higher order, it has the option of
@@ -2936,6 +2931,7 @@ out:
compact_pgdat(pgdat, order);
}

+out:
/*
* Return the order we were reclaiming at so prepare_kswapd_sleep()
* makes a decision on the order we were last reclaiming at. However,
--
1.8.1.4

2013-05-13 08:15:39

by Mel Gorman

Subject: [PATCH 1/9] mm: vmscan: Limit the number of pages kswapd reclaims at each priority

The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority. In
many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX
reclaimed pages but in the event kswapd scans a large number of pages it
cannot reclaim, it will raise the priority and potentially discard a large
percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible
effect is a reclaim "spike" where a large percentage of memory is suddenly
freed. It would be bad enough if this was just unused memory but because
of how anon/file pages are balanced it is possible that applications get
pushed to swap unnecessarily.

This patch limits the number of pages kswapd will reclaim to the high
watermark. Reclaim will still overshoot due to it not being a hard limit as
shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities. The number of
pages it reclaims is not adjusted for high-order allocations as kswapd will
reclaim excessively if it is to balance zones for high-order allocations.
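
In other words, the per-zone target becomes
max(SWAP_CLUSTER_MAX, high_wmark_pages(zone)) instead of ULONG_MAX. A
trivial userspace illustration of the cap follows; the watermark values
are made up, only SWAP_CLUSTER_MAX (32) matches the kernel.

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

static unsigned long kswapd_nr_to_reclaim(unsigned long high_wmark)
{
	/* Reclaim above the high watermark, never less than one cluster */
	return high_wmark > SWAP_CLUSTER_MAX ? high_wmark : SWAP_CLUSTER_MAX;
}

int main(void)
{
	printf("tiny zone:  %lu pages\n", kswapd_nr_to_reclaim(16));
	printf("large zone: %lu pages\n", kswapd_nr_to_reclaim(24576));
	return 0;
}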

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/vmscan.c | 49 +++++++++++++++++++++++++++++--------------------
1 file changed, 29 insertions(+), 20 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fa6a853..cdbc069 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2601,6 +2601,32 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
}

/*
+ * kswapd shrinks the zone by the number of pages required to reach
+ * the high watermark.
+ */
+static void kswapd_shrink_zone(struct zone *zone,
+ struct scan_control *sc,
+ unsigned long lru_pages)
+{
+ unsigned long nr_slab;
+ struct reclaim_state *reclaim_state = current->reclaim_state;
+ struct shrink_control shrink = {
+ .gfp_mask = sc->gfp_mask,
+ };
+
+ /* Reclaim above the high watermark. */
+ sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
+ shrink_zone(zone, sc);
+
+ reclaim_state->reclaimed_slab = 0;
+ nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
+ sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+
+ if (nr_slab == 0 && !zone_reclaimable(zone))
+ zone->all_unreclaimable = 1;
+}
+
+/*
* For kswapd, balance_pgdat() will work across all this node's zones until
* they are all at high_wmark_pages(zone).
*
@@ -2627,24 +2653,15 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
bool pgdat_is_balanced = false;
int i;
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
- struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.may_unmap = 1,
.may_swap = 1,
- /*
- * kswapd doesn't want to be bailed out while reclaim. because
- * we want to put equal scanning pressure on each zone.
- */
- .nr_to_reclaim = ULONG_MAX,
.order = order,
.target_mem_cgroup = NULL,
};
- struct shrink_control shrink = {
- .gfp_mask = sc.gfp_mask,
- };
loop_again:
sc.priority = DEF_PRIORITY;
sc.nr_reclaimed = 0;
@@ -2716,7 +2733,7 @@ loop_again:
*/
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
- int nr_slab, testorder;
+ int testorder;
unsigned long balance_gap;

if (!populated_zone(zone))
@@ -2764,16 +2781,8 @@ loop_again:

if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
!zone_balanced(zone, testorder,
- balance_gap, end_zone)) {
- shrink_zone(zone, &sc);
-
- reclaim_state->reclaimed_slab = 0;
- nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
- sc.nr_reclaimed += reclaim_state->reclaimed_slab;
-
- if (nr_slab == 0 && !zone_reclaimable(zone))
- zone->all_unreclaimable = 1;
- }
+ balance_gap, end_zone))
+ kswapd_shrink_zone(zone, &sc, lru_pages);

/*
* If we're getting trouble reclaiming, start doing
--
1.8.1.4

2013-05-13 08:15:38

by Mel Gorman

Subject: [PATCH 2/9] mm: vmscan: Obey proportional scanning requirements for kswapd

Simplistically, the anon and file LRU lists are scanned proportionally
depending on the value of vm.swappiness although there are other factors
taken into account by get_scan_count(). The patch "mm: vmscan: Limit
the number of pages kswapd reclaims" limits the number of pages kswapd
reclaims but it breaks this proportional scanning and may evenly shrink
anon/file LRUs regardless of vm.swappiness.

This patch preserves the proportional scanning and reclaim. It does mean
that kswapd will reclaim more than requested but the number of pages will
be related to the high watermark.
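
To make the proportional adjustment easier to follow, here is a
self-contained userspace walk-through of the arithmetic with made-up scan
targets. The array layout mimics the kernel's four LRU lists but none of
this is kernel code; it only reproduces the percentage calculation in the
diff below.

#include <stdio.h>
#include <string.h>

enum { INACTIVE_ANON, ACTIVE_ANON, INACTIVE_FILE, ACTIVE_FILE, NR_LRU };

int main(void)
{
	/* Hypothetical targets from get_scan_count(): file-heavy workload */
	unsigned long targets[NR_LRU] = { 100, 50, 1000, 500 };
	unsigned long nr[NR_LRU];
	unsigned long nr_anon, nr_file, percentage, nr_scanned;
	int i, smaller, other;

	memcpy(nr, targets, sizeof(nr));

	/* Pretend the reclaim goal was met after 40 pages off each list */
	for (i = 0; i < NR_LRU; i++)
		nr[i] -= 40;

	nr_anon = nr[INACTIVE_ANON] + nr[ACTIVE_ANON];
	nr_file = nr[INACTIVE_FILE] + nr[ACTIVE_FILE];

	/* Stop scanning the smaller of the two LRU sets entirely... */
	smaller = (nr_file > nr_anon) ? INACTIVE_ANON : INACTIVE_FILE;
	other   = (smaller == INACTIVE_ANON) ? INACTIVE_FILE : INACTIVE_ANON;
	percentage = (smaller == INACTIVE_ANON) ?
		nr_anon * 100 / (targets[INACTIVE_ANON] +
				 targets[ACTIVE_ANON] + 1) :
		nr_file * 100 / (targets[INACTIVE_FILE] +
				 targets[ACTIVE_FILE] + 1);
	nr[smaller] = nr[smaller + 1] = 0;

	/* ...and scale the other set back to its unscanned proportion */
	for (i = other; i <= other + 1; i++) {
		nr_scanned = targets[i] - nr[i];
		nr[i] = targets[i] * (100 - percentage) / 100;
		nr[i] -= (nr[i] < nr_scanned) ? nr[i] : nr_scanned;
	}

	printf("remaining to scan: anon %lu/%lu file %lu/%lu\n",
	       nr[INACTIVE_ANON], nr[ACTIVE_ANON],
	       nr[INACTIVE_FILE], nr[ACTIVE_FILE]);
	return 0;
}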

[[email protected]: Correct proportional reclaim for memcg and simplify]
[[email protected]: Recalculate scan based on target]
[[email protected]: Account for already scanned pages properly]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
mm/vmscan.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 59 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cdbc069..26ad67f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1822,17 +1822,25 @@ out:
static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
+ unsigned long targets[NR_LRU_LISTS];
unsigned long nr_to_scan;
enum lru_list lru;
unsigned long nr_reclaimed = 0;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
struct blk_plug plug;
+ bool scan_adjusted = false;

get_scan_count(lruvec, sc, nr);

+ /* Record the original scan target for proportional adjustments later */
+ memcpy(targets, nr, sizeof(nr));
+
blk_start_plug(&plug);
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
+ unsigned long nr_anon, nr_file, percentage;
+ unsigned long nr_scanned;
+
for_each_evictable_lru(lru) {
if (nr[lru]) {
nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
@@ -1842,17 +1850,60 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
lruvec, sc);
}
}
+
+ if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
+ continue;
+
/*
- * On large memory systems, scan >> priority can become
- * really large. This is fine for the starting priority;
- * we want to put equal scanning pressure on each zone.
- * However, if the VM has a harder time of freeing pages,
- * with multiple processes reclaiming pages, the total
- * freeing target can get unreasonably large.
+ * For global direct reclaim, reclaim only the number of pages
+ * requested. Less care is taken to scan proportionally as it
+ * is more important to minimise direct reclaim stall latency
+ * than it is to properly age the LRU lists.
*/
- if (nr_reclaimed >= nr_to_reclaim &&
- sc->priority < DEF_PRIORITY)
+ if (global_reclaim(sc) && !current_is_kswapd())
break;
+
+ /*
+ * For kswapd and memcg, reclaim at least the number of pages
+ * requested. Ensure that the anon and file LRUs shrink
+ * proportionally what was requested by get_scan_count(). We
+ * stop reclaiming one LRU and reduce the amount scanning
+ * proportional to the original scan target.
+ */
+ nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
+ nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
+
+ if (nr_file > nr_anon) {
+ unsigned long scan_target = targets[LRU_INACTIVE_ANON] +
+ targets[LRU_ACTIVE_ANON] + 1;
+ lru = LRU_BASE;
+ percentage = nr_anon * 100 / scan_target;
+ } else {
+ unsigned long scan_target = targets[LRU_INACTIVE_FILE] +
+ targets[LRU_ACTIVE_FILE] + 1;
+ lru = LRU_FILE;
+ percentage = nr_file * 100 / scan_target;
+ }
+
+ /* Stop scanning the smaller of the LRU */
+ nr[lru] = 0;
+ nr[lru + LRU_ACTIVE] = 0;
+
+ /*
+ * Recalculate the other LRU scan count based on its original
+ * scan target and the percentage scanning already complete
+ */
+ lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
+ nr_scanned = targets[lru] - nr[lru];
+ nr[lru] = targets[lru] * (100 - percentage) / 100;
+ nr[lru] -= min(nr[lru], nr_scanned);
+
+ lru += LRU_ACTIVE;
+ nr_scanned = targets[lru] - nr[lru];
+ nr[lru] = targets[lru] * (100 - percentage) / 100;
+ nr[lru] -= min(nr[lru], nr_scanned);
+
+ scan_adjusted = true;
}
blk_finish_plug(&plug);
sc->nr_reclaimed += nr_reclaimed;
--
1.8.1.4

2013-05-14 10:21:41

by Michal Hocko

Subject: Re: [PATCH 2/9] mm: vmscan: Obey proportional scanning requirements for kswapd

On Mon 13-05-13 09:12:33, Mel Gorman wrote:
> Simplistically, the anon and file LRU lists are scanned proportionally
> depending on the value of vm.swappiness although there are other factors
> taken into account by get_scan_count(). The patch "mm: vmscan: Limit
> the number of pages kswapd reclaims" limits the number of pages kswapd
> reclaims but it breaks this proportional scanning and may evenly shrink
> anon/file LRUs regardless of vm.swappiness.
>
> This patch preserves the proportional scanning and reclaim. It does mean
> that kswapd will reclaim more than requested but the number of pages will
> be related to the high watermark.
>
> [[email protected]: Correct proportional reclaim for memcg and simplify]
> [[email protected]: Recalculate scan based on target]
> [[email protected]: Account for already scanned pages properly]
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: Rik van Riel <[email protected]>

active vs. inactive might get skewed a bit AFAICS because both of them
are zeroed but file vs. anon should be scanned proportionally based on
swappiness now which sounds like it is good enough.

Reviewed-by: Michal Hocko <[email protected]>

> ---
> mm/vmscan.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 59 insertions(+), 8 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cdbc069..26ad67f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1822,17 +1822,25 @@ out:
> static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> {
> unsigned long nr[NR_LRU_LISTS];
> + unsigned long targets[NR_LRU_LISTS];
> unsigned long nr_to_scan;
> enum lru_list lru;
> unsigned long nr_reclaimed = 0;
> unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> struct blk_plug plug;
> + bool scan_adjusted = false;
>
> get_scan_count(lruvec, sc, nr);
>
> + /* Record the original scan target for proportional adjustments later */
> + memcpy(targets, nr, sizeof(nr));
> +
> blk_start_plug(&plug);
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> nr[LRU_INACTIVE_FILE]) {
> + unsigned long nr_anon, nr_file, percentage;
> + unsigned long nr_scanned;
> +
> for_each_evictable_lru(lru) {
> if (nr[lru]) {
> nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
> @@ -1842,17 +1850,60 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> lruvec, sc);
> }
> }
> +
> + if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
> + continue;
> +
> /*
> - * On large memory systems, scan >> priority can become
> - * really large. This is fine for the starting priority;
> - * we want to put equal scanning pressure on each zone.
> - * However, if the VM has a harder time of freeing pages,
> - * with multiple processes reclaiming pages, the total
> - * freeing target can get unreasonably large.
> + * For global direct reclaim, reclaim only the number of pages
> + * requested. Less care is taken to scan proportionally as it
> + * is more important to minimise direct reclaim stall latency
> + * than it is to properly age the LRU lists.
> */
> - if (nr_reclaimed >= nr_to_reclaim &&
> - sc->priority < DEF_PRIORITY)
> + if (global_reclaim(sc) && !current_is_kswapd())
> break;
> +
> + /*
> + * For kswapd and memcg, reclaim at least the number of pages
> + * requested. Ensure that the anon and file LRUs shrink
> + * proportionally what was requested by get_scan_count(). We
> + * stop reclaiming one LRU and reduce the amount scanning
> + * proportional to the original scan target.
> + */
> + nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
> + nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
> +
> + if (nr_file > nr_anon) {
> + unsigned long scan_target = targets[LRU_INACTIVE_ANON] +
> + targets[LRU_ACTIVE_ANON] + 1;
> + lru = LRU_BASE;
> + percentage = nr_anon * 100 / scan_target;
> + } else {
> + unsigned long scan_target = targets[LRU_INACTIVE_FILE] +
> + targets[LRU_ACTIVE_FILE] + 1;
> + lru = LRU_FILE;
> + percentage = nr_file * 100 / scan_target;
> + }
> +
> + /* Stop scanning the smaller of the LRU */
> + nr[lru] = 0;
> + nr[lru + LRU_ACTIVE] = 0;
> +
> + /*
> + * Recalculate the other LRU scan count based on its original
> + * scan target and the percentage scanning already complete
> + */
> + lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
> + nr_scanned = targets[lru] - nr[lru];
> + nr[lru] = targets[lru] * (100 - percentage) / 100;
> + nr[lru] -= min(nr[lru], nr_scanned);
> +
> + lru += LRU_ACTIVE;
> + nr_scanned = targets[lru] - nr[lru];
> + nr[lru] = targets[lru] * (100 - percentage) / 100;
> + nr[lru] -= min(nr[lru], nr_scanned);
> +
> + scan_adjusted = true;
> }
> blk_finish_plug(&plug);
> sc->nr_reclaimed += nr_reclaimed;
> --
> 1.8.1.4
>

--
Michal Hocko
SUSE Labs

2013-05-14 10:38:29

by Michal Hocko

Subject: Re: [PATCH 3/9] mm: vmscan: Flatten kswapd priority loop

On Mon 13-05-13 09:12:34, Mel Gorman wrote:
> kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX
> pages have been reclaimed or the pgdat is considered balanced. It then
> rechecks if it needs to restart at DEF_PRIORITY and whether high-order
> reclaim needs to be reset. This is not wrong per-se but it is confusing
> to follow and forcing kswapd to stay at DEF_PRIORITY may require several
> restarts before it has scanned enough pages to meet the high watermark even
> at 100% efficiency. This patch irons out the logic a bit by controlling
> when priority is raised and removing the "goto loop_again".
>
> This patch has kswapd raise the scanning priority until it is scanning
> enough pages that it could meet the high watermark in one shrink of the
> LRU lists if it is able to reclaim at 100% efficiency. It will not raise
> the scanning prioirty higher unless it is failing to reclaim any pages.
>
> To avoid infinite looping for high-order allocation requests kswapd will
> not reclaim for high-order allocations when it has reclaimed at least
> twice the number of pages as the allocation request.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

> ---
> mm/vmscan.c | 86 +++++++++++++++++++++++++++++--------------------------------
> 1 file changed, 41 insertions(+), 45 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 26ad67f..1c10ee5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2654,8 +2654,12 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> /*
> * kswapd shrinks the zone by the number of pages required to reach
> * the high watermark.
> + *
> + * Returns true if kswapd scanned at least the requested number of pages to
> + * reclaim. This is used to determine if the scanning priority needs to be
> + * raised.
> */
> -static void kswapd_shrink_zone(struct zone *zone,
> +static bool kswapd_shrink_zone(struct zone *zone,
> struct scan_control *sc,
> unsigned long lru_pages)
> {
> @@ -2675,6 +2679,8 @@ static void kswapd_shrink_zone(struct zone *zone,
>
> if (nr_slab == 0 && !zone_reclaimable(zone))
> zone->all_unreclaimable = 1;
> +
> + return sc->nr_scanned >= sc->nr_to_reclaim;
> }
>
> /*
> @@ -2701,26 +2707,26 @@ static void kswapd_shrink_zone(struct zone *zone,
> static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> int *classzone_idx)
> {
> - bool pgdat_is_balanced = false;
> int i;
> int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
> unsigned long nr_soft_reclaimed;
> unsigned long nr_soft_scanned;
> struct scan_control sc = {
> .gfp_mask = GFP_KERNEL,
> + .priority = DEF_PRIORITY,
> .may_unmap = 1,
> .may_swap = 1,
> + .may_writepage = !laptop_mode,
> .order = order,
> .target_mem_cgroup = NULL,
> };
> -loop_again:
> - sc.priority = DEF_PRIORITY;
> - sc.nr_reclaimed = 0;
> - sc.may_writepage = !laptop_mode;
> count_vm_event(PAGEOUTRUN);
>
> do {
> unsigned long lru_pages = 0;
> + bool raise_priority = true;
> +
> + sc.nr_reclaimed = 0;
>
> /*
> * Scan in the highmem->dma direction for the highest
> @@ -2762,10 +2768,8 @@ loop_again:
> }
> }
>
> - if (i < 0) {
> - pgdat_is_balanced = true;
> + if (i < 0)
> goto out;
> - }
>
> for (i = 0; i <= end_zone; i++) {
> struct zone *zone = pgdat->node_zones + i;
> @@ -2832,8 +2836,16 @@ loop_again:
>
> if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
> !zone_balanced(zone, testorder,
> - balance_gap, end_zone))
> - kswapd_shrink_zone(zone, &sc, lru_pages);
> + balance_gap, end_zone)) {
> + /*
> + * There should be no need to raise the
> + * scanning priority if enough pages are
> + * already being scanned that high
> + * watermark would be met at 100% efficiency.
> + */
> + if (kswapd_shrink_zone(zone, &sc, lru_pages))
> + raise_priority = false;
> + }
>
> /*
> * If we're getting trouble reclaiming, start doing
> @@ -2868,46 +2880,29 @@ loop_again:
> pfmemalloc_watermark_ok(pgdat))
> wake_up(&pgdat->pfmemalloc_wait);
>
> - if (pgdat_balanced(pgdat, order, *classzone_idx)) {
> - pgdat_is_balanced = true;
> - break; /* kswapd: all done */
> - }
> -
> /*
> - * We do this so kswapd doesn't build up large priorities for
> - * example when it is freeing in parallel with allocators. It
> - * matches the direct reclaim path behaviour in terms of impact
> - * on zone->*_priority.
> + * Fragmentation may mean that the system cannot be rebalanced
> + * for high-order allocations in all zones. If twice the
> + * allocation size has been reclaimed and the zones are still
> + * not balanced then recheck the watermarks at order-0 to
> + * prevent kswapd reclaiming excessively. Assume that a
> + * process requested a high-order can direct reclaim/compact.
> */
> - if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> - break;
> - } while (--sc.priority >= 0);
> -
> -out:
> - if (!pgdat_is_balanced) {
> - cond_resched();
> + if (order && sc.nr_reclaimed >= 2UL << order)
> + order = sc.order = 0;
>
> - try_to_freeze();
> + /* Check if kswapd should be suspending */
> + if (try_to_freeze() || kthread_should_stop())
> + break;
>
> /*
> - * Fragmentation may mean that the system cannot be
> - * rebalanced for high-order allocations in all zones.
> - * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
> - * it means the zones have been fully scanned and are still
> - * not balanced. For high-order allocations, there is
> - * little point trying all over again as kswapd may
> - * infinite loop.
> - *
> - * Instead, recheck all watermarks at order-0 as they
> - * are the most important. If watermarks are ok, kswapd will go
> - * back to sleep. High-order users can still perform direct
> - * reclaim if they wish.
> + * Raise priority if scanning rate is too low or there was no
> + * progress in reclaiming pages
> */
> - if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> - order = sc.order = 0;
> -
> - goto loop_again;
> - }
> + if (raise_priority || !sc.nr_reclaimed)
> + sc.priority--;
> + } while (sc.priority >= 0 &&
> + !pgdat_balanced(pgdat, order, *classzone_idx));
>
> /*
> * If kswapd was reclaiming at a higher order, it has the option of
> @@ -2936,6 +2931,7 @@ out:
> compact_pgdat(pgdat, order);
> }
>
> +out:
> /*
> * Return the order we were reclaiming at so prepare_kswapd_sleep()
> * makes a decision on the order we were last reclaiming at. However,
> --
> 1.8.1.4
>

--
Michal Hocko
SUSE Labs

2013-05-14 10:51:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 4/9] mm: vmscan: Decide whether to compact the pgdat based on reclaim progress

On Mon 13-05-13 09:12:35, Mel Gorman wrote:
> In the past, kswapd made a decision on whether to compact memory after the
> pgdat was considered balanced. This more or less worked but it is a late
> point at which to make such a decision and does not fit well now that kswapd
> decides whether to exit the zone scanning loop depending on reclaim progress.
>
> This patch will compact a pgdat if at least the requested number of pages
> were reclaimed from unbalanced zones for a given priority. If any zone is
> currently balanced, kswapd will not call compaction as it is expected the
> necessary pages are already available.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

> ---
> mm/vmscan.c | 59 ++++++++++++++++++++++++++++++-----------------------------
> 1 file changed, 30 insertions(+), 29 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1c10ee5..cd09803 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2661,7 +2661,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> */
> static bool kswapd_shrink_zone(struct zone *zone,
> struct scan_control *sc,
> - unsigned long lru_pages)
> + unsigned long lru_pages,
> + unsigned long *nr_attempted)
> {
> unsigned long nr_slab;
> struct reclaim_state *reclaim_state = current->reclaim_state;
> @@ -2677,6 +2678,9 @@ static bool kswapd_shrink_zone(struct zone *zone,
> nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
> sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>
> + /* Account for the number of pages attempted to reclaim */
> + *nr_attempted += sc->nr_to_reclaim;
> +
> if (nr_slab == 0 && !zone_reclaimable(zone))
> zone->all_unreclaimable = 1;
>
> @@ -2724,7 +2728,9 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>
> do {
> unsigned long lru_pages = 0;
> + unsigned long nr_attempted = 0;
> bool raise_priority = true;
> + bool pgdat_needs_compaction = (order > 0);
>
> sc.nr_reclaimed = 0;
>
> @@ -2774,7 +2780,21 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> for (i = 0; i <= end_zone; i++) {
> struct zone *zone = pgdat->node_zones + i;
>
> + if (!populated_zone(zone))
> + continue;
> +
> lru_pages += zone_reclaimable_pages(zone);
> +
> + /*
> + * If any zone is currently balanced then kswapd will
> + * not call compaction as it is expected that the
> + * necessary pages are already available.
> + */
> + if (pgdat_needs_compaction &&
> + zone_watermark_ok(zone, order,
> + low_wmark_pages(zone),
> + *classzone_idx, 0))
> + pgdat_needs_compaction = false;
> }
>
> /*
> @@ -2843,7 +2863,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> * already being scanned that high
> * watermark would be met at 100% efficiency.
> */
> - if (kswapd_shrink_zone(zone, &sc, lru_pages))
> + if (kswapd_shrink_zone(zone, &sc, lru_pages,
> + &nr_attempted))
> raise_priority = false;
> }
>
> @@ -2896,6 +2917,13 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> break;
>
> /*
> + * Compact if necessary and kswapd is reclaiming at least the
> + * high watermark number of pages as requested
> + */
> + if (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted)
> + compact_pgdat(pgdat, order);
> +
> + /*
> * Raise priority if scanning rate is too low or there was no
> * progress in reclaiming pages
> */
> @@ -2904,33 +2932,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> } while (sc.priority >= 0 &&
> !pgdat_balanced(pgdat, order, *classzone_idx));
>
> - /*
> - * If kswapd was reclaiming at a higher order, it has the option of
> - * sleeping without all zones being balanced. Before it does, it must
> - * ensure that the watermarks for order-0 on *all* zones are met and
> - * that the congestion flags are cleared. The congestion flag must
> - * be cleared as kswapd is the only mechanism that clears the flag
> - * and it is potentially going to sleep here.
> - */
> - if (order) {
> - int zones_need_compaction = 1;
> -
> - for (i = 0; i <= end_zone; i++) {
> - struct zone *zone = pgdat->node_zones + i;
> -
> - if (!populated_zone(zone))
> - continue;
> -
> - /* Check if the memory needs to be defragmented. */
> - if (zone_watermark_ok(zone, order,
> - low_wmark_pages(zone), *classzone_idx, 0))
> - zones_need_compaction = 0;
> - }
> -
> - if (zones_need_compaction)
> - compact_pgdat(pgdat, order);
> - }
> -
> out:
> /*
> * Return the order we were reclaiming at so prepare_kswapd_sleep()
> --
> 1.8.1.4
>

--
Michal Hocko
SUSE Labs

2013-05-14 11:25:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 6/9] mm: vmscan: Have kswapd writeback pages based on dirty pages encountered, not priority

On Mon 13-05-13 09:12:37, Mel Gorman wrote:
> Currently kswapd queues dirty pages for writeback if scanning at an elevated
> priority but the priority kswapd scans at is not related to the number
> of unqueued dirty pages encountered. Since commit "mm: vmscan: Flatten kswapd
> priority loop", the priority is related to the size of the LRU and the
> zone watermark which is no indication as to whether kswapd should write
> pages or not.
>
> This patch tracks if an excessive number of unqueued dirty pages are being
> encountered at the end of the LRU. If so, it indicates that dirty pages
> are being recycled before flusher threads can clean them and flags the
> zone so that kswapd will start writing pages until the zone is balanced.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>

I do not see the direct reclaim clearing the flag. Although direct
reclaim ignores the flag it still sets it without clearing it. This
means that you rely on parallel kswapd to clear it.
We do the same thing with ZONE_CONGESTED but I think this should be at
least documented somewhere.
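
Something along these lines next to the flag definition in mmzone.h would be
enough for me (a sketch only, not part of the patch):

	ZONE_TAIL_LRU_DIRTY,	/* reclaim scanning has recently found
				 * many dirty file pages at the tail
				 * of the LRU. Set from
				 * shrink_inactive_list() by any global
				 * reclaimer; cleared only from kswapd
				 * context once the zone is balanced
				 * again.
				 */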

Other than that
Reviewed-by: Michal Hocko <[email protected]>

> ---
> include/linux/mmzone.h | 9 +++++++++
> mm/vmscan.c | 31 +++++++++++++++++++++++++------
> 2 files changed, 34 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 5c76737..2aaf72f 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -495,6 +495,10 @@ typedef enum {
> ZONE_CONGESTED, /* zone has many dirty pages backed by
> * a congested BDI
> */
> + ZONE_TAIL_LRU_DIRTY, /* reclaim scanning has recently found
> + * many dirty file pages at the tail
> + * of the LRU.
> + */
> } zone_flags_t;
>
> static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
> @@ -517,6 +521,11 @@ static inline int zone_is_reclaim_congested(const struct zone *zone)
> return test_bit(ZONE_CONGESTED, &zone->flags);
> }
>
> +static inline int zone_is_reclaim_dirty(const struct zone *zone)
> +{
> + return test_bit(ZONE_TAIL_LRU_DIRTY, &zone->flags);
> +}
> +
> static inline int zone_is_reclaim_locked(const struct zone *zone)
> {
> return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1505c57..d6c916d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -676,13 +676,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> struct zone *zone,
> struct scan_control *sc,
> enum ttu_flags ttu_flags,
> - unsigned long *ret_nr_dirty,
> + unsigned long *ret_nr_unqueued_dirty,
> unsigned long *ret_nr_writeback,
> bool force_reclaim)
> {
> LIST_HEAD(ret_pages);
> LIST_HEAD(free_pages);
> int pgactivate = 0;
> + unsigned long nr_unqueued_dirty = 0;
> unsigned long nr_dirty = 0;
> unsigned long nr_congested = 0;
> unsigned long nr_reclaimed = 0;
> @@ -808,14 +809,17 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> if (PageDirty(page)) {
> nr_dirty++;
>
> + if (!PageWriteback(page))
> + nr_unqueued_dirty++;
> +
> /*
> * Only kswapd can writeback filesystem pages to
> - * avoid risk of stack overflow but do not writeback
> - * unless under significant pressure.
> + * avoid risk of stack overflow but only writeback
> + * if many dirty pages have been encountered.
> */
> if (page_is_file_cache(page) &&
> (!current_is_kswapd() ||
> - sc->priority >= DEF_PRIORITY - 2)) {
> + !zone_is_reclaim_dirty(zone))) {
> /*
> * Immediately reclaim when written back.
> * Similar in principal to deactivate_page()
> @@ -960,7 +964,7 @@ keep:
> list_splice(&ret_pages, page_list);
> count_vm_events(PGACTIVATE, pgactivate);
> mem_cgroup_uncharge_end();
> - *ret_nr_dirty += nr_dirty;
> + *ret_nr_unqueued_dirty += nr_unqueued_dirty;
> *ret_nr_writeback += nr_writeback;
> return nr_reclaimed;
> }
> @@ -1373,6 +1377,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> (nr_taken >> (DEF_PRIORITY - sc->priority)))
> wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
>
> + /*
> + * Similarly, if many dirty pages are encountered that are not
> + * currently being written then flag that kswapd should start
> + * writing back pages.
> + */
> + if (global_reclaim(sc) && nr_dirty &&
> + nr_dirty >= (nr_taken >> (DEF_PRIORITY - sc->priority)))
> + zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY);
> +
> trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
> zone_idx(zone),
> nr_scanned, nr_reclaimed,
> @@ -2769,8 +2782,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> end_zone = i;
> break;
> } else {
> - /* If balanced, clear the congested flag */
> + /*
> + * If balanced, clear the dirty and congested
> + * flags
> + */
> zone_clear_flag(zone, ZONE_CONGESTED);
> + zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
> }
> }
>
> @@ -2888,8 +2905,10 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> * possible there are dirty pages backed by
> * congested BDIs but as pressure is relieved,
> * speculatively avoid congestion waits
> + * or writing pages from kswapd context.
> */
> zone_clear_flag(zone, ZONE_CONGESTED);
> + zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
> }
>
> /*
> --
> 1.8.1.4
>

--
Michal Hocko
SUSE Labs

2013-05-14 12:23:29

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 9/9] mm: vmscan: Move logic from balance_pgdat() to kswapd_shrink_zone()

On Mon 13-05-13 09:12:40, Mel Gorman wrote:
> balance_pgdat() is very long and some of the logic can and should
> be internal to kswapd_shrink_zone(). Move it so the flow of
> balance_pgdat() is marginally easier to follow.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>

Looks good
Reviewed-by: Michal Hocko <[email protected]>

> ---
> mm/vmscan.c | 110 +++++++++++++++++++++++++++++-------------------------------
> 1 file changed, 54 insertions(+), 56 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e65fe46..0ba9d3a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2705,18 +2705,53 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> * This is used to determine if the scanning priority needs to be raised.
> */
> static bool kswapd_shrink_zone(struct zone *zone,
> + int classzone_idx,
> struct scan_control *sc,
> unsigned long lru_pages,
> unsigned long *nr_attempted)
> {
> unsigned long nr_slab;
> + int testorder = sc->order;
> + unsigned long balance_gap;
> struct reclaim_state *reclaim_state = current->reclaim_state;
> struct shrink_control shrink = {
> .gfp_mask = sc->gfp_mask,
> };
> + bool lowmem_pressure;
>
> /* Reclaim above the high watermark. */
> sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> +
> + /*
> + * Kswapd reclaims only single pages with compaction enabled. Trying
> + * too hard to reclaim until contiguous free pages have become
> + * available can hurt performance by evicting too much useful data
> + * from memory. Do not reclaim more than needed for compaction.
> + */
> + if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
> + compaction_suitable(zone, sc->order) !=
> + COMPACT_SKIPPED)
> + testorder = 0;
> +
> + /*
> + * We put equal pressure on every zone, unless one zone has way too
> + * many pages free already. The "too many pages" is defined as the
> + * high wmark plus a "gap" where the gap is either the low
> + * watermark or 1% of the zone, whichever is smaller.
> + */
> + balance_gap = min(low_wmark_pages(zone),
> + (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
> + KSWAPD_ZONE_BALANCE_GAP_RATIO);
> +
> + /*
> + * If there is no low memory pressure or the zone is balanced then no
> + * reclaim is necessary
> + */
> + lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
> + if (!lowmem_pressure && zone_balanced(zone, testorder,
> + balance_gap, classzone_idx))
> + return true;
> +
> shrink_zone(zone, sc);
>
> reclaim_state->reclaimed_slab = 0;
> @@ -2731,6 +2766,18 @@ static bool kswapd_shrink_zone(struct zone *zone,
>
> zone_clear_flag(zone, ZONE_WRITEBACK);
>
> + /*
> + * If a zone reaches its high watermark, consider it to be no longer
> + * congested. It's possible there are dirty pages backed by congested
> + * BDIs but as pressure is relieved, speculatively avoid congestion
> + * waits.
> + */
> + if (!zone->all_unreclaimable &&
> + zone_balanced(zone, testorder, 0, classzone_idx)) {
> + zone_clear_flag(zone, ZONE_CONGESTED);
> + zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
> + }
> +
> return sc->nr_scanned >= sc->nr_to_reclaim;
> }
>
> @@ -2866,8 +2913,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> */
> for (i = 0; i <= end_zone; i++) {
> struct zone *zone = pgdat->node_zones + i;
> - int testorder;
> - unsigned long balance_gap;
>
> if (!populated_zone(zone))
> continue;
> @@ -2888,61 +2933,14 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> sc.nr_reclaimed += nr_soft_reclaimed;
>
> /*
> - * We put equal pressure on every zone, unless
> - * one zone has way too many pages free
> - * already. The "too many pages" is defined
> - * as the high wmark plus a "gap" where the
> - * gap is either the low watermark or 1%
> - * of the zone, whichever is smaller.
> - */
> - balance_gap = min(low_wmark_pages(zone),
> - (zone->managed_pages +
> - KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
> - KSWAPD_ZONE_BALANCE_GAP_RATIO);
> - /*
> - * Kswapd reclaims only single pages with compaction
> - * enabled. Trying too hard to reclaim until contiguous
> - * free pages have become available can hurt performance
> - * by evicting too much useful data from memory.
> - * Do not reclaim more than needed for compaction.
> + * There should be no need to raise the scanning
> + * priority if enough pages are already being scanned
> + * that the high watermark would be met at 100%
> + * efficiency.
> */
> - testorder = order;
> - if (IS_ENABLED(CONFIG_COMPACTION) && order &&
> - compaction_suitable(zone, order) !=
> - COMPACT_SKIPPED)
> - testorder = 0;
> -
> - if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
> - !zone_balanced(zone, testorder,
> - balance_gap, end_zone)) {
> - /*
> - * There should be no need to raise the
> - * scanning priority if enough pages are
> - * already being scanned that high
> - * watermark would be met at 100% efficiency.
> - */
> - if (kswapd_shrink_zone(zone, &sc, lru_pages,
> - &nr_attempted))
> - raise_priority = false;
> - }
> -
> - if (zone->all_unreclaimable) {
> - if (end_zone && end_zone == i)
> - end_zone--;
> - continue;
> - }
> -
> - if (zone_balanced(zone, testorder, 0, end_zone))
> - /*
> - * If a zone reaches its high watermark,
> - * consider it to be no longer congested. It's
> - * possible there are dirty pages backed by
> - * congested BDIs but as pressure is relieved,
> - * speculatively avoid congestion waits
> - * or writing pages from kswapd context.
> - */
> - zone_clear_flag(zone, ZONE_CONGESTED);
> - zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY);
> + if (kswapd_shrink_zone(zone, end_zone, &sc,
> + lru_pages, &nr_attempted))
> + raise_priority = false;
> }
>
> /*
> --
> 1.8.1.4
>

--
Michal Hocko
SUSE Labs

2013-05-14 21:08:37

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 9/9] mm: vmscan: Move logic from balance_pgdat() to kswapd_shrink_zone()

On 05/13/2013 04:12 AM, Mel Gorman wrote:
> balance_pgdat() is very long and some of the logic can and should
> be internal to kswapd_shrink_zone(). Move it so the flow of
> balance_pgdat() is marginally easier to follow.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>

Acked-by: Rik van Riel <[email protected]>

2013-05-14 21:09:13

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 7/9] mm: vmscan: Block kswapd if it is encountering pages under writeback

On 05/13/2013 04:12 AM, Mel Gorman wrote:
> Historically, kswapd used to congestion_wait() at higher priorities if it
> was not making forward progress. This made no sense as the failure to make
> progress could be completely independent of IO. It was later replaced by
> wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't
> wait on congested zones in balance_pgdat()) as it was duplicating logic
> in shrink_inactive_list().
>
> This is problematic. If kswapd encounters many pages under writeback and
> it continues to scan until it reaches the high watermark then it will
> quickly skip over the pages under writeback and reclaim clean young
> pages or push applications out to swap.
>
> The use of wait_iff_congested() is not suited to kswapd as it will only
> stall if the underlying BDI is really congested or a direct reclaimer was
> unable to write to the underlying BDI. kswapd bypasses the BDI congestion
> as it sets PF_SWAPWRITE but even if this was taken into account then it
> would cause direct reclaimers to stall on writeback which is not desirable.
>
> This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
> encountering too many pages under writeback. If this flag is set and
> kswapd encounters a PageReclaim page under writeback then it'll assume
> that the LRU lists are being recycled too quickly before IO can complete
> and block waiting for some IO to complete.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Michal Hocko <[email protected]>

Acked-by: Rik van Riel <[email protected]>

2013-05-15 20:37:51

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/9] Reduce system disruption due to kswapd V4

On Mon, 13 May 2013 09:12:31 +0100 Mel Gorman <[email protected]> wrote:

> This series does not fix all the current known problems with reclaim but
> it addresses one important swapping bug when there is background IO.
>
> ...
>
> This was tested using memcached+memcachetest while some background IO
> was in progress as implemented by the parallel IO tests implement in MM
> Tests. memcachetest benchmarks how many operations/second memcached can
> service and it is run multiple times. It starts with no background IO and
> then re-runs the test with larger amounts of IO in the background to roughly
> simulate a large copy in progress. The expectation is that the IO should
> have little or no impact on memcachetest which is running entirely in memory.
>
> 3.10.0-rc1 3.10.0-rc1
> vanilla lessdisrupt-v4
> Ops memcachetest-0M 22155.00 ( 0.00%) 22180.00 ( 0.11%)
> Ops memcachetest-715M 22720.00 ( 0.00%) 22355.00 ( -1.61%)
> Ops memcachetest-2385M 3939.00 ( 0.00%) 23450.00 (495.33%)
> Ops memcachetest-4055M 3628.00 ( 0.00%) 24341.00 (570.92%)
> Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
> Ops io-duration-715M 12.00 ( 0.00%) 7.00 ( 41.67%)
> Ops io-duration-2385M 118.00 ( 0.00%) 21.00 ( 82.20%)
> Ops io-duration-4055M 162.00 ( 0.00%) 36.00 ( 77.78%)
> Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
> Ops swaptotal-715M 140134.00 ( 0.00%) 18.00 ( 99.99%)
> Ops swaptotal-2385M 392438.00 ( 0.00%) 0.00 ( 0.00%)
> Ops swaptotal-4055M 449037.00 ( 0.00%) 27864.00 ( 93.79%)
> Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
> Ops swapin-715M 0.00 ( 0.00%) 0.00 ( 0.00%)
> Ops swapin-2385M 148031.00 ( 0.00%) 0.00 ( 0.00%)
> Ops swapin-4055M 135109.00 ( 0.00%) 0.00 ( 0.00%)
> Ops minorfaults-0M 1529984.00 ( 0.00%) 1530235.00 ( -0.02%)
> Ops minorfaults-715M 1794168.00 ( 0.00%) 1613750.00 ( 10.06%)
> Ops minorfaults-2385M 1739813.00 ( 0.00%) 1609396.00 ( 7.50%)
> Ops minorfaults-4055M 1754460.00 ( 0.00%) 1614810.00 ( 7.96%)
> Ops majorfaults-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
> Ops majorfaults-715M 185.00 ( 0.00%) 180.00 ( 2.70%)
> Ops majorfaults-2385M 24472.00 ( 0.00%) 101.00 ( 99.59%)
> Ops majorfaults-4055M 22302.00 ( 0.00%) 229.00 ( 98.97%)

I doubt if many people have the context to understand what these
numbers really mean. I don't.

> Note how the vanilla kernels performance collapses when there is enough
> IO taking place in the background. This drop in performance is part of
> what users complain of when they start backups. Note how the swapin and
> major fault figures indicate that processes were being pushed to swap
> prematurely. With the series applied, there is no noticable performance
> drop and while there is still some swap activity, it's tiny.
>
> 3.10.0-rc1 3.10.0-rc1
> vanilla lessdisrupt-v4
> Page Ins 1234608 101892
> Page Outs 12446272 11810468
> Swap Ins 283406 0
> Swap Outs 698469 27882
> Direct pages scanned 0 136480
> Kswapd pages scanned 6266537 5369364
> Kswapd pages reclaimed 1088989 930832
> Direct pages reclaimed 0 120901
> Kswapd efficiency 17% 17%
> Kswapd velocity 5398.371 4635.115
> Direct efficiency 100% 88%
> Direct velocity 0.000 117.817
> Percentage direct scans 0% 2%
> Page writes by reclaim 1655843 4009929
> Page writes file 957374 3982047
> Page writes anon 698469 27882
> Page reclaim immediate 5245 1745
> Page rescued immediate 0 0
> Slabs scanned 33664 25216
> Direct inode steals 0 0
> Kswapd inode steals 19409 778

The reduction in inode steals might be a significant thing?
prune_icache_sb() does invalidate_mapping_pages() and can have the bad
habit of shooting down a vast number of pagecache pages (for a large
file) in a single hit. Did this workload use large (and clean) files?
Did you run any test which would expose this effect?

> ...

2013-05-15 21:39:07

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 7/9] mm: vmscan: Block kswapd if it is encountering pages under writeback

On Mon, 13 May 2013 09:12:38 +0100 Mel Gorman <[email protected]> wrote:

> Historically, kswapd used to congestion_wait() at higher priorities if it
> was not making forward progress. This made no sense as the failure to make
> progress could be completely independent of IO. It was later replaced by
> wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't
> wait on congested zones in balance_pgdat()) as it was duplicating logic
> in shrink_inactive_list().
>
> This is problematic. If kswapd encounters many pages under writeback and
> it continues to scan until it reaches the high watermark then it will
> quickly skip over the pages under writeback and reclaim clean young
> pages or push applications out to swap.
>
> The use of wait_iff_congested() is not suited to kswapd as it will only
> stall if the underlying BDI is really congested or a direct reclaimer was
> unable to write to the underlying BDI. kswapd bypasses the BDI congestion
> as it sets PF_SWAPWRITE but even if this was taken into account then it
> would cause direct reclaimers to stall on writeback which is not desirable.
>
> This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
> encountering too many pages under writeback. If this flag is set and
> kswapd encounters a PageReclaim page under writeback then it'll assume
> that the LRU lists are being recycled too quickly before IO can complete
> and block waiting for some IO to complete.
>
>
> ...
>
> if (PageWriteback(page)) {
> - /*
> - * memcg doesn't have any dirty pages throttling so we
> - * could easily OOM just because too many pages are in
> - * writeback and there is nothing else to reclaim.
> - *
> - * Check __GFP_IO, certainly because a loop driver
> - * thread might enter reclaim, and deadlock if it waits
> - * on a page for which it is needed to do the write
> - * (loop masks off __GFP_IO|__GFP_FS for this reason);
> - * but more thought would probably show more reasons.
> - *
> - * Don't require __GFP_FS, since we're not going into
> - * the FS, just waiting on its writeback completion.
> - * Worryingly, ext4 gfs2 and xfs allocate pages with
> - * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so
> - * testing may_enter_fs here is liable to OOM on them.
> - */
> - if (global_reclaim(sc) ||
> + /* Case 1 above */
> + if (current_is_kswapd() &&
> + PageReclaim(page) &&
> + zone_is_reclaim_writeback(zone)) {
> + wait_on_page_writeback(page);

wait_on_page_writeback() is problematic.

- The page could be against data which is at the remote end of the
disk and the wait takes far too long.

- The page could be against a really slow device, perhaps one which
has a (relatively!) large amount of dirty data pending.

- (What happens if the wait is against a page which is backed by a
device which is failing or was unplugged or is taking 60 seconds per
-EIO or whatever?)

- (Can the wait be against an NFS/NBD/whatever page whose ethernet
cable got unplugged?)

- The termination of wait_on_page_writeback() simply doesn't tell us
what we want to know here: that there has been a useful amount of
writeback completion against the pages on the tail of this LRU.

We really don't care when *this* page's write completes. What we
want to know is whether reclaim can usefully restart polling the LRU.
These are different things, and can sometimes be very different.

These problems were observed in testing and this is why the scanner's
wait_on_page() (and, iirc, wait_on_buffer()) calls were replaced with
congestion_wait() sometime back in the 17th century.

2013-05-16 10:33:54

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/9] Reduce system disruption due to kswapd V4

On Wed, May 15, 2013 at 01:37:48PM -0700, Andrew Morton wrote:
> On Mon, 13 May 2013 09:12:31 +0100 Mel Gorman <[email protected]> wrote:
>
> > This series does not fix all the current known problems with reclaim but
> > it addresses one important swapping bug when there is background IO.
> >
> > ...
> >
> > This was tested using memcached+memcachetest while some background IO
> > was in progress as implemented by the parallel IO tests implement in MM
> > Tests. memcachetest benchmarks how many operations/second memcached can
> > service and it is run multiple times. It starts with no background IO and
> > then re-runs the test with larger amounts of IO in the background to roughly
> > simulate a large copy in progress. The expectation is that the IO should
> > have little or no impact on memcachetest which is running entirely in memory.
> >
> > 3.10.0-rc1 3.10.0-rc1
> > vanilla lessdisrupt-v4
> > Ops memcachetest-0M 22155.00 ( 0.00%) 22180.00 ( 0.11%)
> > Ops memcachetest-715M 22720.00 ( 0.00%) 22355.00 ( -1.61%)
> > Ops memcachetest-2385M 3939.00 ( 0.00%) 23450.00 (495.33%)
> > Ops memcachetest-4055M 3628.00 ( 0.00%) 24341.00 (570.92%)
> > Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
> > Ops io-duration-715M 12.00 ( 0.00%) 7.00 ( 41.67%)
> > Ops io-duration-2385M 118.00 ( 0.00%) 21.00 ( 82.20%)
> > Ops io-duration-4055M 162.00 ( 0.00%) 36.00 ( 77.78%)
> > Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
> > Ops swaptotal-715M 140134.00 ( 0.00%) 18.00 ( 99.99%)
> > Ops swaptotal-2385M 392438.00 ( 0.00%) 0.00 ( 0.00%)
> > Ops swaptotal-4055M 449037.00 ( 0.00%) 27864.00 ( 93.79%)
> > Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
> > Ops swapin-715M 0.00 ( 0.00%) 0.00 ( 0.00%)
> > Ops swapin-2385M 148031.00 ( 0.00%) 0.00 ( 0.00%)
> > Ops swapin-4055M 135109.00 ( 0.00%) 0.00 ( 0.00%)
> > Ops minorfaults-0M 1529984.00 ( 0.00%) 1530235.00 ( -0.02%)
> > Ops minorfaults-715M 1794168.00 ( 0.00%) 1613750.00 ( 10.06%)
> > Ops minorfaults-2385M 1739813.00 ( 0.00%) 1609396.00 ( 7.50%)
> > Ops minorfaults-4055M 1754460.00 ( 0.00%) 1614810.00 ( 7.96%)
> > Ops majorfaults-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
> > Ops majorfaults-715M 185.00 ( 0.00%) 180.00 ( 2.70%)
> > Ops majorfaults-2385M 24472.00 ( 0.00%) 101.00 ( 99.59%)
> > Ops majorfaults-4055M 22302.00 ( 0.00%) 229.00 ( 98.97%)
>
> I doubt if many people have the context to understand what these
> numbers really mean. I don't.
>

I should have stuck in a Sad Face/Happy Face index. You're right though,
there isn't much help explaining the figures here. Do you want to replace
the brief paragraph talking about these figures with the following?

20 iterations of this test were run in total and averaged. Every 5
iterations, additional IO was generated in the background using dd to
measure how the workload was impacted. The 0M, 715M, 2385M and 4055M subblocks
refer to the amount of IO going on in the background at each iteration. So
memcachetest-2385M is reporting how many transactions/second memcachetest
recorded on average over 5 iterations while there was 2385M of IO going
on in the background. There are six blocks of information reported here:

memcachetest is the transactions/second reported by memcachetest. In
the vanilla kernel, note that performance drops from around
22K/sec to just under 4K/sec when there is 2385M of IO going
on in the background. This is one type of performance collapse
users complain about if a large cp or backup starts in the
background.

io-duration refers to how long it takes for the background IO to
complete. It shows that with the patched kernel the IO
completes faster while not interfering with the memcache
workload.

swaptotal is the total amount of swap traffic. With the patched kernel,
the total amount of swapping is much reduced although it is
still not zero.

swapin in this case is an indication as to whether we are swap thrashing.
The closer the swapin/swapout ratio is to 0, the worse the
thrashing is. Note with the patched kernel that there is no swapin
activity indicating that all the pages swapped were really inactive
unused pages.

minorfaults are just minor faults. An increased number of minor faults
can indicate that page reclaim is unmapping the pages but not
swapping them out before they are faulted back in. With the
patched kernel, there is only a small change in minor faults

majorfaults are just major faults in the target workload and a high
number can indicate that a workload is being prematurely
swapped. With the patched kernel, major faults are much reduced. As
there are no swapins recorded, the workload itself is not being swapped.
The likely explanation is that libraries or configuration files used by
the workload during startup get paged out by the background IO.

Overall with the series applied, there is no noticeable performance drop due
to background IO and while there is still some swap activity, it's tiny and
the lack of swapins implies that the swapped pages were inactive and unused.

> > Note how the vanilla kernels performance collapses when there is enough
> > IO taking place in the background. This drop in performance is part of
> > what users complain of when they start backups. Note how the swapin and
> > major fault figures indicate that processes were being pushed to swap
> > prematurely. With the series applied, there is no noticable performance
> > drop and while there is still some swap activity, it's tiny.
> >
> > 3.10.0-rc1 3.10.0-rc1
> > vanilla lessdisrupt-v4
> > Page Ins 1234608 101892
> > Page Outs 12446272 11810468
> > Swap Ins 283406 0
> > Swap Outs 698469 27882
> > Direct pages scanned 0 136480
> > Kswapd pages scanned 6266537 5369364
> > Kswapd pages reclaimed 1088989 930832
> > Direct pages reclaimed 0 120901
> > Kswapd efficiency 17% 17%
> > Kswapd velocity 5398.371 4635.115
> > Direct efficiency 100% 88%
> > Direct velocity 0.000 117.817
> > Percentage direct scans 0% 2%
> > Page writes by reclaim 1655843 4009929
> > Page writes file 957374 3982047
> > Page writes anon 698469 27882
> > Page reclaim immediate 5245 1745
> > Page rescued immediate 0 0
> > Slabs scanned 33664 25216
> > Direct inode steals 0 0
> > Kswapd inode steals 19409 778
>
> The reduction in inode steals might be a significant thing?

It might. It could either be a reflection of kswapd writing fewer swap
pages, reaching the high watermark more quickly and calling shrink_slab()
fewer times overall. This is semi-supported by the reduced slabs scanned
figures.

It could also be a reflection of the IO completing faster. The IO is
generated with dd conv=fdatasync to a single dirty file. If the inode is
getting pruned during the IO then there will be a further delay while the
metadata is re-read from disk. With the series applied, the IO completes
faster, the file gets cleaned sooner and when prune_icache_sb() invalidates
it, it does not get re-read from disk again -- or at least it gets read back
in fewer times. Either way, I do not have a satisfactory solid explanation.
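
For reference, the shrink_slab() pressure per balance_pgdat() pass in this
series comes from kswapd_shrink_zone(); very roughly (a sketch paraphrasing
the hunks quoted earlier in the thread, not the exact code):

	/* kswapd_shrink_zone(): reclaim up to the zone's high watermark */
	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
	shrink_zone(zone, sc);

	/* slab pressure is proportional to the LRU pages scanned this pass */
	reclaim_state->reclaimed_slab = 0;
	nr_slab = shrink_slab(&shrink, sc->nr_scanned, lru_pages);
	sc->nr_reclaimed += reclaim_state->reclaimed_slab;

so reaching the high watermark with fewer pages scanned directly reduces how
hard the inode cache gets shrunk.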

> prune_icache_sb() does invalidate_mapping_pages() and can have the bad
> habit of shooting down a vast number of pagecache pages (for a large
> file) in a single hit. Did this workload use large (and clean) files?
> Did you run any test which would expose this effect?
>

It uses a single large file for writing so how clean it is depends on
the flushers and how long it is before dd calls fdatasync.

I ran with fsmark in single threaded mode for large numbers of 30M files
filling memory, postmark tuned to fill memory and a basic largedd test --
all mixed read/write workloads. The performance was not obviously affected
by the series. The overall number of slabs scanned and inodes reclaimed
varied between the tests. Some reclaimed more, some less. I graphed the
slabs scanned over time and found

postmark - single large spike with the series applied at the start,
otherwise almost identical levels of scanning. inodes reclaimed
from kswapd were slightly higher over time but not by much

largedd - patched series had a few reclaim spikes but again it reclaimed
more overall with broadly similar behaviour to the vanilla
kernel

fsmark - the patched series showed steady slab scanning throughout the
lifetime of the test, unlike the vanilla kernel which had a
single large spike at the start. However, very few inodes were
actually reclaimed; it was scanning activity only and the actual
performance of the benchmark was unchanged.

Overall nothing horrible fell out. I'll run a sysbench test in read-only
mode which would be closer to the workload you have in mind and see what
falls out.

Thanks Andrew.


--
Mel Gorman
SUSE Labs

2013-05-16 13:07:31

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 7/9] mm: vmscan: Block kswapd if it is encountering pages under writeback

On Wed, May 15, 2013 at 02:39:02PM -0700, Andrew Morton wrote:
> On Mon, 13 May 2013 09:12:38 +0100 Mel Gorman <[email protected]> wrote:
>
> > Historically, kswapd used to congestion_wait() at higher priorities if it
> > was not making forward progress. This made no sense as the failure to make
> > progress could be completely independent of IO. It was later replaced by
> > wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't
> > wait on congested zones in balance_pgdat()) as it was duplicating logic
> > in shrink_inactive_list().
> >
> > This is problematic. If kswapd encounters many pages under writeback and
> > it continues to scan until it reaches the high watermark then it will
> > quickly skip over the pages under writeback and reclaim clean young
> > pages or push applications out to swap.
> >
> > The use of wait_iff_congested() is not suited to kswapd as it will only
> > stall if the underlying BDI is really congested or a direct reclaimer was
> > unable to write to the underlying BDI. kswapd bypasses the BDI congestion
> > as it sets PF_SWAPWRITE but even if this was taken into account then it
> > would cause direct reclaimers to stall on writeback which is not desirable.
> >
> > This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
> > encountering too many pages under writeback. If this flag is set and
> > kswapd encounters a PageReclaim page under writeback then it'll assume
> > that the LRU lists are being recycled too quickly before IO can complete
> > and block waiting for some IO to complete.
> >
> >
> > ...
> >
> > if (PageWriteback(page)) {
> > - /*
> > - * memcg doesn't have any dirty pages throttling so we
> > - * could easily OOM just because too many pages are in
> > - * writeback and there is nothing else to reclaim.
> > - *
> > - * Check __GFP_IO, certainly because a loop driver
> > - * thread might enter reclaim, and deadlock if it waits
> > - * on a page for which it is needed to do the write
> > - * (loop masks off __GFP_IO|__GFP_FS for this reason);
> > - * but more thought would probably show more reasons.
> > - *
> > - * Don't require __GFP_FS, since we're not going into
> > - * the FS, just waiting on its writeback completion.
> > - * Worryingly, ext4 gfs2 and xfs allocate pages with
> > - * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so
> > - * testing may_enter_fs here is liable to OOM on them.
> > - */
> > - if (global_reclaim(sc) ||
> > + /* Case 1 above */
> > + if (current_is_kswapd() &&
> > + PageReclaim(page) &&
> > + zone_is_reclaim_writeback(zone)) {
> > + wait_on_page_writeback(page);
>
> wait_on_page_writeback() is problematic.
>
> - The page could be against data which is at the remote end of the
> disk and the wait takes far too long.
>
> - The page could be against a really slow device, perhaps one which
> has a (relatively!) large amount of dirty data pending.
>

These are both similar points: the page being waited upon could take an
abnormal amount of time to be written due to either slow storage or a
deep writeback queue. This is true.

> - (What happens if the wait is against a page which is backed by a
> device which is failing or was unplugged or is taking 60 seconds per
> -EIO or whatever?)
>
> - (Can the wait be against an NFS/NBD/whatever page whose ethernet
> cable got unplugged?)
>

Yes it can and if it happens, kswapd will halt for long periods of time
deferring all reclaim to direct reclaim. The user-visible impact is that
unplugged storage may result in more stalls due to direct reclaim.

The situation gets worse if dirty_ratio amount of pages are backed by
disconnected storage and the storage is unwilling/unable to discard the
data. Eventually such a system will have every dirtying process halt in
balance_dirty_pages. You're correct in pointing out that this patch makes
the situation slightly worse by indirectly adding kswapd to the list of
processes that gets stalled.

> - The termination of wait_on_page_writeback() simply doesn't tell us
> what we want to know here: that there has been a useful amount of
> writeback completion against the pages on the tail of this LRU.
>

Neither wait_iff_congested() nor congestion_wait() tells us that either,
particularly if it waits on the wrong queue, wakes up due to IO completing
on an unrelated backing_dev or wakes up after the timeout with no IO having
completed. Even if the congestion functions wake up due to IO being complete,
there is no guarantee that the completed IO is for pages at the end of the
LRU or even on the same node. As the page was already marked PageReclaim and
is under writeback, there is a reasonable assumption that it has been on the
LRU for some time and that wait_on_page_writeback() is not necessarily the
worst decision. This is what I was taking into account when choosing
wait_on_page_writeback().

However, the unplugged scenario is a good point that would be tricky
to debug and of the choices available, congestion_wait() is better than
wait_iff_congested() in this case. It is guaranteed to stall kswapd and we
*know* at least one page is under writeback so it does not fall foul of the
old situation where we stalled in congestion_wait() when no IO was in flight.

Would you like to replace the patch with this version? It includes a
comment explaining why wait_on_page_writeback() is not used.

---8<---
mm: vmscan: Block kswapd if it is encountering pages under writeback

Historically, kswapd used to congestion_wait() at higher priorities if it
was not making forward progress. This made no sense as the failure to make
progress could be completely independent of IO. It was later replaced by
wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't
wait on congested zones in balance_pgdat()) as it was duplicating logic
in shrink_inactive_list().

This is problematic. If kswapd encounters many pages under writeback and
it continues to scan until it reaches the high watermark then it will
quickly skip over the pages under writeback and reclaim clean young
pages or push applications out to swap.

The use of wait_iff_congested() is not suited to kswapd as it will only
stall if the underlying BDI is really congested or a direct reclaimer was
unable to write to the underlying BDI. kswapd bypasses the BDI congestion
as it sets PF_SWAPWRITE but even if this was taken into account then it
would cause direct reclaimers to stall on writeback which is not desirable.

This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
encountering too many pages under writeback. If this flag is set and
kswapd encounters a PageReclaim page under writeback then it'll assume
that the LRU lists are being recycled too quickly before IO can complete
and block waiting for some IO to complete.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
include/linux/mmzone.h | 8 +++++
mm/vmscan.c | 80 ++++++++++++++++++++++++++++++++++++--------------
2 files changed, 66 insertions(+), 22 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2aaf72f..fce64af 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -499,6 +499,9 @@ typedef enum {
* many dirty file pages at the tail
* of the LRU.
*/
+ ZONE_WRITEBACK, /* reclaim scanning has recently found
+ * many pages under writeback
+ */
} zone_flags_t;

static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -526,6 +529,11 @@ static inline int zone_is_reclaim_dirty(const struct zone *zone)
return test_bit(ZONE_TAIL_LRU_DIRTY, &zone->flags);
}

+static inline int zone_is_reclaim_writeback(const struct zone *zone)
+{
+ return test_bit(ZONE_WRITEBACK, &zone->flags);
+}
+
static inline int zone_is_reclaim_locked(const struct zone *zone)
{
return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d6c916d..45aee36 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -724,25 +724,53 @@ static unsigned long shrink_page_list(struct list_head *page_list,
may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

+ /*
+ * If a page at the tail of the LRU is under writeback, there
+ * are three cases to consider.
+ *
+ * 1) If reclaim is encountering an excessive number of pages
+ * under writeback and this page is both under writeback and
+ * PageReclaim then it indicates that pages are being queued
+ * for IO but are being recycled through the LRU before the
+ * IO can complete. Waiting on the page itself risks an
+ * indefinite stall if it is impossible to writeback the
+ * page due to IO error or disconnected storage so instead
+ * block for HZ/10 or until some IO completes then clear the
+ * ZONE_WRITEBACK flag to recheck if the condition exists.
+ *
+ * 2) Global reclaim encounters a page, memcg encounters a
+ * page that is not marked for immediate reclaim or
+ * the caller does not have __GFP_IO. In this case mark
+ * the page for immediate reclaim and continue scanning.
+ *
+ * __GFP_IO is checked because a loop driver thread might
+ * enter reclaim, and deadlock if it waits on a page for
+ * which it is needed to do the write (loop masks off
+ * __GFP_IO|__GFP_FS for this reason); but more thought
+ * would probably show more reasons.
+ *
+ * Don't require __GFP_FS, since we're not going into the
+ * FS, just waiting on its writeback completion. Worryingly,
+ * ext4 gfs2 and xfs allocate pages with
+ * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing
+ * may_enter_fs here is liable to OOM on them.
+ *
+ * 3) memcg encounters a page that is not already marked
+ * PageReclaim. memcg does not have any dirty pages
+ * throttling so we could easily OOM just because too many
+ * pages are in writeback and there is nothing else to
+ * reclaim. Wait for the writeback to complete.
+ */
if (PageWriteback(page)) {
- /*
- * memcg doesn't have any dirty pages throttling so we
- * could easily OOM just because too many pages are in
- * writeback and there is nothing else to reclaim.
- *
- * Check __GFP_IO, certainly because a loop driver
- * thread might enter reclaim, and deadlock if it waits
- * on a page for which it is needed to do the write
- * (loop masks off __GFP_IO|__GFP_FS for this reason);
- * but more thought would probably show more reasons.
- *
- * Don't require __GFP_FS, since we're not going into
- * the FS, just waiting on its writeback completion.
- * Worryingly, ext4 gfs2 and xfs allocate pages with
- * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so
- * testing may_enter_fs here is liable to OOM on them.
- */
- if (global_reclaim(sc) ||
+ /* Case 1 above */
+ if (current_is_kswapd() &&
+ PageReclaim(page) &&
+ zone_is_reclaim_writeback(zone)) {
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ zone_clear_flag(zone, ZONE_WRITEBACK);
+
+ /* Case 2 above */
+ } else if (global_reclaim(sc) ||
!PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
/*
* This is slightly racy - end_page_writeback()
@@ -757,9 +785,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
SetPageReclaim(page);
nr_writeback++;
+
goto keep_locked;
+
+ /* Case 3 above */
+ } else {
+ wait_on_page_writeback(page);
}
- wait_on_page_writeback(page);
}

if (!force_reclaim)
@@ -1374,8 +1406,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* isolated page is PageWriteback
*/
if (nr_writeback && nr_writeback >=
- (nr_taken >> (DEF_PRIORITY - sc->priority)))
+ (nr_taken >> (DEF_PRIORITY - sc->priority))) {
wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+ zone_set_flag(zone, ZONE_WRITEBACK);
+ }

/*
* Similarly, if many dirty pages are encountered that are not
@@ -2669,8 +2703,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
* the high watermark.
*
* Returns true if kswapd scanned at least the requested number of pages to
- * reclaim. This is used to determine if the scanning priority needs to be
- * raised.
+ * reclaim or if the lack of progress was due to pages under writeback.
+ * This is used to determine if the scanning priority needs to be raised.
*/
static bool kswapd_shrink_zone(struct zone *zone,
struct scan_control *sc,
@@ -2697,6 +2731,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
if (nr_slab == 0 && !zone_reclaimable(zone))
zone->all_unreclaimable = 1;

+ zone_clear_flag(zone, ZONE_WRITEBACK);
+
return sc->nr_scanned >= sc->nr_to_reclaim;
}

2013-05-16 13:54:32

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/9] Reduce system disruption due to kswapd V4

On Thu 16-05-13 11:33:45, Mel Gorman wrote:
[...]
> > swapin in this case is an indication as to whether we are swap thrashing.
> The closer the swapin/swapout ratio is to 0, the worse the

I guess you meant the ratio is closer to 1 not zero.
--
Michal Hocko
SUSE Labs

2013-05-16 14:11:59

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/9] Reduce system disruption due to kswapd V4

On Thu, May 16, 2013 at 03:54:28PM +0200, Michal Hocko wrote:
> On Thu 16-05-13 11:33:45, Mel Gorman wrote:
> [...]
> > swapin in this case is an indication as to whether we are swap thrashing.
> > The closer the swapin/swapout ratio is to 0, the worse the
>
> I guess you meant the ratio is closer to 1 not zero.

Damnit, yes! I was even thinking 1 at the time I was typing. It's not
like the keys are even near each other.

--
Mel Gorman
SUSE Labs

2013-05-17 03:41:58

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/9] mm: vmscan: Obey proportional scanning requirements for kswapd

(2013/05/13 17:12), Mel Gorman wrote:
> Simplistically, the anon and file LRU lists are scanned proportionally
> depending on the value of vm.swappiness although there are other factors
> taken into account by get_scan_count(). The patch "mm: vmscan: Limit
> the number of pages kswapd reclaims" limits the number of pages kswapd
> reclaims but it breaks this proportional scanning and may evenly shrink
> anon/file LRUs regardless of vm.swappiness.
>
> This patch preserves the proportional scanning and reclaim. It does mean
> that kswapd will reclaim more than requested but the number of pages will
> be related to the high watermark.
>
> [[email protected]: Correct proportional reclaim for memcg and simplify]
> [[email protected]: Recalculate scan based on target]
> [[email protected]: Account for already scanned pages properly]
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: Rik van Riel <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2013-05-18 21:15:38

by Zlatko Calusic

[permalink] [raw]
Subject: Re: [PATCH 0/9] Reduce system disruption due to kswapd V4

On 15.05.2013 22:37, Andrew Morton wrote:
>>
>> 3.10.0-rc1 3.10.0-rc1
>> vanilla lessdisrupt-v4
>> Page Ins 1234608 101892
>> Page Outs 12446272 11810468
>> Swap Ins 283406 0
>> Swap Outs 698469 27882
>> Direct pages scanned 0 136480
>> Kswapd pages scanned 6266537 5369364
>> Kswapd pages reclaimed 1088989 930832
>> Direct pages reclaimed 0 120901
>> Kswapd efficiency 17% 17%
>> Kswapd velocity 5398.371 4635.115
>> Direct efficiency 100% 88%
>> Direct velocity 0.000 117.817
>> Percentage direct scans 0% 2%
>> Page writes by reclaim 1655843 4009929
>> Page writes file 957374 3982047
>> Page writes anon 698469 27882
>> Page reclaim immediate 5245 1745
>> Page rescued immediate 0 0
>> Slabs scanned 33664 25216
>> Direct inode steals 0 0
>> Kswapd inode steals 19409 778
>
> The reduction in inode steals might be a significant thing?
> prune_icache_sb() does invalidate_mapping_pages() and can have the bad
> habit of shooting down a vast number of pagecache pages (for a large
> file) in a single hit. Did this workload use large (and clean) files?
> Did you run any test which would expose this effect?
>

I did not run specific tests, but I believe I observed exactly this
issue on a real workload, where even at a moderate load sudden frees
of pagecache happen quite often. I've attached a small graph where it
can be easily seen. The snapshot was taken while the server was running
an unpatched Linus kernel. After Mel's patch series is applied, I
can't see anything similar. So it seems that this issue is completely
gone; Mel's done a wonderful job.

And BTW, V4 continues to be rock stable, running here on many different
machines, so I look forward to seeing this code merged in 3.11.
--
Zlatko


Attachments:
memory-hourly.png (11.67 kB)

2013-05-21 23:14:30

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/9] Reduce system disruption due to kswapd V4

On Mon, May 13, 2013 at 09:12:31AM +0100, Mel Gorman wrote:
> This series does not fix all the current known problems with reclaim but
> it addresses one important swapping bug when there is background IO.

....
>
> 3.10.0-rc1 3.10.0-rc1
> vanilla lessdisrupt-v4
> Page Ins 1234608 101892
> Page Outs 12446272 11810468
> Swap Ins 283406 0
> Swap Outs 698469 27882
> Direct pages scanned 0 136480
> Kswapd pages scanned 6266537 5369364
> Kswapd pages reclaimed 1088989 930832
> Direct pages reclaimed 0 120901
> Kswapd efficiency 17% 17%
> Kswapd velocity 5398.371 4635.115
> Direct efficiency 100% 88%
> Direct velocity 0.000 117.817
> Percentage direct scans 0% 2%
> Page writes by reclaim 1655843 4009929
> Page writes file 957374 3982047

Lots more file pages are written by reclaim. Is this from kswapd
or direct reclaim? If it's direct reclaim, what happens when you run
on a filesystem that doesn't allow writeback from direct reclaim?

Also, what does this do to IO patterns and allocation? This tends
to indicate that the background flusher thread is not doing the
writeback work fast enough when memory is low - can you comment on
this at all, Mel?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-05-22 08:49:14

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/9] Reduce system disruption due to kswapd V4

On Wed, May 22, 2013 at 09:13:58AM +1000, Dave Chinner wrote:
> On Mon, May 13, 2013 at 09:12:31AM +0100, Mel Gorman wrote:
> > This series does not fix all the current known problems with reclaim but
> > it addresses one important swapping bug when there is background IO.
>
> ....
> >
> > 3.10.0-rc1 3.10.0-rc1
> > vanilla lessdisrupt-v4
> > Page Ins 1234608 101892
> > Page Outs 12446272 11810468
> > Swap Ins 283406 0
> > Swap Outs 698469 27882
> > Direct pages scanned 0 136480
> > Kswapd pages scanned 6266537 5369364
> > Kswapd pages reclaimed 1088989 930832
> > Direct pages reclaimed 0 120901
> > Kswapd efficiency 17% 17%
> > Kswapd velocity 5398.371 4635.115
> > Direct efficiency 100% 88%
> > Direct velocity 0.000 117.817
> > Percentage direct scans 0% 2%
> > Page writes by reclaim 1655843 4009929
> > Page writes file 957374 3982047
>
> Lots more file pages are written by reclaim. Is this from kswapd
> or direct reclaim? If it's direct reclaim, what happens when you run
> on a filesystem that doesn't allow writeback from direct reclaim?
>

It's from kswapd. There is a check in shrink_page_list() that prevents direct
reclaim from writing pages out, for exactly the reason that some filesystems
ignore writeback requests from that context.
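
For reference, after patch 6 that gate looks roughly like this (a sketch
paraphrasing the hunk quoted earlier, not the exact code):

	/* shrink_page_list(): only kswapd, and only on a zone flagged
	 * ZONE_TAIL_LRU_DIRTY, may write a dirty file page itself */
	if (PageDirty(page)) {
		if (page_is_file_cache(page) &&
		    (!current_is_kswapd() || !zone_is_reclaim_dirty(zone))) {
			/* leave it to the flushers; mark the page so it is
			 * reclaimed immediately once writeback completes */
			SetPageReclaim(page);
			goto keep_locked;
		}
		/* otherwise kswapd falls through and may call pageout() */
	}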

> Also, what does this do to IO patterns and allocation? This tends
> to indicate that the background flusher thread is not doing the
> writeback work fast enough when memory is low - can you comment on
> this at all, Mel?
>

There are two aspects to it. As processes are no longer being pushed
to swap but kswapd is still reclaiming a similar number of pages, it is
scanning through the file LRUs faster before flushers have a chance to
flush pages. kswapd starts writing pages if the zone gets marked "reclaim
dirty" which happens if enough dirty pages are encountered at the end of
the LRU that are !PageWriteback. If this flag is set too early then more
writes from kswapd context occur -- I'll look into it.
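
The flagging check in question is the one added in patch 6 (quoted earlier in
the thread); roughly:

	/* shrink_inactive_list(): flag the zone when (nearly) every page
	 * isolated in this batch was dirty but not yet queued for IO. The
	 * threshold is nr_taken at DEF_PRIORITY and halves with each
	 * priority drop, so it becomes easier to satisfy under pressure. */
	if (global_reclaim(sc) && nr_dirty &&
	    nr_dirty >= (nr_taken >> (DEF_PRIORITY - sc->priority)))
		zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY);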

On a related note, I've found with Jan Kara that the PageWriteback check
does not work in all cases. Some filesystems will have buffer pages that
are PageDirty with all clean buffers, or with buffers locked for IO that are
!PageWriteback, which also confuses the point at which "reclaim dirty" gets
set. The patches to deal with this are still a work in progress.

--
Mel Gorman
SUSE Labs