Changes since v1:
- Added acks.
- Dropped the "mm/vmscan: replace mm_vmscan_lru_shrink_inactive with shrink_page_list tracepoint"
patch. It's better to avoid changing the tracepoint since some people may be used to it.
Removing the 'nr_scanned' and 'file' arguments is also undesirable. Yes, these numbers could
be obtained from the mm_vmscan_lru_isolate tracepoint, but it's easier when it's all in one place.
- Compare nr_writeback, nr_dirty, etc. only against isolated file pages, as it always was.
- Minor changelog tweaks.
Andrey Ryabinin (4):
mm/vmscan: Update stale comments
mm/vmscan: remove redundant current_may_throttle() check
mm/vmscan: Don't change pgdat state on base of a single LRU list state.
mm/vmscan: Don't mess with pgdat->flags in memcg reclaim.
include/linux/backing-dev.h | 2 +-
include/linux/memcontrol.h | 2 +
mm/backing-dev.c | 19 ++---
mm/vmscan.c | 166 ++++++++++++++++++++++++++++++--------------
4 files changed, 122 insertions(+), 67 deletions(-)
--
2.16.1
memcg reclaim may alter pgdat->flags based on the state of LRU lists
in a cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep
in congestion_wait(), PGDAT_DIRTY may force kswapd to writeback filesystem
pages. But the worst here is PGDAT_CONGESTED, since it may force all
direct reclaims to stall in wait_iff_congested(). Note that only kswapd
has the power to clear any of these bits. This might just never happen if
cgroup limits are configured that way. So all direct reclaims will stall
as long as we have some congested bdi in the system.
Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
pgdat, and only kswapd can clear pgdat->flags once the node is balanced,
thus it's reasonable to leave all decisions about node state to kswapd.
Moving pgdat->flags manipulation to kswapd means that cgroup2 reclaim
now loses its congestion throttling mechanism. Add per-cgroup congestion
state and throttle cgroup2 reclaimers if the memcg is in a congested state.
Currently there is no need for per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
bits since they alter only kswapd behavior.
The problem could be easily demonstrated by creating heavy congestion
in one cgroup:
echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/congester
echo 512M > /sys/fs/cgroup/congester/memory.max
echo $$ > /sys/fs/cgroup/congester/cgroup.procs
/* generate a lot of dirty data on slow HDD */
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
....
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
and some job in another cgroup:
mkdir /sys/fs/cgroup/victim
echo 128M > /sys/fs/cgroup/victim/memory.max
# time cat /dev/sda > /dev/null
real 10m15.054s
user 0m0.487s
sys 1m8.505s
According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
of the time sleeping there.
With the patch, 'cat' doesn't waste time anymore:
# time cat /dev/sda > /dev/null
real 5m32.911s
user 0m0.411s
sys 0m56.664s
Signed-off-by: Andrey Ryabinin <[email protected]>
---
include/linux/backing-dev.h | 2 +-
include/linux/memcontrol.h | 2 ++
mm/backing-dev.c | 19 ++++------
mm/vmscan.c | 86 ++++++++++++++++++++++++++++++++-------------
4 files changed, 71 insertions(+), 38 deletions(-)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index f8894dbc0b19..539a5cf94fe2 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -175,7 +175,7 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
}
long congestion_wait(int sync, long timeout);
-long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
+long wait_iff_congested(int sync, long timeout);
static inline bool bdi_cap_synchronous_io(struct backing_dev_info *bdi)
{
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4525b4404a9e..44422e1d3def 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -190,6 +190,8 @@ struct mem_cgroup {
/* vmpressure notifications */
struct vmpressure vmpressure;
+ unsigned long flags;
+
/*
* Should the accounting and control be hierarchical, per subtree?
*/
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 2eba1f54b1d3..2fc3f38e4c4f 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1055,23 +1055,18 @@ EXPORT_SYMBOL(congestion_wait);
/**
* wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
- * @pgdat: A pgdat to check if it is heavily congested
* @sync: SYNC or ASYNC IO
* @timeout: timeout in jiffies
*
- * In the event of a congested backing_dev (any backing_dev) and the given
- * @pgdat has experienced recent congestion, this waits for up to @timeout
- * jiffies for either a BDI to exit congestion of the given @sync queue
- * or a write to complete.
- *
- * In the absence of pgdat congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
+ * In the event of a congested backing_dev (any backing_dev) this waits
+ * for up to @timeout jiffies for either a BDI to exit congestion of the
+ * given @sync queue or a write to complete.
*
* The return value is 0 if the sleep is for the full timeout. Otherwise,
* it is the number of jiffies that were still remaining when the function
* returned. return_value == timeout implies the function did not sleep.
*/
-long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
+long wait_iff_congested(int sync, long timeout)
{
long ret;
unsigned long start = jiffies;
@@ -1079,12 +1074,10 @@ long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
wait_queue_head_t *wqh = &congestion_wqh[sync];
/*
- * If there is no congestion, or heavy congestion is not being
- * encountered in the current pgdat, yield if necessary instead
+ * If there is no congestion, yield if necessary instead
* of sleeping on the congestion queue
*/
- if (atomic_read(&nr_wb_congested[sync]) == 0 ||
- !test_bit(PGDAT_CONGESTED, &pgdat->flags)) {
+ if (atomic_read(&nr_wb_congested[sync]) == 0) {
cond_resched();
/* In case we scheduled, work out time remaining */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2134b3ac8fa0..1e6e047e10fd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -199,6 +199,18 @@ static bool sane_reclaim(struct scan_control *sc)
#endif
return false;
}
+
+static void set_memcg_bit(enum pgdat_flags flag,
+ struct mem_cgroup *memcg)
+{
+ set_bit(flag, &memcg->flags);
+}
+
+static int test_memcg_bit(enum pgdat_flags flag,
+ struct mem_cgroup *memcg)
+{
+ return test_bit(flag, &memcg->flags);
+}
#else
static bool global_reclaim(struct scan_control *sc)
{
@@ -209,6 +221,17 @@ static bool sane_reclaim(struct scan_control *sc)
{
return true;
}
+
+static inline void set_memcg_bit(enum pgdat_flags flag,
+ struct mem_cgroup *memcg)
+{
+}
+
+static inline int test_memcg_bit(enum pgdat_flags flag,
+ struct mem_cgroup *memcg)
+{
+ return 0;
+}
#endif
/*
@@ -2472,6 +2495,12 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
return true;
}
+static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
+{
+ return test_bit(PGDAT_CONGESTED, &pgdat->flags) ||
+ (memcg && test_memcg_bit(PGDAT_CONGESTED, memcg));
+}
+
static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
{
struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -2554,29 +2583,28 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
if (sc->nr_reclaimed - nr_reclaimed)
reclaimable = true;
- /*
- * If reclaim is isolating dirty pages under writeback, it
- * implies that the long-lived page allocation rate is exceeding
- * the page laundering rate. Either the global limits are not
- * being effective at throttling processes due to the page
- * distribution throughout zones or there is heavy usage of a
- * slow backing device. The only option is to throttle from
- * reclaim context which is not ideal as there is no guarantee
- * the dirtying process is throttled in the same way
- * balance_dirty_pages() manages.
- *
- * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the
- * number of pages under pages flagged for immediate reclaim and
- * stall if any are encountered in the nr_immediate check below.
- */
- if (sc->nr.writeback && sc->nr.writeback == sc->nr.file_taken)
- set_bit(PGDAT_WRITEBACK, &pgdat->flags);
+ if (current_is_kswapd()) {
+ /*
+ * If reclaim is isolating dirty pages under writeback,
+ * it implies that the long-lived page allocation rate
+ * is exceeding the page laundering rate. Either the
+ * global limits are not being effective at throttling
+ * processes due to the page distribution throughout
+ * zones or there is heavy usage of a slow backing
+ * device. The only option is to throttle from reclaim
+ * context which is not ideal as there is no guarantee
+ * the dirtying process is throttled in the same way
+ * balance_dirty_pages() manages.
+ *
+ * Once a node is flagged PGDAT_WRITEBACK, kswapd will
+ * count the number of pages under pages flagged for
+ * immediate reclaim and stall if any are encountered
+ * in the nr_immediate check below.
+ */
+ if (sc->nr.writeback &&
+ sc->nr.writeback == sc->nr.file_taken)
+ set_bit(PGDAT_WRITEBACK, &pgdat->flags);
- /*
- * Legacy memcg will stall in page writeback so avoid forcibly
- * stalling here.
- */
- if (sane_reclaim(sc)) {
/*
* Tag a node as congested if all the dirty pages
* scanned were backed by a congested BDI and
@@ -2599,6 +2627,14 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
congestion_wait(BLK_RW_ASYNC, HZ/10);
}
+ /*
+ * Legacy memcg will stall in page writeback so avoid forcibly
+ * stalling in wait_iff_congested().
+ */
+ if (!global_reclaim(sc) && sane_reclaim(sc) &&
+ sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
+ set_memcg_bit(PGDAT_CONGESTED, root);
+
/*
* Stall direct reclaim for IO completions if underlying BDIs
* and node is congested. Allow kswapd to continue until it
@@ -2606,8 +2642,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
* the LRU too quickly.
*/
if (!sc->hibernation_mode && !current_is_kswapd() &&
- current_may_throttle())
- wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
+ current_may_throttle() && pgdat_memcg_congested(pgdat, root))
+ wait_iff_congested(BLK_RW_ASYNC, HZ/10);
} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
sc->nr_scanned - nr_scanned, sc));
@@ -3047,6 +3083,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
* the priority and make it zero.
*/
shrink_node_memcg(pgdat, memcg, &sc, &lru_pages);
+ clear_bit(PGDAT_CONGESTED, &memcg->flags);
trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
@@ -3092,6 +3129,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
noreclaim_flag = memalloc_noreclaim_save();
nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
memalloc_noreclaim_restore(noreclaim_flag);
+ clear_bit(PGDAT_CONGESTED, &memcg->flags);
trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
--
2.16.1
Only kswapd can have a non-zero nr_immediate, and current_may_throttle() is
always true for kswapd (the PF_LESS_THROTTLE bit is never set), thus it's
enough to check stat.nr_immediate alone.
Signed-off-by: Andrey Ryabinin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6d74b12099bd..403f59edd53e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1807,7 +1807,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* that pages are cycling through the LRU faster than
* they are written so also forcibly stall.
*/
- if (stat.nr_immediate && current_may_throttle())
+ if (stat.nr_immediate)
congestion_wait(BLK_RW_ASYNC, HZ/10);
}
--
2.16.1
We have a separate LRU list for each memory cgroup. Memory reclaim iterates
over cgroups and calls shrink_inactive_list() on every inactive LRU list.
Based on the state of a single LRU, shrink_inactive_list() may flag
the whole node as dirty, congested or under writeback. This is obviously
wrong and hurtful. It's especially hurtful when we have a possibly
small congested cgroup in the system. Then *all* direct reclaims waste time
by sleeping in wait_iff_congested(). And the more memcgs the system has,
the longer the memory allocation stall is, because
wait_iff_congested() is called on each lru-list scan.
Sum reclaim stats across all visited LRUs on the node and flag the node as
dirty, congested or under writeback based on that sum. Also call
congestion_wait() and wait_iff_congested() once per pgdat scan, instead of
once per lru-list scan.
This only fixes the problem for the global reclaim case. Per-cgroup reclaim
may alter global pgdat flags too, which is wrong. But that is a separate
issue and will be addressed in the next patch.
This change will not have any effect on systems with all workload
concentrated in a single cgroup.
Signed-off-by: Andrey Ryabinin <[email protected]>
---
mm/vmscan.c | 124 +++++++++++++++++++++++++++++++++++-------------------------
1 file changed, 73 insertions(+), 51 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 403f59edd53e..2134b3ac8fa0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -116,6 +116,15 @@ struct scan_control {
/* Number of pages freed so far during a call to shrink_zones() */
unsigned long nr_reclaimed;
+
+ struct {
+ unsigned int dirty;
+ unsigned int unqueued_dirty;
+ unsigned int congested;
+ unsigned int writeback;
+ unsigned int immediate;
+ unsigned int file_taken;
+ } nr;
};
#ifdef ARCH_HAS_PREFETCH
@@ -1754,23 +1763,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
mem_cgroup_uncharge_list(&page_list);
free_unref_page_list(&page_list);
- /*
- * If reclaim is isolating dirty pages under writeback, it implies
- * that the long-lived page allocation rate is exceeding the page
- * laundering rate. Either the global limits are not being effective
- * at throttling processes due to the page distribution throughout
- * zones or there is heavy usage of a slow backing device. The
- * only option is to throttle from reclaim context which is not ideal
- * as there is no guarantee the dirtying process is throttled in the
- * same way balance_dirty_pages() manages.
- *
- * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the number
- * of pages under pages flagged for immediate reclaim and stall if any
- * are encountered in the nr_immediate check below.
- */
- if (stat.nr_writeback && stat.nr_writeback == nr_taken)
- set_bit(PGDAT_WRITEBACK, &pgdat->flags);
-
/*
* If dirty pages are scanned that are not queued for IO, it
* implies that flushers are not doing their job. This can
@@ -1785,40 +1777,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
if (stat.nr_unqueued_dirty == nr_taken)
wakeup_flusher_threads(WB_REASON_VMSCAN);
- /*
- * Legacy memcg will stall in page writeback so avoid forcibly
- * stalling here.
- */
- if (sane_reclaim(sc)) {
- /*
- * Tag a node as congested if all the dirty pages scanned were
- * backed by a congested BDI and wait_iff_congested will stall.
- */
- if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
- set_bit(PGDAT_CONGESTED, &pgdat->flags);
-
- /* Allow kswapd to start writing pages during reclaim. */
- if (stat.nr_unqueued_dirty == nr_taken)
- set_bit(PGDAT_DIRTY, &pgdat->flags);
-
- /*
- * If kswapd scans pages marked marked for immediate
- * reclaim and under writeback (nr_immediate), it implies
- * that pages are cycling through the LRU faster than
- * they are written so also forcibly stall.
- */
- if (stat.nr_immediate)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
- }
-
- /*
- * Stall direct reclaim for IO completions if underlying BDIs and node
- * is congested. Allow kswapd to continue until it starts encountering
- * unqueued dirty pages or cycling through the LRU too quickly.
- */
- if (!sc->hibernation_mode && !current_is_kswapd() &&
- current_may_throttle())
- wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
+ sc->nr.dirty += stat.nr_dirty;
+ sc->nr.congested += stat.nr_congested;
+ sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
+ sc->nr.writeback += stat.nr_writeback;
+ sc->nr.immediate += stat.nr_immediate;
+ if (file)
+ sc->nr.file_taken += nr_taken;
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed,
@@ -2522,6 +2487,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
unsigned long node_lru_pages = 0;
struct mem_cgroup *memcg;
+ memset(&sc->nr, 0, sizeof(sc->nr));
+
nr_reclaimed = sc->nr_reclaimed;
nr_scanned = sc->nr_scanned;
@@ -2587,6 +2554,61 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
if (sc->nr_reclaimed - nr_reclaimed)
reclaimable = true;
+ /*
+ * If reclaim is isolating dirty pages under writeback, it
+ * implies that the long-lived page allocation rate is exceeding
+ * the page laundering rate. Either the global limits are not
+ * being effective at throttling processes due to the page
+ * distribution throughout zones or there is heavy usage of a
+ * slow backing device. The only option is to throttle from
+ * reclaim context which is not ideal as there is no guarantee
+ * the dirtying process is throttled in the same way
+ * balance_dirty_pages() manages.
+ *
+ * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the
+ * number of pages under pages flagged for immediate reclaim and
+ * stall if any are encountered in the nr_immediate check below.
+ */
+ if (sc->nr.writeback && sc->nr.writeback == sc->nr.file_taken)
+ set_bit(PGDAT_WRITEBACK, &pgdat->flags);
+
+ /*
+ * Legacy memcg will stall in page writeback so avoid forcibly
+ * stalling here.
+ */
+ if (sane_reclaim(sc)) {
+ /*
+ * Tag a node as congested if all the dirty pages
+ * scanned were backed by a congested BDI and
+ * wait_iff_congested will stall.
+ */
+ if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
+ set_bit(PGDAT_CONGESTED, &pgdat->flags);
+
+ /* Allow kswapd to start writing pages during reclaim.*/
+ if (sc->nr.unqueued_dirty == sc->nr.file_taken)
+ set_bit(PGDAT_DIRTY, &pgdat->flags);
+
+ /*
+ * If kswapd scans pages marked marked for immediate
+ * reclaim and under writeback (nr_immediate), it
+ * implies that pages are cycling through the LRU
+ * faster than they are written so also forcibly stall.
+ */
+ if (sc->nr.immediate)
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ }
+
+ /*
+ * Stall direct reclaim for IO completions if underlying BDIs
+ * and node is congested. Allow kswapd to continue until it
+ * starts encountering unqueued dirty pages or cycling through
+ * the LRU too quickly.
+ */
+ if (!sc->hibernation_mode && !current_is_kswapd() &&
+ current_may_throttle())
+ wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
+
} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
sc->nr_scanned - nr_scanned, sc));
--
2.16.1
Update some comments that became stale since the transition from per-zone
to per-node reclaim.
Signed-off-by: Andrey Ryabinin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/vmscan.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4390a8d5be41..6d74b12099bd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -926,7 +926,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
/*
- * The number of dirty pages determines if a zone is marked
+ * The number of dirty pages determines if a node is marked
* reclaim_congested which affects wait_iff_congested. kswapd
* will stall and start writing pages if the tail of the LRU
* is all dirty unqueued pages.
@@ -1764,7 +1764,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* as there is no guarantee the dirtying process is throttled in the
* same way balance_dirty_pages() manages.
*
- * Once a zone is flagged ZONE_WRITEBACK, kswapd will count the number
+ * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the number
* of pages under pages flagged for immediate reclaim and stall if any
* are encountered in the nr_immediate check below.
*/
@@ -1791,7 +1791,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
*/
if (sane_reclaim(sc)) {
/*
- * Tag a zone as congested if all the dirty pages scanned were
+ * Tag a node as congested if all the dirty pages scanned were
* backed by a congested BDI and wait_iff_congested will stall.
*/
if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
@@ -1812,7 +1812,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
}
/*
- * Stall direct reclaim for IO completions if underlying BDIs or zone
+ * Stall direct reclaim for IO completions if underlying BDIs and node
* is congested. Allow kswapd to continue until it starts encountering
* unqueued dirty pages or cycling through the LRU too quickly.
*/
@@ -3808,7 +3808,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
/*
- * Free memory by calling shrink zone with increasing
+ * Free memory by calling shrink node with increasing
* priorities until we have enough memory freed.
*/
do {
--
2.16.1
On Fri, 23 Mar 2018 18:20:29 +0300 Andrey Ryabinin <[email protected]> wrote:
> memcg reclaim may alter pgdat->flags based on the state of LRU lists
> in a cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep
> in congestion_wait(), PGDAT_DIRTY may force kswapd to writeback filesystem
> pages. But the worst here is PGDAT_CONGESTED, since it may force all
> direct reclaims to stall in wait_iff_congested(). Note that only kswapd
> has the power to clear any of these bits. This might just never happen if
> cgroup limits are configured that way. So all direct reclaims will stall
> as long as we have some congested bdi in the system.
>
> Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
> pgdat, and only kswapd can clear pgdat->flags once the node is balanced,
> thus it's reasonable to leave all decisions about node state to kswapd.
>
> Moving pgdat->flags manipulation to kswapd means that cgroup2 reclaim
> now loses its congestion throttling mechanism. Add per-cgroup congestion
> state and throttle cgroup2 reclaimers if the memcg is in a congested state.
>
> Currently there is no need for per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
> bits since they alter only kswapd behavior.
>
> The problem could be easily demonstrated by creating heavy congestion
> in one cgroup:
>
> echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> mkdir -p /sys/fs/cgroup/congester
> echo 512M > /sys/fs/cgroup/congester/memory.max
> echo $$ > /sys/fs/cgroup/congester/cgroup.procs
> /* generate a lot of dirty data on slow HDD */
> while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
> ....
> while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
>
> and some job in another cgroup:
>
> mkdir /sys/fs/cgroup/victim
> echo 128M > /sys/fs/cgroup/victim/memory.max
>
> # time cat /dev/sda > /dev/null
> real 10m15.054s
> user 0m0.487s
> sys 1m8.505s
>
> According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
> of the time sleeping there.
>
> With the patch, 'cat' doesn't waste time anymore:
>
> # time cat /dev/sda > /dev/null
> real 5m32.911s
> user 0m0.411s
> sys 0m56.664s
>
Reviewers, please?
On Fri, 23 Mar 2018 18:20:28 +0300 Andrey Ryabinin <[email protected]> wrote:
> We have a separate LRU list for each memory cgroup. Memory reclaim iterates
> over cgroups and calls shrink_inactive_list() on every inactive LRU list.
> Based on the state of a single LRU, shrink_inactive_list() may flag
> the whole node as dirty, congested or under writeback. This is obviously
> wrong and hurtful. It's especially hurtful when we have a possibly
> small congested cgroup in the system. Then *all* direct reclaims waste time
> by sleeping in wait_iff_congested(). And the more memcgs the system has,
> the longer the memory allocation stall is, because
> wait_iff_congested() is called on each lru-list scan.
>
> Sum reclaim stats across all visited LRUs on the node and flag the node as
> dirty, congested or under writeback based on that sum. Also call
> congestion_wait() and wait_iff_congested() once per pgdat scan, instead of
> once per lru-list scan.
>
> This only fixes the problem for the global reclaim case. Per-cgroup reclaim
> may alter global pgdat flags too, which is wrong. But that is a separate
> issue and will be addressed in the next patch.
>
> This change will not have any effect on systems with all workload
> concentrated in a single cgroup.
>
Could we please get this reviewed?
On Fri, Mar 23, 2018 at 8:20 AM, Andrey Ryabinin
<[email protected]> wrote:
> We have a separate LRU list for each memory cgroup. Memory reclaim iterates
> over cgroups and calls shrink_inactive_list() on every inactive LRU list.
> Based on the state of a single LRU, shrink_inactive_list() may flag
> the whole node as dirty, congested or under writeback. This is obviously
> wrong and hurtful. It's especially hurtful when we have a possibly
> small congested cgroup in the system. Then *all* direct reclaims waste time
> by sleeping in wait_iff_congested(). And the more memcgs the system has,
> the longer the memory allocation stall is, because
> wait_iff_congested() is called on each lru-list scan.
>
> Sum reclaim stats across all visited LRUs on the node and flag the node as
> dirty, congested or under writeback based on that sum. Also call
> congestion_wait() and wait_iff_congested() once per pgdat scan, instead of
> once per lru-list scan.
>
> This only fixes the problem for the global reclaim case. Per-cgroup reclaim
> may alter global pgdat flags too, which is wrong. But that is a separate
> issue and will be addressed in the next patch.
>
> This change will not have any effect on systems with all workload
> concentrated in a single cgroup.
>
> Signed-off-by: Andrey Ryabinin <[email protected]>
Seems reasonable.
Reviewed-by: Shakeel Butt <[email protected]>
> ---
> mm/vmscan.c | 124 +++++++++++++++++++++++++++++++++++-------------------------
> 1 file changed, 73 insertions(+), 51 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 403f59edd53e..2134b3ac8fa0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -116,6 +116,15 @@ struct scan_control {
>
> /* Number of pages freed so far during a call to shrink_zones() */
> unsigned long nr_reclaimed;
> +
> + struct {
> + unsigned int dirty;
> + unsigned int unqueued_dirty;
> + unsigned int congested;
> + unsigned int writeback;
> + unsigned int immediate;
> + unsigned int file_taken;
> + } nr;
> };
>
> #ifdef ARCH_HAS_PREFETCH
> @@ -1754,23 +1763,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> mem_cgroup_uncharge_list(&page_list);
> free_unref_page_list(&page_list);
>
> - /*
> - * If reclaim is isolating dirty pages under writeback, it implies
> - * that the long-lived page allocation rate is exceeding the page
> - * laundering rate. Either the global limits are not being effective
> - * at throttling processes due to the page distribution throughout
> - * zones or there is heavy usage of a slow backing device. The
> - * only option is to throttle from reclaim context which is not ideal
> - * as there is no guarantee the dirtying process is throttled in the
> - * same way balance_dirty_pages() manages.
> - *
> - * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the number
> - * of pages under pages flagged for immediate reclaim and stall if any
> - * are encountered in the nr_immediate check below.
> - */
> - if (stat.nr_writeback && stat.nr_writeback == nr_taken)
> - set_bit(PGDAT_WRITEBACK, &pgdat->flags);
> -
> /*
> * If dirty pages are scanned that are not queued for IO, it
> * implies that flushers are not doing their job. This can
> @@ -1785,40 +1777,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> if (stat.nr_unqueued_dirty == nr_taken)
> wakeup_flusher_threads(WB_REASON_VMSCAN);
>
> - /*
> - * Legacy memcg will stall in page writeback so avoid forcibly
> - * stalling here.
> - */
> - if (sane_reclaim(sc)) {
> - /*
> - * Tag a node as congested if all the dirty pages scanned were
> - * backed by a congested BDI and wait_iff_congested will stall.
> - */
> - if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
> - set_bit(PGDAT_CONGESTED, &pgdat->flags);
> -
> - /* Allow kswapd to start writing pages during reclaim. */
> - if (stat.nr_unqueued_dirty == nr_taken)
> - set_bit(PGDAT_DIRTY, &pgdat->flags);
> -
> - /*
> - * If kswapd scans pages marked marked for immediate
> - * reclaim and under writeback (nr_immediate), it implies
> - * that pages are cycling through the LRU faster than
> - * they are written so also forcibly stall.
> - */
> - if (stat.nr_immediate)
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> - }
> -
> - /*
> - * Stall direct reclaim for IO completions if underlying BDIs and node
> - * is congested. Allow kswapd to continue until it starts encountering
> - * unqueued dirty pages or cycling through the LRU too quickly.
> - */
> - if (!sc->hibernation_mode && !current_is_kswapd() &&
> - current_may_throttle())
> - wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
> + sc->nr.dirty += stat.nr_dirty;
> + sc->nr.congested += stat.nr_congested;
> + sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
> + sc->nr.writeback += stat.nr_writeback;
> + sc->nr.immediate += stat.nr_immediate;
> + if (file)
> + sc->nr.file_taken += nr_taken;
>
> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> nr_scanned, nr_reclaimed,
> @@ -2522,6 +2487,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> unsigned long node_lru_pages = 0;
> struct mem_cgroup *memcg;
>
> + memset(&sc->nr, 0, sizeof(sc->nr));
> +
> nr_reclaimed = sc->nr_reclaimed;
> nr_scanned = sc->nr_scanned;
>
> @@ -2587,6 +2554,61 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> if (sc->nr_reclaimed - nr_reclaimed)
> reclaimable = true;
>
> + /*
> + * If reclaim is isolating dirty pages under writeback, it
> + * implies that the long-lived page allocation rate is exceeding
> + * the page laundering rate. Either the global limits are not
> + * being effective at throttling processes due to the page
> + * distribution throughout zones or there is heavy usage of a
> + * slow backing device. The only option is to throttle from
> + * reclaim context which is not ideal as there is no guarantee
> + * the dirtying process is throttled in the same way
> + * balance_dirty_pages() manages.
> + *
> + * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the
> + * number of pages under pages flagged for immediate reclaim and
> + * stall if any are encountered in the nr_immediate check below.
> + */
> + if (sc->nr.writeback && sc->nr.writeback == sc->nr.file_taken)
> + set_bit(PGDAT_WRITEBACK, &pgdat->flags);
> +
> + /*
> + * Legacy memcg will stall in page writeback so avoid forcibly
> + * stalling here.
> + */
> + if (sane_reclaim(sc)) {
> + /*
> + * Tag a node as congested if all the dirty pages
> + * scanned were backed by a congested BDI and
> + * wait_iff_congested will stall.
> + */
> + if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> + set_bit(PGDAT_CONGESTED, &pgdat->flags);
> +
> + /* Allow kswapd to start writing pages during reclaim.*/
> + if (sc->nr.unqueued_dirty == sc->nr.file_taken)
> + set_bit(PGDAT_DIRTY, &pgdat->flags);
> +
> + /*
> + * If kswapd scans pages marked marked for immediate
> + * reclaim and under writeback (nr_immediate), it
> + * implies that pages are cycling through the LRU
> + * faster than they are written so also forcibly stall.
> + */
> + if (sc->nr.immediate)
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> + }
> +
> + /*
> + * Stall direct reclaim for IO completions if underlying BDIs
> + * and node is congested. Allow kswapd to continue until it
> + * starts encountering unqueued dirty pages or cycling through
> + * the LRU too quickly.
> + */
> + if (!sc->hibernation_mode && !current_is_kswapd() &&
> + current_may_throttle())
> + wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
> +
> } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> sc->nr_scanned - nr_scanned, sc));
>
> --
> 2.16.1
>
On Fri, Mar 23, 2018 at 8:20 AM, Andrey Ryabinin
<[email protected]> wrote:
> memcg reclaim may alter pgdat->flags based on the state of LRU lists
> in the cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep
> in congestion_wait(), PGDAT_DIRTY may force kswapd to write back filesystem
> pages. But the worst here is PGDAT_CONGESTED, since it may force all
> direct reclaims to stall in wait_iff_congested(). Note that only kswapd
> has the power to clear any of these bits. This might just never happen if
> cgroup limits are configured that way. So all direct reclaims will stall
> as long as we have some congested bdi in the system.
>
> Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
> pgdat, and only kswapd can clear pgdat->flags once the node is balanced,
> thus it's reasonable to leave all decisions about node state to kswapd.
What about global reclaimers? Is the assumption that when global
reclaimers hit such a condition, kswapd will be running and correctly
set PGDAT_CONGESTED?
>
> Moving pgdat->flags manipulation to kswapd means that cgroup2 reclaim
> now loses its congestion throttling mechanism. Add per-cgroup congestion
> state and throttle cgroup2 reclaimers if the memcg is in a congested state.
>
> Currently there is no need for per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
> bits since they alter only kswapd behavior.
>
> The problem could be easily demonstrated by creating heavy congestion
> in one cgroup:
>
> echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> mkdir -p /sys/fs/cgroup/congester
> echo 512M > /sys/fs/cgroup/congester/memory.max
> echo $$ > /sys/fs/cgroup/congester/cgroup.procs
> /* generate a lot of dirty data on slow HDD */
> while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
> ....
> while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
>
> and some job in another cgroup:
>
> mkdir /sys/fs/cgroup/victim
> echo 128M > /sys/fs/cgroup/victim/memory.max
>
> # time cat /dev/sda > /dev/null
> real 10m15.054s
> user 0m0.487s
> sys 1m8.505s
>
> According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
> of the time sleeping there.
>
> With the patch, the 'cat' doesn't waste time anymore:
>
> # time cat /dev/sda > /dev/null
> real 5m32.911s
> user 0m0.411s
> sys 0m56.664s
>
> Signed-off-by: Andrey Ryabinin <[email protected]>
> ---
> include/linux/backing-dev.h | 2 +-
> include/linux/memcontrol.h | 2 ++
> mm/backing-dev.c | 19 ++++------
> mm/vmscan.c | 86 ++++++++++++++++++++++++++++++++-------------
> 4 files changed, 71 insertions(+), 38 deletions(-)
>
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index f8894dbc0b19..539a5cf94fe2 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -175,7 +175,7 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
> }
>
> long congestion_wait(int sync, long timeout);
> -long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
> +long wait_iff_congested(int sync, long timeout);
>
> static inline bool bdi_cap_synchronous_io(struct backing_dev_info *bdi)
> {
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4525b4404a9e..44422e1d3def 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -190,6 +190,8 @@ struct mem_cgroup {
> /* vmpressure notifications */
> struct vmpressure vmpressure;
>
> + unsigned long flags;
> +
nit(you can ignore it): The name 'flags' is too general IMO. Something
more specific would be helpful.
Question: Does this 'flags' have any hierarchical meaning? Does a
congested parent mean all descendants are congested?
Question: Should this 'flags' be per-node? Is it ok for a congested
memcg to call wait_iff_congested for all nodes?
> /*
> * Should the accounting and control be hierarchical, per subtree?
> */
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 2eba1f54b1d3..2fc3f38e4c4f 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -1055,23 +1055,18 @@ EXPORT_SYMBOL(congestion_wait);
>
> /**
> * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
> - * @pgdat: A pgdat to check if it is heavily congested
> * @sync: SYNC or ASYNC IO
> * @timeout: timeout in jiffies
> *
> - * In the event of a congested backing_dev (any backing_dev) and the given
> - * @pgdat has experienced recent congestion, this waits for up to @timeout
> - * jiffies for either a BDI to exit congestion of the given @sync queue
> - * or a write to complete.
> - *
> - * In the absence of pgdat congestion, cond_resched() is called to yield
> - * the processor if necessary but otherwise does not sleep.
> + * In the event of a congested backing_dev (any backing_dev) this waits
> + * for up to @timeout jiffies for either a BDI to exit congestion of the
> + * given @sync queue or a write to complete.
> *
> * The return value is 0 if the sleep is for the full timeout. Otherwise,
> * it is the number of jiffies that were still remaining when the function
> * returned. return_value == timeout implies the function did not sleep.
> */
> -long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
> +long wait_iff_congested(int sync, long timeout)
> {
> long ret;
> unsigned long start = jiffies;
> @@ -1079,12 +1074,10 @@ long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
> wait_queue_head_t *wqh = &congestion_wqh[sync];
>
> /*
> - * If there is no congestion, or heavy congestion is not being
> - * encountered in the current pgdat, yield if necessary instead
> + * If there is no congestion, yield if necessary instead
> * of sleeping on the congestion queue
> */
> - if (atomic_read(&nr_wb_congested[sync]) == 0 ||
> - !test_bit(PGDAT_CONGESTED, &pgdat->flags)) {
> + if (atomic_read(&nr_wb_congested[sync]) == 0) {
> cond_resched();
>
> /* In case we scheduled, work out time remaining */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2134b3ac8fa0..1e6e047e10fd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -199,6 +199,18 @@ static bool sane_reclaim(struct scan_control *sc)
> #endif
> return false;
> }
> +
> +static void set_memcg_bit(enum pgdat_flags flag,
> + struct mem_cgroup *memcg)
> +{
> + set_bit(flag, &memcg->flags);
> +}
> +
> +static int test_memcg_bit(enum pgdat_flags flag,
> + struct mem_cgroup *memcg)
> +{
> + return test_bit(flag, &memcg->flags);
> +}
> #else
> static bool global_reclaim(struct scan_control *sc)
> {
> @@ -209,6 +221,17 @@ static bool sane_reclaim(struct scan_control *sc)
> {
> return true;
> }
> +
> +static inline void set_memcg_bit(enum pgdat_flags flag,
> + struct mem_cgroup *memcg)
> +{
> +}
> +
> +static inline int test_memcg_bit(enum pgdat_flags flag,
> + struct mem_cgroup *memcg)
> +{
> + return 0;
> +}
> #endif
>
> /*
> @@ -2472,6 +2495,12 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
> return true;
> }
>
> +static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
> +{
> + return test_bit(PGDAT_CONGESTED, &pgdat->flags) ||
> + (memcg && test_memcg_bit(PGDAT_CONGESTED, memcg));
> +}
> +
> static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> {
> struct reclaim_state *reclaim_state = current->reclaim_state;
> @@ -2554,29 +2583,28 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> if (sc->nr_reclaimed - nr_reclaimed)
> reclaimable = true;
>
> - /*
> - * If reclaim is isolating dirty pages under writeback, it
> - * implies that the long-lived page allocation rate is exceeding
> - * the page laundering rate. Either the global limits are not
> - * being effective at throttling processes due to the page
> - * distribution throughout zones or there is heavy usage of a
> - * slow backing device. The only option is to throttle from
> - * reclaim context which is not ideal as there is no guarantee
> - * the dirtying process is throttled in the same way
> - * balance_dirty_pages() manages.
> - *
> - * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the
> - * number of pages under pages flagged for immediate reclaim and
> - * stall if any are encountered in the nr_immediate check below.
> - */
> - if (sc->nr.writeback && sc->nr.writeback == sc->nr.file_taken)
> - set_bit(PGDAT_WRITEBACK, &pgdat->flags);
> + if (current_is_kswapd()) {
> + /*
> + * If reclaim is isolating dirty pages under writeback,
> + * it implies that the long-lived page allocation rate
> + * is exceeding the page laundering rate. Either the
> + * global limits are not being effective at throttling
> + * processes due to the page distribution throughout
> + * zones or there is heavy usage of a slow backing
> + * device. The only option is to throttle from reclaim
> + * context which is not ideal as there is no guarantee
> + * the dirtying process is throttled in the same way
> + * balance_dirty_pages() manages.
> + *
> + * Once a node is flagged PGDAT_WRITEBACK, kswapd will
> + * count the number of pages under pages flagged for
> + * immediate reclaim and stall if any are encountered
> + * in the nr_immediate check below.
> + */
> + if (sc->nr.writeback &&
> + sc->nr.writeback == sc->nr.file_taken)
> + set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
> - /*
> - * Legacy memcg will stall in page writeback so avoid forcibly
> - * stalling here.
> - */
> - if (sane_reclaim(sc)) {
> /*
> * Tag a node as congested if all the dirty pages
> * scanned were backed by a congested BDI and
> @@ -2599,6 +2627,14 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> congestion_wait(BLK_RW_ASYNC, HZ/10);
> }
>
> + /*
> + * Legacy memcg will stall in page writeback so avoid forcibly
> + * stalling in wait_iff_congested().
> + */
> + if (!global_reclaim(sc) && sane_reclaim(sc) &&
> + sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> + set_memcg_bit(PGDAT_CONGESTED, root);
> +
> /*
> * Stall direct reclaim for IO completions if underlying BDIs
> * and node is congested. Allow kswapd to continue until it
> @@ -2606,8 +2642,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> * the LRU too quickly.
> */
> if (!sc->hibernation_mode && !current_is_kswapd() &&
> - current_may_throttle())
> - wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
> + current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> + wait_iff_congested(BLK_RW_ASYNC, HZ/10);
>
> } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> sc->nr_scanned - nr_scanned, sc));
> @@ -3047,6 +3083,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> * the priority and make it zero.
> */
> shrink_node_memcg(pgdat, memcg, &sc, &lru_pages);
> + clear_bit(PGDAT_CONGESTED, &memcg->flags);
>
> trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
>
> @@ -3092,6 +3129,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> noreclaim_flag = memalloc_noreclaim_save();
> nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
> memalloc_noreclaim_restore(noreclaim_flag);
> + clear_bit(PGDAT_CONGESTED, &memcg->flags);
>
> trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
>
> --
> 2.16.1
>
On 04/06/2018 05:13 AM, Shakeel Butt wrote:
> On Fri, Mar 23, 2018 at 8:20 AM, Andrey Ryabinin
> <[email protected]> wrote:
>> memcg reclaim may alter pgdat->flags based on the state of LRU lists
>> in the cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep
>> in congestion_wait(), PGDAT_DIRTY may force kswapd to write back filesystem
>> pages. But the worst here is PGDAT_CONGESTED, since it may force all
>> direct reclaims to stall in wait_iff_congested(). Note that only kswapd
>> has the power to clear any of these bits. This might just never happen if
>> cgroup limits are configured that way. So all direct reclaims will stall
>> as long as we have some congested bdi in the system.
>>
>> Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
>> pgdat, and only kswapd can clear pgdat->flags once the node is balanced,
>> thus it's reasonable to leave all decisions about node state to kswapd.
>
> What about global reclaimers? Is the assumption that when global
> reclaimers hit such condition, kswapd will be running and correctly
> set PGDAT_CONGESTED?
>
The reason I moved this under if (current_is_kswapd()) is that only kswapd
can clear these flags. I'm less worried about the case when PGDAT_CONGESTED is
falsely left unset, and more worried about the case when it's falsely set. If a
direct reclaimer sets PGDAT_CONGESTED, do we have a guarantee that, after the
congestion problem is sorted out, kswapd will be woken up and clear the flag?
It seems like there is no such guarantee. E.g. direct reclaimers may eventually
balance the pgdat and kswapd simply won't wake up (see wakeup_kswapd()).
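To make the resulting behavior concrete, the combined check a direct reclaimer ends up doing can be modeled in plain userspace C. This is only a sketch: `struct pgdat` and `struct mem_cgroup` here are invented stand-ins holding a single boolean, where the kernel uses bitfields with `test_bit()` on `pgdat->flags` and the new per-memcg bit:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel structures (assumption: the real
 * code keeps these as bits tested with test_bit()). */
struct pgdat { bool congested; };
struct mem_cgroup { bool congested; };

/* Mirrors pgdat_memcg_congested(): a direct reclaimer stalls in
 * wait_iff_congested() if either the node as a whole, or the memcg
 * being reclaimed (when there is one), is marked congested. */
static bool pgdat_memcg_congested(const struct pgdat *pgdat,
				  const struct mem_cgroup *memcg)
{
	return pgdat->congested || (memcg && memcg->congested);
}
```

With this split, global reclaim (NULL memcg) only ever stalls on the node-level flag that kswapd alone sets and clears, while cgroup2 reclaim can also stall on its own per-memcg state.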
>> static inline bool bdi_cap_synchronous_io(struct backing_dev_info *bdi)
>> {
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 4525b4404a9e..44422e1d3def 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -190,6 +190,8 @@ struct mem_cgroup {
>> /* vmpressure notifications */
>> struct vmpressure vmpressure;
>>
>> + unsigned long flags;
>> +
>
> nit(you can ignore it): The name 'flags' is too general IMO. Something
> more specific would be helpful.
>
> Question: Does this 'flags' have any hierarchical meaning? Does a
> congested parent mean all descendants are congested?
It's the same as with pgdat->flags. A cgroup (or pgdat) is congested if at least one cgroup
in the hierarchy (in the pgdat) is congested and the rest are all either also congested or don't have
any file pages to reclaim (nr_file_taken == 0).
> Question: Should this 'flags' be per-node? Is it ok for a congested
> memcg to call wait_iff_congested for all nodes?
>
Yes, it should be per-node. I'll try to replace ->flags with 'bool congested'
in struct mem_cgroup_per_node.
On 04/06/2018 05:13 AM, Shakeel Butt wrote:
> Question: Should this 'flags' be per-node? Is it ok for a congested
> memcg to call wait_iff_congested for all nodes?
Indeed, congestion state should be per-node. If a memcg on node A is
congested, there is no point in stalling memcg reclaim from node B.
Make congestion state per-cgroup-per-node and record it in
'struct mem_cgroup_per_node'.
Signed-off-by: Andrey Ryabinin <[email protected]>
---
include/linux/memcontrol.h | 5 +++--
mm/vmscan.c | 39 +++++++++++++++++++++++++--------------
2 files changed, 28 insertions(+), 16 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8b394bbf1c86..af9eed2e3e04 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,6 +120,9 @@ struct mem_cgroup_per_node {
unsigned long usage_in_excess;/* Set to the value by which */
/* the soft limit is exceeded*/
bool on_tree;
+ bool congested; /* memcg has many dirty pages */
+ /* backed by a congested BDI */
+
struct mem_cgroup *memcg; /* Back pointer, we cannot */
/* use container_of */
};
@@ -189,8 +192,6 @@ struct mem_cgroup {
/* vmpressure notifications */
struct vmpressure vmpressure;
- unsigned long flags;
-
/*
* Should the accounting and control be hierarchical, per subtree?
*/
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99688299eba8..78214c899710 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -200,16 +200,27 @@ static bool sane_reclaim(struct scan_control *sc)
return false;
}
-static void set_memcg_bit(enum pgdat_flags flag,
- struct mem_cgroup *memcg)
+static void set_memcg_congestion(pg_data_t *pgdat,
+ struct mem_cgroup *memcg,
+ bool congested)
{
- set_bit(flag, &memcg->flags);
+ struct mem_cgroup_per_node *mz;
+
+ if (!memcg)
+ return;
+
+ mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
+ WRITE_ONCE(mz->congested, congested);
}
-static int test_memcg_bit(enum pgdat_flags flag,
+static bool memcg_congested(pg_data_t *pgdat,
struct mem_cgroup *memcg)
{
- return test_bit(flag, &memcg->flags);
+ struct mem_cgroup_per_node *mz;
+
+ mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
+ return READ_ONCE(mz->congested);
+
}
#else
static bool global_reclaim(struct scan_control *sc)
@@ -222,15 +233,16 @@ static bool sane_reclaim(struct scan_control *sc)
return true;
}
-static inline void set_memcg_bit(enum pgdat_flags flag,
- struct mem_cgroup *memcg)
+static inline void set_memcg_congestion(struct pglist_data *pgdat,
+ struct mem_cgroup *memcg, bool congested)
{
}
-static inline int test_memcg_bit(enum pgdat_flags flag,
- struct mem_cgroup *memcg)
+static inline bool memcg_congested(struct pglist_data *pgdat,
+ struct mem_cgroup *memcg)
{
- return 0;
+ return false;
+
}
#endif
@@ -2482,7 +2494,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
{
return test_bit(PGDAT_CONGESTED, &pgdat->flags) ||
- (memcg && test_memcg_bit(PGDAT_CONGESTED, memcg));
+ (memcg && memcg_congested(pgdat, memcg));
}
static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
@@ -2617,7 +2629,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
*/
if (!global_reclaim(sc) && sane_reclaim(sc) &&
sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
- set_memcg_bit(PGDAT_CONGESTED, root);
+ set_memcg_congestion(pgdat, root, true);
/*
* Stall direct reclaim for IO completions if underlying BDIs
@@ -2844,6 +2856,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
continue;
last_pgdat = zone->zone_pgdat;
snapshot_refaults(sc->target_mem_cgroup, zone->zone_pgdat);
+ set_memcg_congestion(last_pgdat, sc->target_mem_cgroup, false);
}
delayacct_freepages_end();
@@ -3067,7 +3080,6 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
* the priority and make it zero.
*/
shrink_node_memcg(pgdat, memcg, &sc, &lru_pages);
- clear_bit(PGDAT_CONGESTED, &memcg->flags);
trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
@@ -3113,7 +3125,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
noreclaim_flag = memalloc_noreclaim_save();
nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
memalloc_noreclaim_restore(noreclaim_flag);
- clear_bit(PGDAT_CONGESTED, &memcg->flags);
trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
--
2.16.1
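For illustration, the per-cgroup-per-node state this patch introduces can be modeled in userspace C. This is a sketch under stated assumptions: the fixed node count, the stub structures, and the plain boolean are all inventions for the example; the kernel goes through `mem_cgroup_nodeinfo()` and uses `WRITE_ONCE()`/`READ_ONCE()` on `mz->congested`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 4	/* assumption: small fixed node count for the sketch */

/* Stand-in for struct mem_cgroup_per_node: congestion is tracked
 * separately for every NUMA node the cgroup has pages on. */
struct mem_cgroup_per_node { bool congested; };
struct mem_cgroup { struct mem_cgroup_per_node nodeinfo[MAX_NODES]; };

/* Mirrors set_memcg_congestion(): a NULL memcg (global reclaim) is a
 * no-op, otherwise the state for the given node is updated. */
static void set_memcg_congestion(struct mem_cgroup *memcg, int nid,
				 bool congested)
{
	if (!memcg)
		return;
	memcg->nodeinfo[nid].congested = congested; /* WRITE_ONCE() in-kernel */
}

/* Mirrors memcg_congested(): only the queried node's state matters. */
static bool memcg_congested(const struct mem_cgroup *memcg, int nid)
{
	return memcg->nodeinfo[nid].congested; /* READ_ONCE() in-kernel */
}
```

The point of the per-node split is visible directly: marking node 1 congested leaves reclaim of the same cgroup on node 0 unthrottled.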
On Fri, Apr 6, 2018 at 4:44 AM, Andrey Ryabinin <[email protected]> wrote:
> On 04/06/2018 05:13 AM, Shakeel Butt wrote:
>> On Fri, Mar 23, 2018 at 8:20 AM, Andrey Ryabinin
>> <[email protected]> wrote:
>>> memcg reclaim may alter pgdat->flags based on the state of LRU lists
>>> in the cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep
>>> in congestion_wait(), PGDAT_DIRTY may force kswapd to write back filesystem
>>> pages. But the worst here is PGDAT_CONGESTED, since it may force all
>>> direct reclaims to stall in wait_iff_congested(). Note that only kswapd
>>> has the power to clear any of these bits. This might just never happen if
>>> cgroup limits are configured that way. So all direct reclaims will stall
>>> as long as we have some congested bdi in the system.
>>>
>>> Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
>>> pgdat, and only kswapd can clear pgdat->flags once the node is balanced,
>>> thus it's reasonable to leave all decisions about node state to kswapd.
>>
>> What about global reclaimers? Is the assumption that when global
>> reclaimers hit such condition, kswapd will be running and correctly
>> set PGDAT_CONGESTED?
>>
>
> The reason I moved this under if (current_is_kswapd()) is that only kswapd
> can clear these flags. I'm less worried about the case when PGDAT_CONGESTED is
> falsely left unset, and more worried about the case when it's falsely set. If a
> direct reclaimer sets PGDAT_CONGESTED, do we have a guarantee that, after the
> congestion problem is sorted out, kswapd will be woken up and clear the flag?
> It seems like there is no such guarantee. E.g. direct reclaimers may eventually
> balance the pgdat and kswapd simply won't wake up (see wakeup_kswapd()).
>
>
Thanks for the explanation, I think it should be in the commit message.
On Fri, Apr 6, 2018 at 6:52 AM, Andrey Ryabinin <[email protected]> wrote:
> On 04/06/2018 05:13 AM, Shakeel Butt wrote:
>> Question: Should this 'flags' be per-node? Is it ok for a congested
>> memcg to call wait_iff_congested for all nodes?
>
> Indeed, congestion state should be per-node. If a memcg on node A is
> congested, there is no point in stalling memcg reclaim from node B.
>
> Make congestion state per-cgroup-per-node and record it in
> 'struct mem_cgroup_per_node'.
>
> Signed-off-by: Andrey Ryabinin <[email protected]>
> ---
> include/linux/memcontrol.h | 5 +++--
> mm/vmscan.c | 39 +++++++++++++++++++++++++--------------
> 2 files changed, 28 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 8b394bbf1c86..af9eed2e3e04 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -120,6 +120,9 @@ struct mem_cgroup_per_node {
> unsigned long usage_in_excess;/* Set to the value by which */
> /* the soft limit is exceeded*/
> bool on_tree;
> + bool congested; /* memcg has many dirty pages */
> + /* backed by a congested BDI */
> +
> struct mem_cgroup *memcg; /* Back pointer, we cannot */
> /* use container_of */
> };
> @@ -189,8 +192,6 @@ struct mem_cgroup {
> /* vmpressure notifications */
> struct vmpressure vmpressure;
>
> - unsigned long flags;
> -
> /*
> * Should the accounting and control be hierarchical, per subtree?
> */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 99688299eba8..78214c899710 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -200,16 +200,27 @@ static bool sane_reclaim(struct scan_control *sc)
> return false;
> }
>
> -static void set_memcg_bit(enum pgdat_flags flag,
> - struct mem_cgroup *memcg)
> +static void set_memcg_congestion(pg_data_t *pgdat,
> + struct mem_cgroup *memcg,
> + bool congested)
> {
> - set_bit(flag, &memcg->flags);
> + struct mem_cgroup_per_node *mz;
> +
> + if (!memcg)
> + return;
> +
> + mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
> + WRITE_ONCE(mz->congested, congested);
> }
>
> -static int test_memcg_bit(enum pgdat_flags flag,
> +static bool memcg_congested(pg_data_t *pgdat,
> struct mem_cgroup *memcg)
> {
> - return test_bit(flag, &memcg->flags);
> + struct mem_cgroup_per_node *mz;
> +
> + mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
> + return READ_ONCE(mz->congested);
> +
> }
> #else
> static bool global_reclaim(struct scan_control *sc)
> @@ -222,15 +233,16 @@ static bool sane_reclaim(struct scan_control *sc)
> return true;
> }
>
> -static inline void set_memcg_bit(enum pgdat_flags flag,
> - struct mem_cgroup *memcg)
> +static inline void set_memcg_congestion(struct pglist_data *pgdat,
> + struct mem_cgroup *memcg, bool congested)
> {
> }
>
> -static inline int test_memcg_bit(enum pgdat_flags flag,
> - struct mem_cgroup *memcg)
> +static inline bool memcg_congested(struct pglist_data *pgdat,
> + struct mem_cgroup *memcg)
> {
> - return 0;
> + return false;
> +
> }
> #endif
>
> @@ -2482,7 +2494,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
> static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
> {
> return test_bit(PGDAT_CONGESTED, &pgdat->flags) ||
> - (memcg && test_memcg_bit(PGDAT_CONGESTED, memcg));
> + (memcg && memcg_congested(pgdat, memcg));
I am wondering if we should check all ancestors for congestion as
well. Maybe a parallel memcg reclaimer might have set some ancestor of
this memcg to congested.
> }
>
> static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> @@ -2617,7 +2629,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> */
> if (!global_reclaim(sc) && sane_reclaim(sc) &&
> sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> - set_memcg_bit(PGDAT_CONGESTED, root);
> + set_memcg_congestion(pgdat, root, true);
>
> /*
> * Stall direct reclaim for IO completions if underlying BDIs
> @@ -2844,6 +2856,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> continue;
> last_pgdat = zone->zone_pgdat;
> snapshot_refaults(sc->target_mem_cgroup, zone->zone_pgdat);
> + set_memcg_congestion(last_pgdat, sc->target_mem_cgroup, false);
> }
>
> delayacct_freepages_end();
> @@ -3067,7 +3080,6 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> * the priority and make it zero.
> */
> shrink_node_memcg(pgdat, memcg, &sc, &lru_pages);
> - clear_bit(PGDAT_CONGESTED, &memcg->flags);
>
> trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
>
> @@ -3113,7 +3125,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> noreclaim_flag = memalloc_noreclaim_save();
> nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
> memalloc_noreclaim_restore(noreclaim_flag);
> - clear_bit(PGDAT_CONGESTED, &memcg->flags);
>
> trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
>
> --
> 2.16.1
>
On 04/06/2018 05:37 PM, Shakeel Butt wrote:
>>
>> @@ -2482,7 +2494,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>> static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
>> {
>> return test_bit(PGDAT_CONGESTED, &pgdat->flags) ||
>> - (memcg && test_memcg_bit(PGDAT_CONGESTED, memcg));
>> + (memcg && memcg_congested(pgdat, memcg));
>
> I am wondering if we should check all ancestors for congestion as
> well. Maybe a parallel memcg reclaimer might have set some ancestor of
> this memcg to congested.
>
Why? If the ancestor is congested but its child (the one we currently reclaim) is not,
it could mean only 2 things:
- Either the child uses mostly anon and its inactive file lru is small (file_lru >> priority == 0),
so it's not congested.
- Or the child was congested recently (at the time the ancestor scanned this group),
but not anymore. So the information from the ancestor is simply outdated.
On Fri, Apr 6, 2018 at 8:09 AM, Andrey Ryabinin <[email protected]> wrote:
>
>
> On 04/06/2018 05:37 PM, Shakeel Butt wrote:
>
>>>
>>> @@ -2482,7 +2494,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>>> static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
>>> {
>>> return test_bit(PGDAT_CONGESTED, &pgdat->flags) ||
>>> - (memcg && test_memcg_bit(PGDAT_CONGESTED, memcg));
>>> + (memcg && memcg_congested(pgdat, memcg));
>>
>> I am wondering if we should check all ancestors for congestion as
>> well. Maybe a parallel memcg reclaimer might have set some ancestor of
>> this memcg to congested.
>>
>
> Why? If the ancestor is congested but its child (the one we currently reclaim) is not,
> it could mean only 2 things:
> - Either the child uses mostly anon and its inactive file lru is small (file_lru >> priority == 0),
> so it's not congested.
> - Or the child was congested recently (at the time the ancestor scanned this group),
> but not anymore. So the information from the ancestor is simply outdated.
>
Oh yeah, you explained in the other email as well. Thanks.
I think Andrew will squash this patch with the previous one. Andrew,
please add the following to the squashed patch.
Reviewed-by: Shakeel Butt <[email protected]>
On Fri, Mar 23, 2018 at 06:20:26PM +0300, Andrey Ryabinin wrote:
> Update some comments that became stale since the transition from per-zone
> to per-node reclaim.
>
> Signed-off-by: Andrey Ryabinin <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Fri, Mar 23, 2018 at 06:20:27PM +0300, Andrey Ryabinin wrote:
> Only kswapd can have a non-zero nr_immediate, and current_may_throttle() is
> always true for kswapd (the PF_LESS_THROTTLE bit is never set), thus it's
> enough to check stat.nr_immediate only.
>
> Signed-off-by: Andrey Ryabinin <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Fri, Mar 23, 2018 at 06:20:28PM +0300, Andrey Ryabinin wrote:
> We have a separate LRU list for each memory cgroup. Memory reclaim iterates
> over cgroups and calls shrink_inactive_list() on every inactive LRU list.
> Based on the state of a single LRU, shrink_inactive_list() may flag
> the whole node as dirty, congested or under writeback. This is obviously
> wrong and hurtful. It's especially hurtful when we have a possibly
> small congested cgroup in the system. Then *all* direct reclaims waste time
> by sleeping in wait_iff_congested(). And the more memcgs the system
> has, the longer the memory allocation stall is, because
> wait_iff_congested() is called on each lru-list scan.
>
> Sum the reclaim stats across all visited LRUs on the node and flag the node
> as dirty, congested or under writeback based on that sum. Also call
> congestion_wait() and wait_iff_congested() once per pgdat scan, instead of
> once per lru-list scan.
>
> This only fixes the problem for the global reclaim case. Per-cgroup reclaim
> may alter global pgdat flags too, which is wrong. But that is a separate
> issue and will be addressed in the next patch.
>
> This change will not have any effect on systems with all the workload
> concentrated in a single cgroup.
This makes a ton of sense, and I'm going to ack the patch, but there is
one issue here:
> @@ -2587,6 +2554,61 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> if (sc->nr_reclaimed - nr_reclaimed)
> reclaimable = true;
>
> + /*
> + * If reclaim is isolating dirty pages under writeback, it
> + * implies that the long-lived page allocation rate is exceeding
> + * the page laundering rate. Either the global limits are not
> + * being effective at throttling processes due to the page
> + * distribution throughout zones or there is heavy usage of a
> + * slow backing device. The only option is to throttle from
> + * reclaim context which is not ideal as there is no guarantee
> + * the dirtying process is throttled in the same way
> + * balance_dirty_pages() manages.
> + *
> + * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the
> + * number of pages under pages flagged for immediate reclaim and
> + * stall if any are encountered in the nr_immediate check below.
> + */
> + if (sc->nr.writeback && sc->nr.writeback == sc->nr.file_taken)
> + set_bit(PGDAT_WRITEBACK, &pgdat->flags);
> +
> + /*
> + * Legacy memcg will stall in page writeback so avoid forcibly
> + * stalling here.
> + */
> + if (sane_reclaim(sc)) {
> + /*
> + * Tag a node as congested if all the dirty pages
> + * scanned were backed by a congested BDI and
> + * wait_iff_congested will stall.
> + */
> + if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> + set_bit(PGDAT_CONGESTED, &pgdat->flags);
> +
> + /* Allow kswapd to start writing pages during reclaim.*/
> + if (sc->nr.unqueued_dirty == sc->nr.file_taken)
> + set_bit(PGDAT_DIRTY, &pgdat->flags);
> +
> + /*
> + * If kswapd scans pages marked marked for immediate
> + * reclaim and under writeback (nr_immediate), it
> + * implies that pages are cycling through the LRU
> + * faster than they are written so also forcibly stall.
> + */
> + if (sc->nr.immediate)
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> + }
This isn't quite equivalent to what we have right now.
Yes, nr_dirty, nr_unqueued_dirty and nr_congested apply to file pages
only. That part is about waking the flushers and avoiding writing
files in 4k chunks from reclaim context. So those numbers do need to
be compared against scanned *file* pages.
But nr_writeback and nr_immediate is about throttling reclaim when we
hit too many pages under writeout, and that applies to both file and
anonymous/swap pages. We do want to throttle on swapout, too.
So nr_writeback needs to be compared against all of nr_taken, not just the file pages.
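The distinction can be shown with a small userspace C model (a sketch only: the field names follow the fix that was later folded in, but `flag_writeback()` itself is an invented stand-in for the check in shrink_node()). In a pure-swapout scan, comparing against file pages alone would never trigger the stall:

```c
#include <assert.h>
#include <stdbool.h>

/* Isolated-page counters for one pgdat scan. sc->nr.taken counts file
 * *and* anon pages; sc->nr.file_taken counts only file pages. */
struct scan_nr {
	unsigned int writeback;		/* pages under writeout, file or swap */
	unsigned int taken;		/* all isolated pages */
	unsigned int file_taken;	/* isolated file pages only */
};

/* PGDAT_WRITEBACK should trigger when everything isolated is under
 * writeout, including anonymous pages on their way to swap. */
static bool flag_writeback(const struct scan_nr *nr)
{
	return nr->writeback && nr->writeback == nr->taken;
}
```

For a scan that isolated 32 anon pages, all under swapout (`writeback == taken == 32`, `file_taken == 0`), the total-taken comparison stalls as intended, while `writeback == file_taken` would have been false.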
On Fri, Apr 06, 2018 at 04:52:15PM +0300, Andrey Ryabinin wrote:
> On 04/06/2018 05:13 AM, Shakeel Butt wrote:
> > Question: Should this 'flags' be per-node? Is it ok for a congested
> > memcg to call wait_iff_congested for all nodes?
>
> Indeed, congestion state should be per-node. If a memcg on node A is
> congested, there is no point in stalling memcg reclaim from node B.
>
> Make congestion state per-cgroup-per-node and record it in
> 'struct mem_cgroup_per_node'.
>
> Signed-off-by: Andrey Ryabinin <[email protected]>
Thanks for fixing this, Andrey. This is great.
For the combined patch and this fix:
Acked-by: Johannes Weiner <[email protected]>
On 04/06/2018 07:28 PM, Johannes Weiner wrote:
>
> This isn't quite equivalent to what we have right now.
>
> Yes, nr_dirty, nr_unqueued_dirty and nr_congested apply to file pages
> only. That part is about waking the flushers and avoiding writing
> files in 4k chunks from reclaim context. So those numbers do need to
> be compared against scanned *file* pages.
>
> But nr_writeback and nr_immediate are about throttling reclaim when we
> hit too many pages under writeout, and that applies to both file and
> anonymous/swap pages. We do want to throttle on swapout, too.
>
> So nr_writeback needs to check against all nr_taken, not just file.
>
Agreed, the fix is below. It causes a conflict in the next 4/4 patch,
so I'll just send v3 with all fixes folded.
---
mm/vmscan.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4d848b8df01f..c45497475e84 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -124,6 +124,7 @@ struct scan_control {
unsigned int writeback;
unsigned int immediate;
unsigned int file_taken;
+ unsigned int taken;
} nr;
};
@@ -1771,6 +1772,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
sc->nr.writeback += stat.nr_writeback;
sc->nr.immediate += stat.nr_immediate;
+ sc->nr.taken += nr_taken;
if (file)
sc->nr.file_taken += nr_taken;
@@ -2553,7 +2555,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
* number of pages under pages flagged for immediate reclaim and
* stall if any are encountered in the nr_immediate check below.
*/
- if (sc->nr.writeback && sc->nr.writeback == sc->nr.file_taken)
+ if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
set_bit(PGDAT_WRITEBACK, &pgdat->flags);
/*
--
2.16.1
We have a separate LRU list for each memory cgroup. Memory reclaim iterates
over cgroups and calls shrink_inactive_list() on every inactive LRU list.
Based on the state of a single LRU, shrink_inactive_list() may flag the
whole node as dirty, congested or under writeback. This is obviously wrong
and hurtful. It's especially hurtful when we have a possibly small
congested cgroup in the system. Then *all* direct reclaims waste time by
sleeping in wait_iff_congested(). And the more memcgs the system has,
the longer the memory allocation stall is, because wait_iff_congested()
is called on each lru-list scan.
Sum the reclaim stats across all visited LRUs on the node and flag the node
as dirty, congested or under writeback based on that sum. Also call
congestion_wait() and wait_iff_congested() once per pgdat scan, instead of
once per lru-list scan.
This only fixes the problem for the global reclaim case. Per-cgroup reclaim
may alter global pgdat flags too, which is wrong, but that is a separate
issue and will be addressed in the next patch.
This change will not have any effect on systems with all workload
concentrated in a single cgroup.
Signed-off-by: Andrey Ryabinin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
---
Changes since v2:
- Reviewed-by: Shakeel Butt <[email protected]>
- Check nr_writeback against all nr_taken, not just file (Johannes)
mm/vmscan.c | 126 ++++++++++++++++++++++++++++++++++++------------------------
1 file changed, 75 insertions(+), 51 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 403f59edd53e..1ecc648b6191 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -116,6 +116,16 @@ struct scan_control {
/* Number of pages freed so far during a call to shrink_zones() */
unsigned long nr_reclaimed;
+
+ struct {
+ unsigned int dirty;
+ unsigned int unqueued_dirty;
+ unsigned int congested;
+ unsigned int writeback;
+ unsigned int immediate;
+ unsigned int file_taken;
+ unsigned int taken;
+ } nr;
};
#ifdef ARCH_HAS_PREFETCH
@@ -1754,23 +1764,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
mem_cgroup_uncharge_list(&page_list);
free_unref_page_list(&page_list);
- /*
- * If reclaim is isolating dirty pages under writeback, it implies
- * that the long-lived page allocation rate is exceeding the page
- * laundering rate. Either the global limits are not being effective
- * at throttling processes due to the page distribution throughout
- * zones or there is heavy usage of a slow backing device. The
- * only option is to throttle from reclaim context which is not ideal
- * as there is no guarantee the dirtying process is throttled in the
- * same way balance_dirty_pages() manages.
- *
- * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the number
- * of pages under pages flagged for immediate reclaim and stall if any
- * are encountered in the nr_immediate check below.
- */
- if (stat.nr_writeback && stat.nr_writeback == nr_taken)
- set_bit(PGDAT_WRITEBACK, &pgdat->flags);
-
/*
* If dirty pages are scanned that are not queued for IO, it
* implies that flushers are not doing their job. This can
@@ -1785,40 +1778,14 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
if (stat.nr_unqueued_dirty == nr_taken)
wakeup_flusher_threads(WB_REASON_VMSCAN);
- /*
- * Legacy memcg will stall in page writeback so avoid forcibly
- * stalling here.
- */
- if (sane_reclaim(sc)) {
- /*
- * Tag a node as congested if all the dirty pages scanned were
- * backed by a congested BDI and wait_iff_congested will stall.
- */
- if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
- set_bit(PGDAT_CONGESTED, &pgdat->flags);
-
- /* Allow kswapd to start writing pages during reclaim. */
- if (stat.nr_unqueued_dirty == nr_taken)
- set_bit(PGDAT_DIRTY, &pgdat->flags);
-
- /*
- * If kswapd scans pages marked marked for immediate
- * reclaim and under writeback (nr_immediate), it implies
- * that pages are cycling through the LRU faster than
- * they are written so also forcibly stall.
- */
- if (stat.nr_immediate)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
- }
-
- /*
- * Stall direct reclaim for IO completions if underlying BDIs and node
- * is congested. Allow kswapd to continue until it starts encountering
- * unqueued dirty pages or cycling through the LRU too quickly.
- */
- if (!sc->hibernation_mode && !current_is_kswapd() &&
- current_may_throttle())
- wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
+ sc->nr.dirty += stat.nr_dirty;
+ sc->nr.congested += stat.nr_congested;
+ sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
+ sc->nr.writeback += stat.nr_writeback;
+ sc->nr.immediate += stat.nr_immediate;
+ sc->nr.taken += nr_taken;
+ if (file)
+ sc->nr.file_taken += nr_taken;
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed,
@@ -2522,6 +2489,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
unsigned long node_lru_pages = 0;
struct mem_cgroup *memcg;
+ memset(&sc->nr, 0, sizeof(sc->nr));
+
nr_reclaimed = sc->nr_reclaimed;
nr_scanned = sc->nr_scanned;
@@ -2587,6 +2556,61 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
if (sc->nr_reclaimed - nr_reclaimed)
reclaimable = true;
+ /*
+ * If reclaim is isolating dirty pages under writeback, it
+ * implies that the long-lived page allocation rate is exceeding
+ * the page laundering rate. Either the global limits are not
+ * being effective at throttling processes due to the page
+ * distribution throughout zones or there is heavy usage of a
+ * slow backing device. The only option is to throttle from
+ * reclaim context which is not ideal as there is no guarantee
+ * the dirtying process is throttled in the same way
+ * balance_dirty_pages() manages.
+ *
+ * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the
+ * number of pages under pages flagged for immediate reclaim and
+ * stall if any are encountered in the nr_immediate check below.
+ */
+ if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
+ set_bit(PGDAT_WRITEBACK, &pgdat->flags);
+
+ /*
+ * Legacy memcg will stall in page writeback so avoid forcibly
+ * stalling here.
+ */
+ if (sane_reclaim(sc)) {
+ /*
+ * Tag a node as congested if all the dirty pages
+ * scanned were backed by a congested BDI and
+ * wait_iff_congested will stall.
+ */
+ if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
+ set_bit(PGDAT_CONGESTED, &pgdat->flags);
+
+ /* Allow kswapd to start writing pages during reclaim.*/
+ if (sc->nr.unqueued_dirty == sc->nr.file_taken)
+ set_bit(PGDAT_DIRTY, &pgdat->flags);
+
+ /*
+ * If kswapd scans pages marked for immediate
+ * reclaim and under writeback (nr_immediate), it
+ * implies that pages are cycling through the LRU
+ * faster than they are written so also forcibly stall.
+ */
+ if (sc->nr.immediate)
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ }
+
+ /*
+ * Stall direct reclaim for IO completions if underlying BDIs
+ * and node is congested. Allow kswapd to continue until it
+ * starts encountering unqueued dirty pages or cycling through
+ * the LRU too quickly.
+ */
+ if (!sc->hibernation_mode && !current_is_kswapd() &&
+ current_may_throttle())
+ wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
+
} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
sc->nr_scanned - nr_scanned, sc));
--
2.16.1
memcg reclaim may alter pgdat->flags based on the state of the LRU lists in
a cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep in
congestion_wait(), PGDAT_DIRTY may force kswapd to write back filesystem
pages. But the worst here is PGDAT_CONGESTED, since it may force all
direct reclaims to stall in wait_iff_congested(). Note that only kswapd
has the power to clear any of these bits. This might just never happen if
the cgroup limits are configured that way. So all direct reclaims will stall
as long as we have some congested bdi in the system.
Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
pgdat, and only kswapd can clear pgdat->flags once the node is balanced,
thus it's reasonable to leave all decisions about node state to kswapd.
Why only kswapd? Why not allow global direct reclaim to change these flags?
It is because currently only kswapd can clear these flags. I'm less worried
about the case when PGDAT_CONGESTED is falsely not set, and more worried
about the case when it is falsely set. If a direct reclaimer sets
PGDAT_CONGESTED, do we have a guarantee that after the congestion problem is
sorted out, kswapd will be woken up and clear the flag? It seems like there
is no such guarantee. E.g. direct reclaimers may eventually balance the
pgdat and kswapd simply won't wake up (see wakeup_kswapd()).
Moving pgdat->flags manipulation to kswapd means that cgroup2 reclaim now
loses its congestion throttling mechanism. Add per-cgroup congestion
state and throttle cgroup2 reclaimers if the memcg is in a congested state.
Currently there is no need for per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
The problem could be easily demonstrated by creating heavy congestion
in one cgroup:
echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/congester
echo 512M > /sys/fs/cgroup/congester/memory.max
echo $$ > /sys/fs/cgroup/congester/cgroup.procs
/* generate a lot of dirty data on slow HDD */
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
....
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
and some job in another cgroup:
mkdir /sys/fs/cgroup/victim
echo 128M > /sys/fs/cgroup/victim/memory.max
# time cat /dev/sda > /dev/null
real 10m15.054s
user 0m0.487s
sys 1m8.505s
According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
of the time sleeping there.
With the patch, 'cat' doesn't waste time anymore:
# time cat /dev/sda > /dev/null
real 5m32.911s
user 0m0.411s
sys 0m56.664s
Signed-off-by: Andrey Ryabinin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Michal Hocko <[email protected]>
---
Changes since v2:
- Make congestion state per-cgroup-per-node instead of just per-cgroup. (Shakeel)
- Changelog update. (Shakeel)
- Add Acked-by/Reviewed-by
include/linux/backing-dev.h | 2 +-
include/linux/memcontrol.h | 3 ++
mm/backing-dev.c | 19 +++------
mm/vmscan.c | 96 +++++++++++++++++++++++++++++++++------------
4 files changed, 82 insertions(+), 38 deletions(-)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3e4ce54d84ab..e6cbb915ee56 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -175,7 +175,7 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
}
long congestion_wait(int sync, long timeout);
-long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
+long wait_iff_congested(int sync, long timeout);
static inline bool bdi_cap_synchronous_io(struct backing_dev_info *bdi)
{
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c46016bb25eb..f292efac378d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,6 +120,9 @@ struct mem_cgroup_per_node {
unsigned long usage_in_excess;/* Set to the value by which */
/* the soft limit is exceeded*/
bool on_tree;
+ bool congested; /* memcg has many dirty pages */
+ /* backed by a congested BDI */
+
struct mem_cgroup *memcg; /* Back pointer, we cannot */
/* use container_of */
};
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index fac66abd5a68..2b23ba08389a 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1022,23 +1022,18 @@ EXPORT_SYMBOL(congestion_wait);
/**
* wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
- * @pgdat: A pgdat to check if it is heavily congested
* @sync: SYNC or ASYNC IO
* @timeout: timeout in jiffies
*
- * In the event of a congested backing_dev (any backing_dev) and the given
- * @pgdat has experienced recent congestion, this waits for up to @timeout
- * jiffies for either a BDI to exit congestion of the given @sync queue
- * or a write to complete.
- *
- * In the absence of pgdat congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
+ * In the event of a congested backing_dev (any backing_dev) this waits
+ * for up to @timeout jiffies for either a BDI to exit congestion of the
+ * given @sync queue or a write to complete.
*
* The return value is 0 if the sleep is for the full timeout. Otherwise,
* it is the number of jiffies that were still remaining when the function
* returned. return_value == timeout implies the function did not sleep.
*/
-long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
+long wait_iff_congested(int sync, long timeout)
{
long ret;
unsigned long start = jiffies;
@@ -1046,12 +1041,10 @@ long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
wait_queue_head_t *wqh = &congestion_wqh[sync];
/*
- * If there is no congestion, or heavy congestion is not being
- * encountered in the current pgdat, yield if necessary instead
+ * If there is no congestion, yield if necessary instead
* of sleeping on the congestion queue
*/
- if (atomic_read(&nr_wb_congested[sync]) == 0 ||
- !test_bit(PGDAT_CONGESTED, &pgdat->flags)) {
+ if (atomic_read(&nr_wb_congested[sync]) == 0) {
cond_resched();
/* In case we scheduled, work out time remaining */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1ecc648b6191..e411385b304a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -200,6 +200,29 @@ static bool sane_reclaim(struct scan_control *sc)
#endif
return false;
}
+
+static void set_memcg_congestion(pg_data_t *pgdat,
+ struct mem_cgroup *memcg,
+ bool congested)
+{
+ struct mem_cgroup_per_node *mn;
+
+ if (!memcg)
+ return;
+
+ mn = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
+ WRITE_ONCE(mn->congested, congested);
+}
+
+static bool memcg_congested(pg_data_t *pgdat,
+ struct mem_cgroup *memcg)
+{
+ struct mem_cgroup_per_node *mn;
+
+ mn = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
+ return READ_ONCE(mn->congested);
+
+}
#else
static bool global_reclaim(struct scan_control *sc)
{
@@ -210,6 +233,18 @@ static bool sane_reclaim(struct scan_control *sc)
{
return true;
}
+
+static inline void set_memcg_congestion(struct pglist_data *pgdat,
+ struct mem_cgroup *memcg, bool congested)
+{
+}
+
+static inline bool memcg_congested(struct pglist_data *pgdat,
+ struct mem_cgroup *memcg)
+{
+ return false;
+
+}
#endif
/*
@@ -2474,6 +2509,12 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
return true;
}
+static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
+{
+ return test_bit(PGDAT_CONGESTED, &pgdat->flags) ||
+ (memcg && memcg_congested(pgdat, memcg));
+}
+
static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
{
struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -2556,29 +2597,27 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
if (sc->nr_reclaimed - nr_reclaimed)
reclaimable = true;
- /*
- * If reclaim is isolating dirty pages under writeback, it
- * implies that the long-lived page allocation rate is exceeding
- * the page laundering rate. Either the global limits are not
- * being effective at throttling processes due to the page
- * distribution throughout zones or there is heavy usage of a
- * slow backing device. The only option is to throttle from
- * reclaim context which is not ideal as there is no guarantee
- * the dirtying process is throttled in the same way
- * balance_dirty_pages() manages.
- *
- * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the
- * number of pages under pages flagged for immediate reclaim and
- * stall if any are encountered in the nr_immediate check below.
- */
- if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
- set_bit(PGDAT_WRITEBACK, &pgdat->flags);
+ if (current_is_kswapd()) {
+ /*
+ * If reclaim is isolating dirty pages under writeback,
+ * it implies that the long-lived page allocation rate
+ * is exceeding the page laundering rate. Either the
+ * global limits are not being effective at throttling
+ * processes due to the page distribution throughout
+ * zones or there is heavy usage of a slow backing
+ * device. The only option is to throttle from reclaim
+ * context which is not ideal as there is no guarantee
+ * the dirtying process is throttled in the same way
+ * balance_dirty_pages() manages.
+ *
+ * Once a node is flagged PGDAT_WRITEBACK, kswapd will
+ * count the number of pages under pages flagged for
+ * immediate reclaim and stall if any are encountered
+ * in the nr_immediate check below.
+ */
+ if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
+ set_bit(PGDAT_WRITEBACK, &pgdat->flags);
- /*
- * Legacy memcg will stall in page writeback so avoid forcibly
- * stalling here.
- */
- if (sane_reclaim(sc)) {
/*
* Tag a node as congested if all the dirty pages
* scanned were backed by a congested BDI and
@@ -2601,6 +2640,14 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
congestion_wait(BLK_RW_ASYNC, HZ/10);
}
+ /*
+ * Legacy memcg will stall in page writeback so avoid forcibly
+ * stalling in wait_iff_congested().
+ */
+ if (!global_reclaim(sc) && sane_reclaim(sc) &&
+ sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
+ set_memcg_congestion(pgdat, root, true);
+
/*
* Stall direct reclaim for IO completions if underlying BDIs
* and node is congested. Allow kswapd to continue until it
@@ -2608,8 +2655,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
* the LRU too quickly.
*/
if (!sc->hibernation_mode && !current_is_kswapd() &&
- current_may_throttle())
- wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
+ current_may_throttle() && pgdat_memcg_congested(pgdat, root))
+ wait_iff_congested(BLK_RW_ASYNC, HZ/10);
} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
sc->nr_scanned - nr_scanned, sc));
@@ -2826,6 +2873,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
continue;
last_pgdat = zone->zone_pgdat;
snapshot_refaults(sc->target_mem_cgroup, zone->zone_pgdat);
+ set_memcg_congestion(last_pgdat, sc->target_mem_cgroup, false);
}
delayacct_freepages_end();
--
2.16.1