Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter
dirty pages on the LRU") added flusher invocation to
shrink_inactive_list() when many dirty pages on the LRU are encountered.
However, shrink_inactive_list() doesn't wake up flushers for legacy
cgroup reclaim, so the next commit bbef938429f5 ("mm: vmscan: remove
old flusher wakeup from direct reclaim path") removed the only source
of flusher wakeup in the legacy mem cgroup reclaim path.
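The reason legacy reclaim never reaches the wakeup is that the call sits
inside the sane_reclaim(sc) branch of shrink_inactive_list(), and
sane_reclaim() is false for cgroup v1 targeted reclaim. For context, a rough
sketch of that helper (paraphrased from mm/vmscan.c of this era, not part of
this patch) looks like this:
	/*
	 * Rough sketch of sane_reclaim() (approximate, for illustration):
	 * it returns false for legacy (cgroup v1) targeted reclaim, where
	 * balance_dirty_pages() cannot throttle the dirtier, so anything
	 * placed inside the sane_reclaim() branch is skipped in that case.
	 */
	static bool sane_reclaim(struct scan_control *sc)
	{
		struct mem_cgroup *memcg = sc->target_mem_cgroup;

		if (!memcg)
			return true;	/* global reclaim */
	#ifdef CONFIG_CGROUP_WRITEBACK
		if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
			return true;	/* cgroup v2 hierarchy */
	#endif
		return false;		/* legacy (v1) memcg reclaim */
	}
The patch below hoists wakeup_flusher_threads() above that branch, so legacy
memcg reclaim kicks the flushers too.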
This leads to premature OOM if there are too many dirty pages in the cgroup:
# mkdir /sys/fs/cgroup/memory/test
# echo $$ > /sys/fs/cgroup/memory/test/tasks
# echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
# dd if=/dev/zero of=tmp_file bs=1M count=100
Killed
dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
Call Trace:
dump_stack+0x46/0x65
dump_header+0x6b/0x2ac
oom_kill_process+0x21c/0x4a0
out_of_memory+0x2a5/0x4b0
mem_cgroup_out_of_memory+0x3b/0x60
mem_cgroup_oom_synchronize+0x2ed/0x330
pagefault_out_of_memory+0x24/0x54
__do_page_fault+0x521/0x540
page_fault+0x45/0x50
Task in /test killed as a result of limit of /test
memory: usage 51200kB, limit 51200kB, failcnt 73
memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Wake up flushers in legacy cgroup reclaim too.
Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
Signed-off-by: Andrey Ryabinin <[email protected]>
Cc: <[email protected]>
---
mm/vmscan.c | 31 ++++++++++++++++---------------
1 file changed, 16 insertions(+), 15 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8fcd9f8d7390..4390a8d5be41 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1771,6 +1771,20 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
if (stat.nr_writeback && stat.nr_writeback == nr_taken)
set_bit(PGDAT_WRITEBACK, &pgdat->flags);
+ /*
+ * If dirty pages are scanned that are not queued for IO, it
+ * implies that flushers are not doing their job. This can
+ * happen when memory pressure pushes dirty pages to the end of
+ * the LRU before the dirty limits are breached and the dirty
+ * data has expired. It can also happen when the proportion of
+ * dirty pages grows not through writes but through memory
+ * pressure reclaiming all the clean cache. And in some cases,
+ * the flushers simply cannot keep up with the allocation
+ * rate. Nudge the flusher threads in case they are asleep.
+ */
+ if (stat.nr_unqueued_dirty == nr_taken)
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+
/*
* Legacy memcg will stall in page writeback so avoid forcibly
* stalling here.
@@ -1783,22 +1797,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
set_bit(PGDAT_CONGESTED, &pgdat->flags);
- /*
- * If dirty pages are scanned that are not queued for IO, it
- * implies that flushers are not doing their job. This can
- * happen when memory pressure pushes dirty pages to the end of
- * the LRU before the dirty limits are breached and the dirty
- * data has expired. It can also happen when the proportion of
- * dirty pages grows not through writes but through memory
- * pressure reclaiming all the clean cache. And in some cases,
- * the flushers simply cannot keep up with the allocation
- * rate. Nudge the flusher threads in case they are asleep, but
- * also allow kswapd to start writing pages during reclaim.
- */
- if (stat.nr_unqueued_dirty == nr_taken) {
- wakeup_flusher_threads(WB_REASON_VMSCAN);
+ /* Allow kswapd to start writing pages during reclaim. */
+ if (stat.nr_unqueued_dirty == nr_taken)
set_bit(PGDAT_DIRTY, &pgdat->flags);
- }
/*
* If kswapd scans pages marked marked for immediate
--
2.16.1
memcg reclaim may alter pgdat->flags based on the state of the LRU lists
in a cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep
in congestion_wait(), PGDAT_DIRTY may force kswapd to write back filesystem
pages. But the worst here is PGDAT_CONGESTED, since it may force all
direct reclaims to stall in wait_iff_congested(). Note that only kswapd
has the power to clear any of these bits, and that might just never happen
if the cgroup limits are configured that way. So all direct reclaims will
stall as long as we have some congested bdi in the system.
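For context, the pre-patch gate in wait_iff_congested() that every direct
reclaimer goes through looks roughly like this (condensed from the removal
hunk below, so a sketch rather than the exact code):
	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
	    !test_bit(PGDAT_CONGESTED, &pgdat->flags)) {
		cond_resched();		/* no stall */
	} else {
		/* sleep on congestion_wqh[sync] for up to @timeout jiffies */
	}
Since PGDAT_CONGESTED is per node and only kswapd clears it, one congested
memcg can keep every direct reclaimer on that node sleeping here.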
Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
pgdat, so it's reasonable to leave all decisions about the node state
to kswapd. Also add per-cgroup congestion state to avoid needlessly
burning CPU in cgroup reclaim if heavy congestion is observed.
Currently there is no need for per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
bits since they alter only kswapd behavior.
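Condensed from the diff below, congestion becomes a per-memcg property for
cgroup reclaim, and the direct-reclaim throttling decision in shrink_node()
turns into:
	static bool pgdat_memcg_congested(pg_data_t *pgdat,
					  struct mem_cgroup *memcg)
	{
		return test_bit(PGDAT_CONGESTED, &pgdat->flags) ||
			(memcg && test_memcg_bit(PGDAT_CONGESTED, memcg));
	}

	/* in shrink_node(), for direct reclaim only: */
	if (!sc->hibernation_mode && !current_is_kswapd() &&
	    current_may_throttle() && pgdat_memcg_congested(pgdat, root))
		wait_iff_congested(BLK_RW_ASYNC, HZ/10);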
The problem could be easily demonstrated by creating heavy congestion
in one cgroup:
echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/congester
echo 512M > /sys/fs/cgroup/congester/memory.max
echo $$ > /sys/fs/cgroup/congester/cgroup.procs
/* generate a lot of dirty data on a slow HDD */
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
....
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
and some job in another cgroup:
mkdir /sys/fs/cgroup/victim
echo 128M > /sys/fs/cgroup/victim/memory.max
# time cat /dev/sda > /dev/null
real 10m15.054s
user 0m0.487s
sys 1m8.505s
According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
of its time sleeping there.
With the patch, cat doesn't waste time there anymore:
# time cat /dev/sda > /dev/null
real 5m32.911s
user 0m0.411s
sys 0m56.664s
Signed-off-by: Andrey Ryabinin <[email protected]>
---
include/linux/backing-dev.h | 2 +-
include/linux/memcontrol.h | 2 ++
mm/backing-dev.c | 19 ++++------
mm/vmscan.c | 84 ++++++++++++++++++++++++++++++++-------------
4 files changed, 70 insertions(+), 37 deletions(-)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index f8894dbc0b19..539a5cf94fe2 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -175,7 +175,7 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
}
long congestion_wait(int sync, long timeout);
-long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
+long wait_iff_congested(int sync, long timeout);
static inline bool bdi_cap_synchronous_io(struct backing_dev_info *bdi)
{
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4525b4404a9e..44422e1d3def 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -190,6 +190,8 @@ struct mem_cgroup {
/* vmpressure notifications */
struct vmpressure vmpressure;
+ unsigned long flags;
+
/*
* Should the accounting and control be hierarchical, per subtree?
*/
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 2eba1f54b1d3..2fc3f38e4c4f 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1055,23 +1055,18 @@ EXPORT_SYMBOL(congestion_wait);
/**
* wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
- * @pgdat: A pgdat to check if it is heavily congested
* @sync: SYNC or ASYNC IO
* @timeout: timeout in jiffies
*
- * In the event of a congested backing_dev (any backing_dev) and the given
- * @pgdat has experienced recent congestion, this waits for up to @timeout
- * jiffies for either a BDI to exit congestion of the given @sync queue
- * or a write to complete.
- *
- * In the absence of pgdat congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
+ * In the event of a congested backing_dev (any backing_dev) this waits
+ * for up to @timeout jiffies for either a BDI to exit congestion of the
+ * given @sync queue or a write to complete.
*
* The return value is 0 if the sleep is for the full timeout. Otherwise,
* it is the number of jiffies that were still remaining when the function
* returned. return_value == timeout implies the function did not sleep.
*/
-long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
+long wait_iff_congested(int sync, long timeout)
{
long ret;
unsigned long start = jiffies;
@@ -1079,12 +1074,10 @@ long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
wait_queue_head_t *wqh = &congestion_wqh[sync];
/*
- * If there is no congestion, or heavy congestion is not being
- * encountered in the current pgdat, yield if necessary instead
+ * If there is no congestion, yield if necessary instead
* of sleeping on the congestion queue
*/
- if (atomic_read(&nr_wb_congested[sync]) == 0 ||
- !test_bit(PGDAT_CONGESTED, &pgdat->flags)) {
+ if (atomic_read(&nr_wb_congested[sync]) == 0) {
cond_resched();
/* In case we scheduled, work out time remaining */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 522b480caeb2..a8690b0ec101 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -201,6 +201,18 @@ static bool sane_reclaim(struct scan_control *sc)
#endif
return false;
}
+
+static void set_memcg_bit(enum pgdat_flags flag,
+ struct mem_cgroup *memcg)
+{
+ set_bit(flag, &memcg->flags);
+}
+
+static int test_memcg_bit(enum pgdat_flags flag,
+ struct mem_cgroup *memcg)
+{
+ return test_bit(flag, &memcg->flags);
+}
#else
static bool global_reclaim(struct scan_control *sc)
{
@@ -211,6 +223,17 @@ static bool sane_reclaim(struct scan_control *sc)
{
return true;
}
+
+static inline void set_memcg_bit(enum pgdat_flags flag,
+ struct mem_cgroup *memcg)
+{
+}
+
+static inline int test_memcg_bit(enum pgdat_flags flag,
+ struct mem_cgroup *memcg)
+{
+ return 0;
+}
#endif
/*
@@ -2459,6 +2482,12 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
return true;
}
+static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
+{
+ return test_bit(PGDAT_CONGESTED, &pgdat->flags) ||
+ (memcg && test_memcg_bit(PGDAT_CONGESTED, memcg));
+}
+
static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
{
struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -2542,28 +2571,27 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
if (sc->nr_reclaimed - nr_reclaimed)
reclaimable = true;
- /*
- * If reclaim is isolating dirty pages under writeback, it implies
- * that the long-lived page allocation rate is exceeding the page
- * laundering rate. Either the global limits are not being effective
- * at throttling processes due to the page distribution throughout
- * zones or there is heavy usage of a slow backing device. The
- * only option is to throttle from reclaim context which is not ideal
- * as there is no guarantee the dirtying process is throttled in the
- * same way balance_dirty_pages() manages.
- *
- * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the number
- * of pages under pages flagged for immediate reclaim and stall if any
- * are encountered in the nr_immediate check below.
- */
- if (stat.nr_writeback && stat.nr_writeback == stat.nr_taken)
- set_bit(PGDAT_WRITEBACK, &pgdat->flags);
+ if (current_is_kswapd()) {
+ /*
+ * If reclaim is isolating dirty pages under writeback,
+ * it implies that the long-lived page allocation rate
+ * is exceeding the page laundering rate. Either the
+ * global limits are not being effective at throttling
+ * processes due to the page distribution throughout
+ * zones or there is heavy usage of a slow backing device.
+ * The only option is to throttle from reclaim context
+ * which is not ideal as there is no guarantee the
+ * dirtying process is throttled in the same way
+ * balance_dirty_pages() manages.
+ *
+ * Once a node is flagged PGDAT_WRITEBACK, kswapd will
+ * count the number of pages under pages flagged for
+ * immediate reclaim and stall if any are encountered
+ * in the nr_immediate check below.
+ */
+ if (stat.nr_writeback && stat.nr_writeback == stat.nr_taken)
+ set_bit(PGDAT_WRITEBACK, &pgdat->flags);
- /*
- * Legacy memcg will stall in page writeback so avoid forcibly
- * stalling here.
- */
- if (sane_reclaim(sc)) {
/*
* Tag a node as congested if all the dirty pages scanned were
* backed by a congested BDI and wait_iff_congested will stall.
@@ -2585,14 +2613,22 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
congestion_wait(BLK_RW_ASYNC, HZ/10);
}
+ /*
+ * Legacy memcg will stall in page writeback so avoid forcibly
+ * stalling in wait_iff_congested().
+ */
+ if (!global_reclaim(sc) && sane_reclaim(sc) &&
+ stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
+ set_memcg_bit(PGDAT_CONGESTED, root);
+
/*
* Stall direct reclaim for IO completions if underlying BDIs and node
* is congested. Allow kswapd to continue until it starts encountering
* unqueued dirty pages or cycling through the LRU too quickly.
*/
if (!sc->hibernation_mode && !current_is_kswapd() &&
- current_may_throttle())
- wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
+ current_may_throttle() && pgdat_memcg_congested(pgdat, root))
+ wait_iff_congested(BLK_RW_ASYNC, HZ/10);
} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
sc->nr_scanned - nr_scanned, sc));
@@ -3032,6 +3068,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
* the priority and make it zero.
*/
shrink_node_memcg(pgdat, memcg, &sc, &lru_pages);
+ clear_bit(PGDAT_CONGESTED, &memcg->flags);
trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
@@ -3077,6 +3114,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
noreclaim_flag = memalloc_noreclaim_save();
nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
memalloc_noreclaim_restore(noreclaim_flag);
+ clear_bit(PGDAT_CONGESTED, &memcg->flags);
trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
--
2.16.1
We have a separate LRU list for each memory cgroup. Memory reclaim iterates
over the cgroups and calls shrink_inactive_list() for every inactive LRU list.
Based on the state of a single LRU, shrink_inactive_list() may flag
the whole node as dirty, congested or under writeback. This is obviously
wrong and hurtful. It's especially hurtful when we have a possibly
small congested cgroup in the system. Then *all* direct reclaims waste time
sleeping in wait_iff_congested().
Sum the reclaim stats across all visited LRUs on the node and flag the node
as dirty, congested or under writeback based on that sum. This only fixes
the problem for the global reclaim case. Per-cgroup reclaim will be addressed
separately by the next patch.
This change will also affect systems with no memory cgroups. The reclaimer
now makes decisions based on the reclaim stats of both the anon and file LRU
lists. E.g. if the file list is in a congested state and get_scan_count()
decided to reclaim some anon pages, the reclaimer will start shrinking
anon without the delay in wait_iff_congested() it would have incurred before.
That seems to be a reasonable thing to do: why waste time sleeping before
reclaiming anon, given that we are going to try to reclaim it anyway?
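Condensed from the diff below, shrink_inactive_list() now only accumulates
its per-LRU numbers into the scan_control, and shrink_node() acts on the
node-wide sums after walking all memcgs on the node; for example:
	/* shrink_inactive_list(): accumulate, decide nothing here */
	if (sc->stat) {
		sc->stat->nr_dirty += stat.nr_dirty;
		sc->stat->nr_congested += stat.nr_congested;
		sc->stat->nr_unqueued_dirty += stat.nr_unqueued_dirty;
		sc->stat->nr_writeback += stat.nr_writeback;
		sc->stat->nr_immediate += stat.nr_immediate;
		sc->stat->nr_taken += stat.nr_taken;
	}

	/* shrink_node(): flag the node based on the summed stats */
	if (stat.nr_writeback && stat.nr_writeback == stat.nr_taken)
		set_bit(PGDAT_WRITEBACK, &pgdat->flags);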
Signed-off-by: Andrey Ryabinin <[email protected]>
---
mm/vmscan.c | 131 +++++++++++++++++++++++++++++++++---------------------------
1 file changed, 73 insertions(+), 58 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a8f6e4882e00..522b480caeb2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -61,6 +61,15 @@
#define CREATE_TRACE_POINTS
#include <trace/events/vmscan.h>
+struct reclaim_stat {
+ unsigned int nr_dirty;
+ unsigned int nr_unqueued_dirty;
+ unsigned int nr_congested;
+ unsigned int nr_writeback;
+ unsigned int nr_immediate;
+ unsigned int nr_taken;
+};
+
struct scan_control {
/* How many pages shrink_list() should reclaim */
unsigned long nr_to_reclaim;
@@ -116,6 +125,8 @@ struct scan_control {
/* Number of pages freed so far during a call to shrink_zones() */
unsigned long nr_reclaimed;
+
+ struct reclaim_stat *stat;
};
#ifdef ARCH_HAS_PREFETCH
@@ -857,14 +868,6 @@ static void page_check_dirty_writeback(struct page *page,
mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
}
-struct reclaim_stat {
- unsigned nr_dirty;
- unsigned nr_unqueued_dirty;
- unsigned nr_congested;
- unsigned nr_writeback;
- unsigned nr_immediate;
-};
-
/*
* shrink_page_list() returns the number of reclaimed pages
*/
@@ -1753,23 +1756,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
mem_cgroup_uncharge_list(&page_list);
free_unref_page_list(&page_list);
- /*
- * If reclaim is isolating dirty pages under writeback, it implies
- * that the long-lived page allocation rate is exceeding the page
- * laundering rate. Either the global limits are not being effective
- * at throttling processes due to the page distribution throughout
- * zones or there is heavy usage of a slow backing device. The
- * only option is to throttle from reclaim context which is not ideal
- * as there is no guarantee the dirtying process is throttled in the
- * same way balance_dirty_pages() manages.
- *
- * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the number
- * of pages under pages flagged for immediate reclaim and stall if any
- * are encountered in the nr_immediate check below.
- */
- if (stat.nr_writeback && stat.nr_writeback == nr_taken)
- set_bit(PGDAT_WRITEBACK, &pgdat->flags);
-
/*
* If dirty pages are scanned that are not queued for IO, it
* implies that flushers are not doing their job. This can
@@ -1784,41 +1770,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
if (stat.nr_unqueued_dirty == nr_taken)
wakeup_flusher_threads(WB_REASON_VMSCAN);
- /*
- * Legacy memcg will stall in page writeback so avoid forcibly
- * stalling here.
- */
- if (sane_reclaim(sc)) {
- /*
- * Tag a node as congested if all the dirty pages scanned were
- * backed by a congested BDI and wait_iff_congested will stall.
- */
- if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
- set_bit(PGDAT_CONGESTED, &pgdat->flags);
-
- /* Allow kswapd to start writing pages during reclaim. */
- if (stat.nr_unqueued_dirty == nr_taken)
- set_bit(PGDAT_DIRTY, &pgdat->flags);
-
- /*
- * If kswapd scans pages marked marked for immediate
- * reclaim and under writeback (nr_immediate), it implies
- * that pages are cycling through the LRU faster than
- * they are written so also forcibly stall.
- */
- if (stat.nr_immediate)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ if (sc->stat) {
+ sc->stat->nr_dirty += stat.nr_dirty;
+ sc->stat->nr_congested += stat.nr_congested;
+ sc->stat->nr_unqueued_dirty += stat.nr_unqueued_dirty;
+ sc->stat->nr_writeback += stat.nr_writeback;
+ sc->stat->nr_immediate += stat.nr_immediate;
+ sc->stat->nr_taken += stat.nr_taken;
}
- /*
- * Stall direct reclaim for IO completions if underlying BDIs and node
- * is congested. Allow kswapd to continue until it starts encountering
- * unqueued dirty pages or cycling through the LRU too quickly.
- */
- if (!sc->hibernation_mode && !current_is_kswapd() &&
- current_may_throttle())
- wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
-
return nr_reclaimed;
}
@@ -2513,6 +2473,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
};
unsigned long node_lru_pages = 0;
struct mem_cgroup *memcg;
+ struct reclaim_stat stat = {};
+
+ sc->stat = &stat;
nr_reclaimed = sc->nr_reclaimed;
nr_scanned = sc->nr_scanned;
@@ -2579,6 +2542,58 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
if (sc->nr_reclaimed - nr_reclaimed)
reclaimable = true;
+ /*
+ * If reclaim is isolating dirty pages under writeback, it implies
+ * that the long-lived page allocation rate is exceeding the page
+ * laundering rate. Either the global limits are not being effective
+ * at throttling processes due to the page distribution throughout
+ * zones or there is heavy usage of a slow backing device. The
+ * only option is to throttle from reclaim context which is not ideal
+ * as there is no guarantee the dirtying process is throttled in the
+ * same way balance_dirty_pages() manages.
+ *
+ * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the number
+ * of pages under pages flagged for immediate reclaim and stall if any
+ * are encountered in the nr_immediate check below.
+ */
+ if (stat.nr_writeback && stat.nr_writeback == stat.nr_taken)
+ set_bit(PGDAT_WRITEBACK, &pgdat->flags);
+
+ /*
+ * Legacy memcg will stall in page writeback so avoid forcibly
+ * stalling here.
+ */
+ if (sane_reclaim(sc)) {
+ /*
+ * Tag a node as congested if all the dirty pages scanned were
+ * backed by a congested BDI and wait_iff_congested will stall.
+ */
+ if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
+ set_bit(PGDAT_CONGESTED, &pgdat->flags);
+
+ /* Allow kswapd to start writing pages during reclaim. */
+ if (stat.nr_unqueued_dirty == stat.nr_taken)
+ set_bit(PGDAT_DIRTY, &pgdat->flags);
+
+ /*
+ * If kswapd scans pages marked marked for immediate
+ * reclaim and under writeback (nr_immediate), it implies
+ * that pages are cycling through the LRU faster than
+ * they are written so also forcibly stall.
+ */
+ if (stat.nr_immediate)
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ }
+
+ /*
+ * Stall direct reclaim for IO completions if underlying BDIs and node
+ * is congested. Allow kswapd to continue until it starts encountering
+ * unqueued dirty pages or cycling through the LRU too quickly.
+ */
+ if (!sc->hibernation_mode && !current_is_kswapd() &&
+ current_may_throttle())
+ wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
+
} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
sc->nr_scanned - nr_scanned, sc));
--
2.16.1
With upcoming changes, keeping the mm_vmscan_lru_shrink_inactive tracepoint
intact would require some additional code churn. In particular,
'struct reclaim_stat' will gain wider usage, but we don't need the
'nr_activate', 'nr_ref_keep' and 'nr_unmap_fail' counters anywhere besides
the tracepoint.
Since the mm_vmscan_lru_shrink_inactive tracepoint mostly provides
information collected by shrink_page_list(), we can simply replace it
with a tracepoint in shrink_page_list(). We don't have the 'nr_scanned'
and 'file' arguments there, but users can obtain this information
from the mm_vmscan_lru_isolate tracepoint.
Signed-off-by: Andrey Ryabinin <[email protected]>
---
include/trace/events/vmscan.h | 36 +++++++++++++++---------------------
mm/vmscan.c | 18 +++++-------------
2 files changed, 20 insertions(+), 34 deletions(-)
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 6570c5b45ba1..8743a8113b42 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -342,23 +342,21 @@ TRACE_EVENT(mm_vmscan_writepage,
show_reclaim_flags(__entry->reclaim_flags))
);
-TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
+TRACE_EVENT(mm_vmscan_shrink_page_list,
TP_PROTO(int nid,
- unsigned long nr_scanned, unsigned long nr_reclaimed,
- unsigned long nr_dirty, unsigned long nr_writeback,
- unsigned long nr_congested, unsigned long nr_immediate,
- unsigned long nr_activate, unsigned long nr_ref_keep,
- unsigned long nr_unmap_fail,
- int priority, int file),
-
- TP_ARGS(nid, nr_scanned, nr_reclaimed, nr_dirty, nr_writeback,
+ unsigned long nr_reclaimed, unsigned long nr_dirty,
+ unsigned long nr_writeback, unsigned long nr_congested,
+ unsigned long nr_immediate, unsigned long nr_activate,
+ unsigned long nr_ref_keep, unsigned long nr_unmap_fail,
+ int priority),
+
+ TP_ARGS(nid, nr_reclaimed, nr_dirty, nr_writeback,
nr_congested, nr_immediate, nr_activate, nr_ref_keep,
- nr_unmap_fail, priority, file),
+ nr_unmap_fail, priority),
TP_STRUCT__entry(
__field(int, nid)
- __field(unsigned long, nr_scanned)
__field(unsigned long, nr_reclaimed)
__field(unsigned long, nr_dirty)
__field(unsigned long, nr_writeback)
@@ -368,12 +366,10 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
__field(unsigned long, nr_ref_keep)
__field(unsigned long, nr_unmap_fail)
__field(int, priority)
- __field(int, reclaim_flags)
),
TP_fast_assign(
__entry->nid = nid;
- __entry->nr_scanned = nr_scanned;
__entry->nr_reclaimed = nr_reclaimed;
__entry->nr_dirty = nr_dirty;
__entry->nr_writeback = nr_writeback;
@@ -383,17 +379,15 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
__entry->nr_ref_keep = nr_ref_keep;
__entry->nr_unmap_fail = nr_unmap_fail;
__entry->priority = priority;
- __entry->reclaim_flags = trace_shrink_flags(file);
),
- TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld nr_dirty=%ld nr_writeback=%ld nr_congested=%ld nr_immediate=%ld nr_activate=%ld nr_ref_keep=%ld nr_unmap_fail=%ld priority=%d flags=%s",
+ TP_printk("nid=%d nr_reclaimed=%ld nr_dirty=%ld nr_writeback=%ld nr_congested=%ld nr_immediate=%ld nr_activate=%ld nr_ref_keep=%ld nr_unmap_fail=%ld priority=%d",
__entry->nid,
- __entry->nr_scanned, __entry->nr_reclaimed,
- __entry->nr_dirty, __entry->nr_writeback,
- __entry->nr_congested, __entry->nr_immediate,
- __entry->nr_activate, __entry->nr_ref_keep,
- __entry->nr_unmap_fail, __entry->priority,
- show_reclaim_flags(__entry->reclaim_flags))
+ __entry->nr_reclaimed, __entry->nr_dirty,
+ __entry->nr_writeback, __entry->nr_congested,
+ __entry->nr_immediate, __entry->nr_activate,
+ __entry->nr_ref_keep, __entry->nr_unmap_fail,
+ __entry->priority)
);
TRACE_EVENT(mm_vmscan_lru_shrink_active,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6d74b12099bd..0d5ab312a7f4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -863,9 +863,6 @@ struct reclaim_stat {
unsigned nr_congested;
unsigned nr_writeback;
unsigned nr_immediate;
- unsigned nr_activate;
- unsigned nr_ref_keep;
- unsigned nr_unmap_fail;
};
/*
@@ -1271,15 +1268,17 @@ static unsigned long shrink_page_list(struct list_head *page_list,
list_splice(&ret_pages, page_list);
count_vm_events(PGACTIVATE, pgactivate);
+ trace_mm_vmscan_shrink_page_list(pgdat->node_id,
+ nr_reclaimed, nr_dirty, nr_writeback, nr_congested,
+ nr_immediate, pgactivate, nr_ref_keep, nr_unmap_fail,
+ sc->priority);
+
if (stat) {
stat->nr_dirty = nr_dirty;
stat->nr_congested = nr_congested;
stat->nr_unqueued_dirty = nr_unqueued_dirty;
stat->nr_writeback = nr_writeback;
stat->nr_immediate = nr_immediate;
- stat->nr_activate = pgactivate;
- stat->nr_ref_keep = nr_ref_keep;
- stat->nr_unmap_fail = nr_unmap_fail;
}
return nr_reclaimed;
}
@@ -1820,13 +1819,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
current_may_throttle())
wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
- trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
- nr_scanned, nr_reclaimed,
- stat.nr_dirty, stat.nr_writeback,
- stat.nr_congested, stat.nr_immediate,
- stat.nr_activate, stat.nr_ref_keep,
- stat.nr_unmap_fail,
- sc->priority, file);
return nr_reclaimed;
}
--
2.16.1
Only kswapd can have a non-zero nr_immediate, and current_may_throttle() is
always true for kswapd (the PF_LESS_THROTTLE bit is never set for it), so it's
enough to check stat.nr_immediate alone.
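For reference, current_may_throttle() of this era looks roughly like the
following (paraphrased, not part of the patch); since kswapd never sets
PF_LESS_THROTTLE, the first clause alone already makes it return true:
	static int current_may_throttle(void)
	{
		/*
		 * Kernel threads servicing a backing device (e.g. nfsd for
		 * loop-back mounts) set PF_LESS_THROTTLE and are throttled
		 * only when their own bdi is congested; everyone else,
		 * including kswapd, may always be throttled.
		 */
		return !(current->flags & PF_LESS_THROTTLE) ||
			current->backing_dev_info == NULL ||
			bdi_write_congested(current->backing_dev_info);
	}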
Signed-off-by: Andrey Ryabinin <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0d5ab312a7f4..a8f6e4882e00 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1806,7 +1806,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* that pages are cycling through the LRU faster than
* they are written so also forcibly stall.
*/
- if (stat.nr_immediate && current_may_throttle())
+ if (stat.nr_immediate)
congestion_wait(BLK_RW_ASYNC, HZ/10);
}
--
2.16.1
Update some comments that became stale since the transition from per-zone
to per-node reclaim.
Signed-off-by: Andrey Ryabinin <[email protected]>
---
mm/vmscan.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4390a8d5be41..6d74b12099bd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -926,7 +926,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
/*
- * The number of dirty pages determines if a zone is marked
+ * The number of dirty pages determines if a node is marked
* reclaim_congested which affects wait_iff_congested. kswapd
* will stall and start writing pages if the tail of the LRU
* is all dirty unqueued pages.
@@ -1764,7 +1764,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* as there is no guarantee the dirtying process is throttled in the
* same way balance_dirty_pages() manages.
*
- * Once a zone is flagged ZONE_WRITEBACK, kswapd will count the number
+ * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the number
* of pages under pages flagged for immediate reclaim and stall if any
* are encountered in the nr_immediate check below.
*/
@@ -1791,7 +1791,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
*/
if (sane_reclaim(sc)) {
/*
- * Tag a zone as congested if all the dirty pages scanned were
+ * Tag a node as congested if all the dirty pages scanned were
* backed by a congested BDI and wait_iff_congested will stall.
*/
if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
@@ -1812,7 +1812,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
}
/*
- * Stall direct reclaim for IO completions if underlying BDIs or zone
+ * Stall direct reclaim for IO completions if underlying BDIs and node
* is congested. Allow kswapd to continue until it starts encountering
* unqueued dirty pages or cycling through the LRU too quickly.
*/
@@ -3808,7 +3808,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
/*
- * Free memory by calling shrink zone with increasing
+ * Free memory by calling shrink node with increasing
* priorities until we have enough memory freed.
*/
do {
--
2.16.1
On Thu, Mar 15, 2018 at 9:45 AM, Andrey Ryabinin
<[email protected]> wrote:
> Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter
> dirty pages on the LRU") added flusher invocation to
> shrink_inactive_list() when many dirty pages on the LRU are encountered.
>
> However, shrink_inactive_list() doesn't wake up flushers for legacy
> cgroup reclaim, so the next commit bbef938429f5 ("mm: vmscan: remove
> old flusher wakeup from direct reclaim path") removed the only source
> of flusher wakeup in the legacy mem cgroup reclaim path.
>
> This leads to premature OOM if there are too many dirty pages in the cgroup:
> # mkdir /sys/fs/cgroup/memory/test
> # echo $$ > /sys/fs/cgroup/memory/test/tasks
> # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> # dd if=/dev/zero of=tmp_file bs=1M count=100
> Killed
>
> dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
>
> Call Trace:
> dump_stack+0x46/0x65
> dump_header+0x6b/0x2ac
> oom_kill_process+0x21c/0x4a0
> out_of_memory+0x2a5/0x4b0
> mem_cgroup_out_of_memory+0x3b/0x60
> mem_cgroup_oom_synchronize+0x2ed/0x330
> pagefault_out_of_memory+0x24/0x54
> __do_page_fault+0x521/0x540
> page_fault+0x45/0x50
>
> Task in /test killed as a result of limit of /test
> memory: usage 51200kB, limit 51200kB, failcnt 73
> memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
> kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
> Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
> mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
> active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
> Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
> Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
> oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
> Wake up flushers in legacy cgroup reclaim too.
>
> Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
> Signed-off-by: Andrey Ryabinin <[email protected]>
Indeed I am seeing oom-kills without the patch.
Tested-by: Shakeel Butt <[email protected]>
> Cc: <[email protected]>
> ---
> mm/vmscan.c | 31 ++++++++++++++++---------------
> 1 file changed, 16 insertions(+), 15 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8fcd9f8d7390..4390a8d5be41 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1771,6 +1771,20 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> if (stat.nr_writeback && stat.nr_writeback == nr_taken)
> set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
> + /*
> + * If dirty pages are scanned that are not queued for IO, it
> + * implies that flushers are not doing their job. This can
> + * happen when memory pressure pushes dirty pages to the end of
> + * the LRU before the dirty limits are breached and the dirty
> + * data has expired. It can also happen when the proportion of
> + * dirty pages grows not through writes but through memory
> + * pressure reclaiming all the clean cache. And in some cases,
> + * the flushers simply cannot keep up with the allocation
> + * rate. Nudge the flusher threads in case they are asleep.
> + */
> + if (stat.nr_unqueued_dirty == nr_taken)
> + wakeup_flusher_threads(WB_REASON_VMSCAN);
> +
> /*
> * Legacy memcg will stall in page writeback so avoid forcibly
> * stalling here.
> @@ -1783,22 +1797,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
> set_bit(PGDAT_CONGESTED, &pgdat->flags);
>
> - /*
> - * If dirty pages are scanned that are not queued for IO, it
> - * implies that flushers are not doing their job. This can
> - * happen when memory pressure pushes dirty pages to the end of
> - * the LRU before the dirty limits are breached and the dirty
> - * data has expired. It can also happen when the proportion of
> - * dirty pages grows not through writes but through memory
> - * pressure reclaiming all the clean cache. And in some cases,
> - * the flushers simply cannot keep up with the allocation
> - * rate. Nudge the flusher threads in case they are asleep, but
> - * also allow kswapd to start writing pages during reclaim.
> - */
> - if (stat.nr_unqueued_dirty == nr_taken) {
> - wakeup_flusher_threads(WB_REASON_VMSCAN);
> + /* Allow kswapd to start writing pages during reclaim. */
> + if (stat.nr_unqueued_dirty == nr_taken)
> set_bit(PGDAT_DIRTY, &pgdat->flags);
> - }
>
> /*
> * If kswapd scans pages marked marked for immediate
> --
> 2.16.1
>
On Thu 15-03-18 19:45:49, Andrey Ryabinin wrote:
> Update some comments that became stale since the transition from per-zone
> to per-node reclaim.
>
> Signed-off-by: Andrey Ryabinin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> mm/vmscan.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4390a8d5be41..6d74b12099bd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -926,7 +926,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
>
> /*
> - * The number of dirty pages determines if a zone is marked
> + * The number of dirty pages determines if a node is marked
> * reclaim_congested which affects wait_iff_congested. kswapd
> * will stall and start writing pages if the tail of the LRU
> * is all dirty unqueued pages.
> @@ -1764,7 +1764,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> * as there is no guarantee the dirtying process is throttled in the
> * same way balance_dirty_pages() manages.
> *
> - * Once a zone is flagged ZONE_WRITEBACK, kswapd will count the number
> + * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the number
> * of pages under pages flagged for immediate reclaim and stall if any
> * are encountered in the nr_immediate check below.
> */
> @@ -1791,7 +1791,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> */
> if (sane_reclaim(sc)) {
> /*
> - * Tag a zone as congested if all the dirty pages scanned were
> + * Tag a node as congested if all the dirty pages scanned were
> * backed by a congested BDI and wait_iff_congested will stall.
> */
> if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
> @@ -1812,7 +1812,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> }
>
> /*
> - * Stall direct reclaim for IO completions if underlying BDIs or zone
> + * Stall direct reclaim for IO completions if underlying BDIs and node
> * is congested. Allow kswapd to continue until it starts encountering
> * unqueued dirty pages or cycling through the LRU too quickly.
> */
> @@ -3808,7 +3808,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>
> if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
> /*
> - * Free memory by calling shrink zone with increasing
> + * Free memory by calling shrink node with increasing
> * priorities until we have enough memory freed.
> */
> do {
> --
> 2.16.1
--
Michal Hocko
SUSE Labs
On Thu 15-03-18 19:45:48, Andrey Ryabinin wrote:
> Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter
> dirty pages on the LRU") added flusher invocation to
> shrink_inactive_list() when many dirty pages on the LRU are encountered.
>
> However, shrink_inactive_list() doesn't wake up flushers for legacy
> cgroup reclaim, so the next commit bbef938429f5 ("mm: vmscan: remove
> old flusher wakeup from direct reclaim path") removed the only source
> of flusher wakeup in the legacy mem cgroup reclaim path.
>
> This leads to premature OOM if there are too many dirty pages in the cgroup:
> # mkdir /sys/fs/cgroup/memory/test
> # echo $$ > /sys/fs/cgroup/memory/test/tasks
> # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> # dd if=/dev/zero of=tmp_file bs=1M count=100
> Killed
>
> dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
>
> Call Trace:
> dump_stack+0x46/0x65
> dump_header+0x6b/0x2ac
> oom_kill_process+0x21c/0x4a0
> out_of_memory+0x2a5/0x4b0
> mem_cgroup_out_of_memory+0x3b/0x60
> mem_cgroup_oom_synchronize+0x2ed/0x330
> pagefault_out_of_memory+0x24/0x54
> __do_page_fault+0x521/0x540
> page_fault+0x45/0x50
>
> Task in /test killed as a result of limit of /test
> memory: usage 51200kB, limit 51200kB, failcnt 73
> memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
> kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
> Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
> mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
> active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
> Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
> Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
> oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
> Wake up flushers in legacy cgroup reclaim too.
>
> Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
> Signed-off-by: Andrey Ryabinin <[email protected]>
> Cc: <[email protected]>
Yes, this makes sense. We are stalling on writeback pages for the legacy
memcg but we do not have any way to throttle dirty pages before the
writeback kicks in.
Acked-by: Michal Hocko <[email protected]>
> ---
> mm/vmscan.c | 31 ++++++++++++++++---------------
> 1 file changed, 16 insertions(+), 15 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8fcd9f8d7390..4390a8d5be41 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1771,6 +1771,20 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> if (stat.nr_writeback && stat.nr_writeback == nr_taken)
> set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
> + /*
> + * If dirty pages are scanned that are not queued for IO, it
> + * implies that flushers are not doing their job. This can
> + * happen when memory pressure pushes dirty pages to the end of
> + * the LRU before the dirty limits are breached and the dirty
> + * data has expired. It can also happen when the proportion of
> + * dirty pages grows not through writes but through memory
> + * pressure reclaiming all the clean cache. And in some cases,
> + * the flushers simply cannot keep up with the allocation
> + * rate. Nudge the flusher threads in case they are asleep.
> + */
> + if (stat.nr_unqueued_dirty == nr_taken)
> + wakeup_flusher_threads(WB_REASON_VMSCAN);
> +
> /*
> * Legacy memcg will stall in page writeback so avoid forcibly
> * stalling here.
> @@ -1783,22 +1797,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
> set_bit(PGDAT_CONGESTED, &pgdat->flags);
>
> - /*
> - * If dirty pages are scanned that are not queued for IO, it
> - * implies that flushers are not doing their job. This can
> - * happen when memory pressure pushes dirty pages to the end of
> - * the LRU before the dirty limits are breached and the dirty
> - * data has expired. It can also happen when the proportion of
> - * dirty pages grows not through writes but through memory
> - * pressure reclaiming all the clean cache. And in some cases,
> - * the flushers simply cannot keep up with the allocation
> - * rate. Nudge the flusher threads in case they are asleep, but
> - * also allow kswapd to start writing pages during reclaim.
> - */
> - if (stat.nr_unqueued_dirty == nr_taken) {
> - wakeup_flusher_threads(WB_REASON_VMSCAN);
> + /* Allow kswapd to start writing pages during reclaim. */
> + if (stat.nr_unqueued_dirty == nr_taken)
> set_bit(PGDAT_DIRTY, &pgdat->flags);
> - }
>
> /*
> * If kswapd scans pages marked marked for immediate
> --
> 2.16.1
--
Michal Hocko
SUSE Labs
On Thu 15-03-18 19:45:51, Andrey Ryabinin wrote:
> Only kswapd can have non-zero nr_immediate, and current_may_throttle() is
> always true for kswapd (PF_LESS_THROTTLE bit is never set) thus it's
> enough to check stat.nr_immediate only.
OK, so this is a result of some code evolution. We used to check for
dirty pages as well. But then b738d764652d ("Don't trigger congestion
wait on dirty-but-not-writeout pages") removed the nr_unqueued_dirty ==
nr_taken part. I was wondering whether we still have the
PF_LESS_THROTTLE protection in place but then noticed that we still have
if (!sc->hibernation_mode && !current_is_kswapd() &&
current_may_throttle())
wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
in place, so good.
> Signed-off-by: Andrey Ryabinin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> mm/vmscan.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0d5ab312a7f4..a8f6e4882e00 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1806,7 +1806,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> * that pages are cycling through the LRU faster than
> * they are written so also forcibly stall.
> */
> - if (stat.nr_immediate && current_may_throttle())
> + if (stat.nr_immediate)
> congestion_wait(BLK_RW_ASYNC, HZ/10);
> }
>
> --
> 2.16.1
--
Michal Hocko
SUSE Labs
On Thu 15-03-18 19:45:52, Andrey Ryabinin wrote:
> We have separate LRU list for each memory cgroup. Memory reclaim iterates
> over cgroups and calls shrink_inactive_list() every inactive LRU list.
> Based on the state of a single LRU shrink_inactive_list() may flag
> the whole node as dirty,congested or under writeback. This is obviously
> wrong and hurtful. It's especially hurtful when we have possibly
> small congested cgroup in system. Then *all* direct reclaims waste time
> by sleeping in wait_iff_congested().
I assume you have seen this in real workloads. Could you be more
specific about how you noticed the problem?
> Sum reclaim stats across all visited LRUs on node and flag node as dirty,
> congested or under writeback based on that sum. This only fixes the
> problem for global reclaim case. Per-cgroup reclaim will be addressed
> separately by the next patch.
>
> This change will also affect systems with no memory cgroups. Reclaimer
> now makes decision based on reclaim stats of the both anon and file LRU
> lists. E.g. if the file list is in congested state and get_scan_count()
> decided to reclaim some anon pages, reclaimer will start shrinking
> anon without delay in wait_iff_congested() like it was before. It seems
> to be a reasonable thing to do. Why waste time sleeping, before reclaiming
> anon given that we going to try to reclaim it anyway?
Well, if we have few anon pages in the mix then we stop throttling the
reclaim, I am afraid. I am worried this might get us kswapd hogging CPU
problems back.
[...]
> @@ -2513,6 +2473,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> };
> unsigned long node_lru_pages = 0;
> struct mem_cgroup *memcg;
> + struct reclaim_stat stat = {};
> +
> + sc->stat = &stat;
>
> nr_reclaimed = sc->nr_reclaimed;
> nr_scanned = sc->nr_scanned;
> @@ -2579,6 +2542,58 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> if (sc->nr_reclaimed - nr_reclaimed)
> reclaimable = true;
>
> + /*
> + * If reclaim is isolating dirty pages under writeback, it implies
> + * that the long-lived page allocation rate is exceeding the page
> + * laundering rate. Either the global limits are not being effective
> + * at throttling processes due to the page distribution throughout
> + * zones or there is heavy usage of a slow backing device. The
> + * only option is to throttle from reclaim context which is not ideal
> + * as there is no guarantee the dirtying process is throttled in the
> + * same way balance_dirty_pages() manages.
> + *
> + * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the number
> + * of pages under pages flagged for immediate reclaim and stall if any
> + * are encountered in the nr_immediate check below.
> + */
> + if (stat.nr_writeback && stat.nr_writeback == stat.nr_taken)
> + set_bit(PGDAT_WRITEBACK, &pgdat->flags);
> +
> + /*
> + * Legacy memcg will stall in page writeback so avoid forcibly
> + * stalling here.
> + */
> + if (sane_reclaim(sc)) {
> + /*
> + * Tag a node as congested if all the dirty pages scanned were
> + * backed by a congested BDI and wait_iff_congested will stall.
> + */
> + if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
> + set_bit(PGDAT_CONGESTED, &pgdat->flags);
> +
> + /* Allow kswapd to start writing pages during reclaim. */
> + if (stat.nr_unqueued_dirty == stat.nr_taken)
> + set_bit(PGDAT_DIRTY, &pgdat->flags);
> +
> + /*
> + * If kswapd scans pages marked marked for immediate
> + * reclaim and under writeback (nr_immediate), it implies
> + * that pages are cycling through the LRU faster than
> + * they are written so also forcibly stall.
> + */
> + if (stat.nr_immediate)
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> + }
> +
> + /*
> + * Stall direct reclaim for IO completions if underlying BDIs and node
> + * is congested. Allow kswapd to continue until it starts encountering
> + * unqueued dirty pages or cycling through the LRU too quickly.
> + */
> + if (!sc->hibernation_mode && !current_is_kswapd() &&
> + current_may_throttle())
> + wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
> +
> } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> sc->nr_scanned - nr_scanned, sc));
Why didn't you put the whole thing after the loop?
--
Michal Hocko
SUSE Labs
On Thu 15-03-18 19:45:53, Andrey Ryabinin wrote:
> memcg reclaim may alter pgdat->flags based on the state of LRU lists
> in a cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep in
> congestion_wait(), PGDAT_DIRTY may force kswapd to write back filesystem
> pages. But the worst here is PGDAT_CONGESTED, since it may force all
> direct reclaims to stall in wait_iff_congested(). Note that only kswapd
> has the power to clear any of these bits. This might just never happen if
> the cgroup limits are configured that way. So all direct reclaims will stall
> as long as we have some congested bdi in the system.
>
> Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
> pgdat, so it's reasonable to leave all decisions about node stat
> to kswapd. Also add per-cgroup congestion state to avoid needlessly
> burning CPU in cgroup reclaim if heavy congestion is observed.
>
> Currently there is no need in per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
> bits since they alter only kswapd behavior.
>
> The problem could be easily demonstrated by creating heavy congestion
> in one cgroup:
>
> echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> mkdir -p /sys/fs/cgroup/congester
> echo 512M > /sys/fs/cgroup/congester/memory.max
> echo $$ > /sys/fs/cgroup/congester/cgroup.procs
> /* generate a lot of dirty data on a slow HDD */
> while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
> ....
> while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
>
> and some job in another cgroup:
>
> mkdir /sys/fs/cgroup/victim
> echo 128M > /sys/fs/cgroup/victim/memory.max
>
> # time cat /dev/sda > /dev/null
> real 10m15.054s
> user 0m0.487s
> sys 1m8.505s
>
> According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
> of the time sleeping there.
>
> With the patch, cat doesn't waste time anymore:
>
> # time cat /dev/sda > /dev/null
> real 5m32.911s
> user 0m0.411s
> sys 0m56.664s
>
> Signed-off-by: Andrey Ryabinin <[email protected]>
> ---
> include/linux/backing-dev.h | 2 +-
> include/linux/memcontrol.h | 2 ++
> mm/backing-dev.c | 19 ++++------
> mm/vmscan.c | 84 ++++++++++++++++++++++++++++++++-------------
> 4 files changed, 70 insertions(+), 37 deletions(-)
This patch seems overly complicated. Why don't you simply reduce the whole
pgdat_flags handling to global_reclaim()?
--
Michal Hocko
SUSE Labs
On 03/20/2018 06:25 PM, Michal Hocko wrote:
> On Thu 15-03-18 19:45:52, Andrey Ryabinin wrote:
>> We have separate LRU list for each memory cgroup. Memory reclaim iterates
>> over cgroups and calls shrink_inactive_list() every inactive LRU list.
>> Based on the state of a single LRU shrink_inactive_list() may flag
>> the whole node as dirty,congested or under writeback. This is obviously
>> wrong and hurtful. It's especially hurtful when we have possibly
>> small congested cgroup in system. Then *all* direct reclaims waste time
>> by sleeping in wait_iff_congested().
>
> I assume you have seen this in real workloads. Could you be more
> specific about how you noticed the problem?
>
Does it matter? One of our userspace processes has some sort of watchdog.
When it doesn't receive some event in time, it complains that the process is stuck.
In this case an in-kernel allocation was stuck in wait_iff_congested().
>> Sum reclaim stats across all visited LRUs on node and flag node as dirty,
>> congested or under writeback based on that sum. This only fixes the
>> problem for global reclaim case. Per-cgroup reclaim will be addressed
>> separately by the next patch.
>>
>> This change will also affect systems with no memory cgroups. Reclaimer
>> now makes decision based on reclaim stats of the both anon and file LRU
>> lists. E.g. if the file list is in congested state and get_scan_count()
>> decided to reclaim some anon pages, reclaimer will start shrinking
>> anon without delay in wait_iff_congested() like it was before. It seems
>> to be a reasonable thing to do. Why waste time sleeping, before reclaiming
>> anon given that we going to try to reclaim it anyway?
>
> Well, if we have few anon pages in the mix then we stop throttling the
> reclaim, I am afraid. I am worried this might get us kswapd hogging CPU
> problems back.
>
Yeah, it's not an ideal choice. If only a few anon pages are taken then *not* throttling is bad,
and if a few file pages and many anon are taken then *not* throttling is probably good.
Anyway, this requires more thought, research, justification, etc.
I'll change the patch to take only file pages into account, as it was before the patch.
> [...]
>> @@ -2579,6 +2542,58 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>> if (sc->nr_reclaimed - nr_reclaimed)
>> reclaimable = true;
>>
>> + /*
>> + * If reclaim is isolating dirty pages under writeback, it implies
>> + * that the long-lived page allocation rate is exceeding the page
>> + * laundering rate. Either the global limits are not being effective
>> + * at throttling processes due to the page distribution throughout
>> + * zones or there is heavy usage of a slow backing device. The
>> + * only option is to throttle from reclaim context which is not ideal
>> + * as there is no guarantee the dirtying process is throttled in the
>> + * same way balance_dirty_pages() manages.
>> + *
>> + * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the number
>> + * of pages under pages flagged for immediate reclaim and stall if any
>> + * are encountered in the nr_immediate check below.
>> + */
>> + if (stat.nr_writeback && stat.nr_writeback == stat.nr_taken)
>> + set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>> +
>> + /*
>> + * Legacy memcg will stall in page writeback so avoid forcibly
>> + * stalling here.
>> + */
>> + if (sane_reclaim(sc)) {
>> + /*
>> + * Tag a node as congested if all the dirty pages scanned were
>> + * backed by a congested BDI and wait_iff_congested will stall.
>> + */
>> + if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
>> + set_bit(PGDAT_CONGESTED, &pgdat->flags);
>> +
>> + /* Allow kswapd to start writing pages during reclaim. */
>> + if (stat.nr_unqueued_dirty == stat.nr_taken)
>> + set_bit(PGDAT_DIRTY, &pgdat->flags);
>> +
>> + /*
>> + * If kswapd scans pages marked marked for immediate
>> + * reclaim and under writeback (nr_immediate), it implies
>> + * that pages are cycling through the LRU faster than
>> + * they are written so also forcibly stall.
>> + */
>> + if (stat.nr_immediate)
>> + congestion_wait(BLK_RW_ASYNC, HZ/10);
>> + }
>> +
>> + /*
>> + * Stall direct reclaim for IO completions if underlying BDIs and node
>> + * is congested. Allow kswapd to continue until it starts encountering
>> + * unqueued dirty pages or cycling through the LRU too quickly.
>> + */
>> + if (!sc->hibernation_mode && !current_is_kswapd() &&
>> + current_may_throttle())
>> + wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
>> +
>> } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>> sc->nr_scanned - nr_scanned, sc));
>
> Why didn't you put the whole thing after the loop?
>
Why should this be put after the loop? Here we have already scanned all LRUs on the node and
can decide what state the node is in. If should_continue_reclaim() decides to continue,
the reclaim will be continued in accordance with the state of the node.
On 03/20/2018 06:29 PM, Michal Hocko wrote:
>> Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
>> pgdat, so it's reasonable to leave all decisions about node stat
>> to kswapd. Also add per-cgroup congestion state to avoid needlessly
>> burning CPU in cgroup reclaim if heavy congestion is observed.
>>
>> Currently there is no need in per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
>> bits since they alter only kswapd behavior.
>>
>> The problem could be easily demonstrated by creating heavy congestion
>> in one cgroup:
>>
>> echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
>> mkdir -p /sys/fs/cgroup/congester
>> echo 512M > /sys/fs/cgroup/congester/memory.max
>> echo $$ > /sys/fs/cgroup/congester/cgroup.procs
>> /* generate a lot of dirty data on a slow HDD */
>> while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
>> ....
>> while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
>>
>> and some job in another cgroup:
>>
>> mkdir /sys/fs/cgroup/victim
>> echo 128M > /sys/fs/cgroup/victim/memory.max
>>
>> # time cat /dev/sda > /dev/null
>> real 10m15.054s
>> user 0m0.487s
>> sys 1m8.505s
>>
>> According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
>> of the time sleeping there.
>>
>> With the patch, cat doesn't waste time anymore:
>>
>> # time cat /dev/sda > /dev/null
>> real 5m32.911s
>> user 0m0.411s
>> sys 0m56.664s
>>
>> Signed-off-by: Andrey Ryabinin <[email protected]>
>> ---
>> include/linux/backing-dev.h | 2 +-
>> include/linux/memcontrol.h | 2 ++
>> mm/backing-dev.c | 19 ++++------
>> mm/vmscan.c | 84 ++++++++++++++++++++++++++++++++-------------
>> 4 files changed, 70 insertions(+), 37 deletions(-)
>
> This patch seems overly complicated. Why don't you simply reduce the whole
> pgdat_flags handling to global_reclaim()?
>
In that case cgroup2 reclaim wouldn't have any way of throttling if a cgroup is full of congested dirty pages.
On Wed 21-03-18 13:40:32, Andrey Ryabinin wrote:
> On 03/20/2018 06:25 PM, Michal Hocko wrote:
> > On Thu 15-03-18 19:45:52, Andrey Ryabinin wrote:
> >> We have a separate LRU list for each memory cgroup. Memory reclaim iterates
> >> over cgroups and calls shrink_inactive_list() for every inactive LRU list.
> >> Based on the state of a single LRU, shrink_inactive_list() may flag
> >> the whole node as dirty, congested or under writeback. This is obviously
> >> wrong and hurtful. It's especially hurtful when there is a possibly
> >> small congested cgroup in the system. Then *all* direct reclaims waste time
> >> sleeping in wait_iff_congested().
> >
> > I assume you have seen this in real workloads. Could you be more
> > specific about how you noticed the problem?
> >
>
> Does it matter?
Yes. Having relevant information in the changelog can help other people
to evaluate whether they need to backport the patch. Their symptoms
might be similar or even the same.
> One of our userspace processes has some sort of watchdog.
> When it doesn't receive some event in time, it complains that the process is stuck.
> In this case an in-kernel allocation was stuck in wait_iff_congested().
OK, so normally it would exhibit as a long stall in the page allocator.
Anyway I was more curious about the setup. I assume you have many memcgs
and some of them with a very small hard limit which triggers the
throttling to other memcgs?
> >> Sum the reclaim stats across all visited LRUs on the node and flag the node
> >> as dirty, congested or under writeback based on that sum. This only fixes
> >> the problem for the global reclaim case. Per-cgroup reclaim will be
> >> addressed separately by the next patch.
> >>
> >> This change will also affect systems with no memory cgroups. The reclaimer
> >> now makes decisions based on the reclaim stats of both the anon and file LRU
> >> lists. E.g. if the file list is in a congested state and get_scan_count()
> >> decides to reclaim some anon pages, the reclaimer will start shrinking
> >> anon without the delay in wait_iff_congested() that there was before. That
> >> seems to be a reasonable thing to do: why waste time sleeping before
> >> reclaiming anon, given that we are going to try to reclaim it anyway?
> >
> > Well, if we have few anon pages in the mix then we stop throttling the
> > reclaim, I am afraid. I am worried this might get us kswapd hogging CPU
> > problems back.
> >
>
> Yeah, it's not an ideal choice. If only a few anon pages are taken then *not* throttling is bad,
> and if few file pages and many anon are taken then *not* throttling is probably good.
>
> Anyway, this requires more thought, research, justification, etc.
> I'll change the patch to take only file pages into account, as it was before the patch.
Keeping the status quo would be safer for now. Handling all those kswapd
at 100% CPU issues was quite painful.
> > [...]
>
> >> @@ -2579,6 +2542,58 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >> if (sc->nr_reclaimed - nr_reclaimed)
> >> reclaimable = true;
> >>
> >> + /*
> >> + * If reclaim is isolating dirty pages under writeback, it implies
> >> + * that the long-lived page allocation rate is exceeding the page
> >> + * laundering rate. Either the global limits are not being effective
> >> + * at throttling processes due to the page distribution throughout
> >> + * zones or there is heavy usage of a slow backing device. The
> >> + * only option is to throttle from reclaim context which is not ideal
> >> + * as there is no guarantee the dirtying process is throttled in the
> >> + * same way balance_dirty_pages() manages.
> >> + *
> >> + * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the number
> >> + * of pages under pages flagged for immediate reclaim and stall if any
> >> + * are encountered in the nr_immediate check below.
> >> + */
> >> + if (stat.nr_writeback && stat.nr_writeback == stat.nr_taken)
> >> + set_bit(PGDAT_WRITEBACK, &pgdat->flags);
> >> +
> >> + /*
> >> + * Legacy memcg will stall in page writeback so avoid forcibly
> >> + * stalling here.
> >> + */
> >> + if (sane_reclaim(sc)) {
> >> + /*
> >> + * Tag a node as congested if all the dirty pages scanned were
> >> + * backed by a congested BDI and wait_iff_congested will stall.
> >> + */
> >> + if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
> >> + set_bit(PGDAT_CONGESTED, &pgdat->flags);
> >> +
> >> + /* Allow kswapd to start writing pages during reclaim. */
> >> + if (stat.nr_unqueued_dirty == stat.nr_taken)
> >> + set_bit(PGDAT_DIRTY, &pgdat->flags);
> >> +
> >> + /*
> >> + * If kswapd scans pages marked for immediate
> >> + * reclaim and under writeback (nr_immediate), it implies
> >> + * that pages are cycling through the LRU faster than
> >> + * they are written so also forcibly stall.
> >> + */
> >> + if (stat.nr_immediate)
> >> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> >> + }
> >> +
> >> + /*
> >> + * Stall direct reclaim for IO completions if underlying BDIs and node
> >> + * is congested. Allow kswapd to continue until it starts encountering
> >> + * unqueued dirty pages or cycling through the LRU too quickly.
> >> + */
> >> + if (!sc->hibernation_mode && !current_is_kswapd() &&
> >> + current_may_throttle())
> >> + wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
> >> +
> >> } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> >> sc->nr_scanned - nr_scanned, sc));
> >
> > Why didn't you put the whole thing after the loop?
> >
>
> Why should this be put after the loop? Here we have already scanned all LRUs on the node and
> can decide what state the node is in. If should_continue_reclaim() decides to continue,
> the reclaim will continue in accordance with the state of the node.
I thought the whole point of the change was to evaluate all these
decisions once per pgdat reclaim which would be after the final loop.
I do not have a strong preference here, though, so I was merely asking
because the choice was not obvious to me and the changelog didn't say
either. I guess both are fine.
--
Michal Hocko
SUSE Labs
On Wed 21-03-18 14:14:35, Andrey Ryabinin wrote:
>
>
> On 03/20/2018 06:29 PM, Michal Hocko wrote:
>
> >> Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
> >> pgdat, so it's reasonable to leave all decisions about node stat
> >> to kswapd. Also add per-cgroup congestion state to avoid needlessly
> >> burning CPU in cgroup reclaim if heavy congestion is observed.
> >>
> >> Currently there is no need for per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
> >> bits since they alter only kswapd behavior.
> >>
> >> The problem could be easily demonstrated by creating heavy congestion
> >> in one cgroup:
> >>
> >> echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> >> mkdir -p /sys/fs/cgroup/congester
> >> echo 512M > /sys/fs/cgroup/congester/memory.max
> >> echo $$ > /sys/fs/cgroup/congester/cgroup.procs
> >> /* generate a lot of dirty data on slow HDD */
> >> while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
> >> ....
> >> while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
> >>
> >> and some job in another cgroup:
> >>
> >> mkdir /sys/fs/cgroup/victim
> >> echo 128M > /sys/fs/cgroup/victim/memory.max
> >>
> >> # time cat /dev/sda > /dev/null
> >> real 10m15.054s
> >> user 0m0.487s
> >> sys 1m8.505s
> >>
> >> According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
> >> of the time sleeping there.
> >>
> >> With the patch, cat doesn't waste time anymore:
> >>
> >> # time cat /dev/sda > /dev/null
> >> real 5m32.911s
> >> user 0m0.411s
> >> sys 0m56.664s
> >>
> >> Signed-off-by: Andrey Ryabinin <[email protected]>
> >> ---
> >> include/linux/backing-dev.h | 2 +-
> >> include/linux/memcontrol.h | 2 ++
> >> mm/backing-dev.c | 19 ++++------
> >> mm/vmscan.c | 84 ++++++++++++++++++++++++++++++++-------------
> >> 4 files changed, 70 insertions(+), 37 deletions(-)
> >
> > This patch seems overly complicated. Why don't you simply reduce the whole
> > pgdat_flags handling to global_reclaim()?
> >
>
> In that case cgroup2 reclaim wouldn't have any way of throttling if
>> a cgroup is full of congested dirty pages.
It's been some time since I've looked into the throttling code so pardon
my ignorance. Don't cgroup v2 users get throttled in the write path to
not dirty too many pages in the first place?
In other words, is your patch trying to fix two different things? One is
per-memcg reclaim influencing the global pgdat state, which is clearly
wrong, and the other is cgroup v2 reclaim throttling that is not pgdat based?
--
Michal Hocko
SUSE Labs
On 03/21/2018 02:32 PM, Michal Hocko wrote:
> On Wed 21-03-18 13:40:32, Andrey Ryabinin wrote:
>> On 03/20/2018 06:25 PM, Michal Hocko wrote:
>>> On Thu 15-03-18 19:45:52, Andrey Ryabinin wrote:
>>>> We have a separate LRU list for each memory cgroup. Memory reclaim iterates
>>>> over cgroups and calls shrink_inactive_list() for every inactive LRU list.
>>>> Based on the state of a single LRU, shrink_inactive_list() may flag
>>>> the whole node as dirty, congested or under writeback. This is obviously
>>>> wrong and hurtful. It's especially hurtful when there is a possibly
>>>> small congested cgroup in the system. Then *all* direct reclaims waste time
>>>> sleeping in wait_iff_congested().
>>>
>>> I assume you have seen this in real workloads. Could you be more
>>> specific about how you noticed the problem?
>>>
>>
>> Does it matter?
>
> Yes. Having relevant information in the changelog can help other people
> to evaluate whether they need to backport the patch. Their symptoms
> might be similar or even the same.
>
>> One of our userspace processes has some sort of watchdog.
>> When it doesn't receive some event in time, it complains that the process is stuck.
>> In this case an in-kernel allocation was stuck in wait_iff_congested().
>
> OK, so normally it would exhibit as a long stall in the page allocator.
> Anyway I was more curious about the setup. I assume you have many memcgs
> and some of them with a very small hard limit which triggers the
> throttling to other memcgs?
Quite some time has passed since this was observed, so I may not remember all the details by now.
I can't tell you whether there really were many memcgs or just a few, but the more memcgs we have,
the more severe the issue is, since wait_iff_congested() is called per LRU.
What I've seen was one cgroup A doing a lot of writes to NFS. It's easy to congest NFS
by generating more than nfs_congestion_kb of writeback pages.
Another task (the one with the watchdog) from a different cgroup B went into *global* direct reclaim
and stalled in wait_iff_congested().
The system had dozens of gigabytes of clean inactive file pages and relatively few dirty/writeback pages on NFS.
So, to trigger the issue one must have one memcg with mostly dirty pages on a congested device.
It doesn't have to be a small or hard-limited memcg.
Global reclaim kicks in, sees a 'congested' memcg, sets the CONGESTED bit, stalls in wait_iff_congested(),
goes to the next memcg, stalls again, and so on until the reclaim goal is satisfied.
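
To spell out the flow I'm describing, a simplified sketch of the current code
(details and error handling elided; the checks actually live at the bottom of
shrink_inactive_list()):

        /* global direct reclaim walks every memcg with pages on this node */
        memcg = mem_cgroup_iter(root, NULL, &reclaim);
        do {
                /*
                 * For each LRU of each memcg, the whole node is judged
                 * from the stats of that single LRU ...
                 */
                if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
                        set_bit(PGDAT_CONGESTED, &pgdat->flags);

                /*
                 * ... and once the bit is set, every subsequent LRU stalls
                 * here for up to HZ/10, no matter how much clean memory
                 * the node still has.
                 */
                if (!current_is_kswapd() && current_may_throttle())
                        wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);

        } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));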
On 03/21/2018 02:43 PM, Michal Hocko wrote:
> On Wed 21-03-18 14:14:35, Andrey Ryabinin wrote:
>>
>>
>> On 03/20/2018 06:29 PM, Michal Hocko wrote:
>>
>>>> Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
>>>> pgdat, so it's reasonable to leave all decisions about node stat
>>>> to kswapd. Also add per-cgroup congestion state to avoid needlessly
>>>> burning CPU in cgroup reclaim if heavy congestion is observed.
>>>>
>>>> Currently there is no need for per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
>>>> bits since they alter only kswapd behavior.
>>>>
>>>> The problem could be easily demonstrated by creating heavy congestion
>>>> in one cgroup:
>>>>
>>>> echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
>>>> mkdir -p /sys/fs/cgroup/congester
>>>> echo 512M > /sys/fs/cgroup/congester/memory.max
>>>> echo $$ > /sys/fs/cgroup/congester/cgroup.procs
>>>> /* generate a lot of dirty data on slow HDD */
>>>> while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
>>>> ....
>>>> while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
>>>>
>>>> and some job in another cgroup:
>>>>
>>>> mkdir /sys/fs/cgroup/victim
>>>> echo 128M > /sys/fs/cgroup/victim/memory.max
>>>>
>>>> # time cat /dev/sda > /dev/null
>>>> real 10m15.054s
>>>> user 0m0.487s
>>>> sys 1m8.505s
>>>>
>>>> According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
>>>> of the time sleeping there.
>>>>
>>>> With the patch, cat doesn't waste time anymore:
>>>>
>>>> # time cat /dev/sda > /dev/null
>>>> real 5m32.911s
>>>> user 0m0.411s
>>>> sys 0m56.664s
>>>>
>>>> Signed-off-by: Andrey Ryabinin <[email protected]>
>>>> ---
>>>> include/linux/backing-dev.h | 2 +-
>>>> include/linux/memcontrol.h | 2 ++
>>>> mm/backing-dev.c | 19 ++++------
>>>> mm/vmscan.c | 84 ++++++++++++++++++++++++++++++++-------------
>>>> 4 files changed, 70 insertions(+), 37 deletions(-)
>>>
>>> This patch seems overly complicated. Why don't you simply reduce the whole
>>> pgdat_flags handling to global_reclaim()?
>>>
>>
>> In that case cgroup2 reclaim wouldn't have any way of throttling if
>> a cgroup is full of congested dirty pages.
>
> It's been some time since I've looked into the throttling code so pardon
> my ignorance. Don't cgroup v2 users get throttled in the write path to
> not dirty too many pages in the first place?
Yes, they do, the same as users without cgroups. Basically, cgroup v2 mimics
all that global reclaim, dirty ratelimiting and throttling machinery.
However, balance_dirty_pages() can't always protect us from the too-many-dirty-pages problem.
E.g. you could mmap() a file and start writing to it at RAM speed;
balance_dirty_pages() can't stop you there.
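
Something like this, for instance (illustrative userspace snippet only; the
path and size are made up):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t size = 1UL << 30;        /* 1G of file-backed pages */
        int fd = open("/mnt/sdb/dirty", O_RDWR | O_CREAT, 0644);
        char *p;

        if (fd < 0 || ftruncate(fd, size))
                return 1;

        p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        /*
         * Pages are dirtied through the mapping at memory speed rather
         * than via write(2), so dirty throttling gets far less chance
         * to slow the writer down before reclaim has to deal with it.
         */
        for (;;)
                memset(p, 0xaa, size);
}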
> In other words, is your patch trying to fix two different things? One is
> per-memcg reclaim influencing the global pgdat state, which is clearly
> wrong, and the other is cgroup v2 reclaim throttling that is not pgdat based?
>
Kinda, but the second problem is introduced by fixing the first one: without
fixing the first problem we do have congestion throttling in cgroup v2.
Yes, it's based on the global pgdat state, which is wrong, but it does throttle
cgroup v2 reclaim when the memcg is congested.
I didn't try to evaluate how useful this whole congestion throttling is,
or whether it's really needed. But if we need it for global reclaim, then
we probably need it in cgroup v2 reclaim too.
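
The shape of what I mean is roughly the following (the helpers and the
'congested' field below are made up for illustration, this is not the actual
diff): keep a congested bit per memcg per node and consult it from cgroup
reclaim instead of pgdat->flags, so only reclaimers working on the congested
cgroup get throttled:

        /* hypothetical per-node flag, e.g. in struct mem_cgroup_per_node */
        static void set_memcg_congestion(struct mem_cgroup *memcg, int nid,
                                         bool congested)
        {
                if (memcg)
                        WRITE_ONCE(memcg->nodeinfo[nid]->congested, congested);
        }

        static bool memcg_congested(struct mem_cgroup *memcg, int nid)
        {
                return memcg && READ_ONCE(memcg->nodeinfo[nid]->congested);
        }

        /* and in the cgroup reclaim path, instead of testing pgdat->flags: */
        if (!global_reclaim(sc) && current_may_throttle() &&
            memcg_congested(sc->target_mem_cgroup, pgdat->node_id))
                wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);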
P.S. Some observation: blk-mq doesn't seem to have any congestion mechanism.
I don't know whether this is intentional or just an oversight. This means congestion
throttling never happens on such systems.