Daniel Dao has reported [1] a regression on workloads that may trigger
a lot of refaults (anon and file). The underlying issue is that flushing
rstat is expensive. Although rstat flushes are batched with (nr_cpus *
MEMCG_BATCH) stat updates, it seems like there are workloads which
genuinely do stat updates larger than the batch value within a short
amount of time. Since the rstat flush can happen in performance critical
codepaths like page faults, such workloads can suffer greatly.

The easiest fix for now is to let performance critical codepaths trigger
the rstat flush asynchronously. This patch converts the refault codepath
to use the async rstat flush. In addition, this patch has preemptively
converted mem_cgroup_wb_stats and shrink_node to also use the async
rstat flush as they may also see similar performance regressions.
Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1]
Fixes: 1f828223b799 ("memcg: flush lruvec stats in the refault")
Reported-by: Daniel Dao <[email protected]>
Signed-off-by: Shakeel Butt <[email protected]>
Cc: <[email protected]>
---
include/linux/memcontrol.h | 1 +
mm/memcontrol.c | 10 +++++++++-
mm/vmscan.c | 2 +-
mm/workingset.c | 2 +-
4 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ef4b445392a9..bfdd48be60ff 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -998,6 +998,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
}
void mem_cgroup_flush_stats(void);
+void mem_cgroup_flush_stats_async(void);
void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
int val);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c695608c521c..4338e8d779b2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -690,6 +690,14 @@ void mem_cgroup_flush_stats(void)
__mem_cgroup_flush_stats();
}
+void mem_cgroup_flush_stats_async(void)
+{
+ if (atomic_read(&stats_flush_threshold) > num_online_cpus()) {
+ atomic_set(&stats_flush_threshold, 0);
+ mod_delayed_work(system_unbound_wq, &stats_flush_dwork, 0);
+ }
+}
+
static void flush_memcg_stats_dwork(struct work_struct *w)
{
__mem_cgroup_flush_stats();
@@ -4522,7 +4530,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
struct mem_cgroup *parent;
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats_async();
*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c6f77e3e6d59..b6c6b165c1ef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3188,7 +3188,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
* Flush the memory cgroup stats, so that we read accurate per-memcg
* lruvec stats for heuristics.
*/
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats_async();
memset(&sc->nr, 0, sizeof(sc->nr));
diff --git a/mm/workingset.c b/mm/workingset.c
index b717eae4e0dd..a4f2b1aa5bcc 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -355,7 +355,7 @@ void workingset_refault(struct folio *folio, void *shadow)
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats_async();
/*
* Compare the distance to the existing workingset size. We
* don't activate pages that couldn't stay resident even if
--
2.35.1.574.g5d30c73bfb-goog
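
For context, a simplified sketch of the existing synchronous flush path
in mm/memcontrol.c around this series (reconstructed from memory of the
~5.17 tree, so details may differ slightly). The point of comparison is
that the sync path does the expensive cgroup_rstat_flush_irqsafe() call
inline, while the new mem_cgroup_flush_stats_async() above only kicks
the delayed work:

static void __mem_cgroup_flush_stats(void)
{
	unsigned long flag;

	/* Only one flusher at a time; concurrent callers just skip. */
	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
		return;

	/* The expensive part: walk and flush the whole rstat tree. */
	cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
	atomic_set(&stats_flush_threshold, 0);
	spin_unlock_irqrestore(&stats_flush_lock, flag);
}

void mem_cgroup_flush_stats(void)
{
	/* Flush only after enough pending updates have accumulated. */
	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
		__mem_cgroup_flush_stats();
}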
On Fri, 25 Feb 2022 16:24:12 -0800 Shakeel Butt <[email protected]> wrote:
> Daniel Dao has reported [1] a regression on workloads that may trigger
> a lot of refaults (anon and file). The underlying issue is that flushing
> rstat is expensive. Although rstat flushes are batched with (nr_cpus *
> MEMCG_BATCH) stat updates, it seems like there are workloads which
> genuinely do stat updates larger than the batch value within a short
> amount of time. Since the rstat flush can happen in performance critical
> codepaths like page faults, such workloads can suffer greatly.
>
> The easiest fix for now is to let performance critical codepaths trigger
> the rstat flush asynchronously. This patch converts the refault codepath
> to use the async rstat flush. In addition, this patch has preemptively
> converted mem_cgroup_wb_stats and shrink_node to also use the async
> rstat flush as they may also see similar performance regressions.
Gee we do this trick a lot and gee I don't like it :(

a) if we're doing too much work then we're doing too much work.
   Punting that work over to a different CPU or thread doesn't alter
   that - it in fact adds more work.

b) there's an assumption here that the flusher is able to keep up
   with the producer. What happens if that isn't the case? Do we
   simply wind up the deferred items until the system goes oom?

   What happens if there's a producer running on each CPU? Can the
   flushers keep up?

   Pathologically, what happens if the producer is running
   task_is_realtime() on a single-CPU system? Or if there's a
   task_is_realtime() producer running on every CPU? The flusher never
   gets to run and we're dead?

An obvious fix is to limit the permissible amount of windup (to what?)
and at some point, do the flushing synchronously anyway.

Or we just don't do any of this at all and put up with the cost of the
current code. I mean, this "fix" is kind of fake anyway, isn't it?
Pushing the 4-10ms delay onto a different CPU will just disrupt
something else which wanted to run on that CPU. The overall effect is
to hide the impact from one particular testcase, but is the benefit
really a real one?
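
One hypothetical way to implement the bounded-windup fallback suggested
above, mirroring the shape of the function in the patch (the cap of four
batches is an arbitrary placeholder, not a tuned value):

void mem_cgroup_flush_stats_async(void)
{
	int pending = atomic_read(&stats_flush_threshold);

	/* Too much windup has accumulated: pay the flush cost here. */
	if (pending > 4 * num_online_cpus()) {
		__mem_cgroup_flush_stats();
		return;
	}

	/* Otherwise defer the flush to the worker, as in the patch. */
	if (pending > num_online_cpus()) {
		atomic_set(&stats_flush_threshold, 0);
		mod_delayed_work(system_unbound_wq, &stats_flush_dwork, 0);
	}
}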
Hi Shakeel,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on hnaz-mm/master]
[also build test ERROR on linus/master v5.17-rc5 next-20220224]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Shakeel-Butt/memcg-async-flush-memcg-stats-from-perf-sensitive-codepaths/20220226-082444
base: https://github.com/hnaz/linux-mm master
config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20220226/[email protected]/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
# https://github.com/0day-ci/linux/commit/5dffeb24975bc4cbe99af650d833eb0183a4882f
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Shakeel-Butt/memcg-async-flush-memcg-stats-from-perf-sensitive-codepaths/20220226-082444
git checkout 5dffeb24975bc4cbe99af650d833eb0183a4882f
# save the config file to linux build tree
mkdir build_dir
make W=1 O=build_dir ARCH=um SUBARCH=i386 SHELL=/bin/bash
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All errors (new ones prefixed by >>):
mm/vmscan.c: In function 'shrink_node':
>> mm/vmscan.c:3191:2: error: implicit declaration of function 'mem_cgroup_flush_stats_async'; did you mean 'mem_cgroup_flush_stats'? [-Werror=implicit-function-declaration]
3191 | mem_cgroup_flush_stats_async();
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
| mem_cgroup_flush_stats
cc1: some warnings being treated as errors
--
mm/workingset.c: In function 'workingset_refault':
>> mm/workingset.c:358:2: error: implicit declaration of function 'mem_cgroup_flush_stats_async'; did you mean 'mem_cgroup_flush_stats'? [-Werror=implicit-function-declaration]
358 | mem_cgroup_flush_stats_async();
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
| mem_cgroup_flush_stats
cc1: some warnings being treated as errors
vim +3191 mm/vmscan.c
3175
3176 static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
3177 {
3178 struct reclaim_state *reclaim_state = current->reclaim_state;
3179 unsigned long nr_reclaimed, nr_scanned;
3180 struct lruvec *target_lruvec;
3181 bool reclaimable = false;
3182 unsigned long file;
3183
3184 target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
3185
3186 again:
3187 /*
3188 * Flush the memory cgroup stats, so that we read accurate per-memcg
3189 * lruvec stats for heuristics.
3190 */
> 3191 mem_cgroup_flush_stats_async();
3192
3193 memset(&sc->nr, 0, sizeof(sc->nr));
3194
3195 nr_reclaimed = sc->nr_reclaimed;
3196 nr_scanned = sc->nr_scanned;
3197
3198 /*
3199 * Determine the scan balance between anon and file LRUs.
3200 */
3201 spin_lock_irq(&target_lruvec->lru_lock);
3202 sc->anon_cost = target_lruvec->anon_cost;
3203 sc->file_cost = target_lruvec->file_cost;
3204 spin_unlock_irq(&target_lruvec->lru_lock);
3205
3206 /*
3207 * Target desirable inactive:active list ratios for the anon
3208 * and file LRU lists.
3209 */
3210 if (!sc->force_deactivate) {
3211 unsigned long refaults;
3212
3213 refaults = lruvec_page_state(target_lruvec,
3214 WORKINGSET_ACTIVATE_ANON);
3215 if (refaults != target_lruvec->refaults[0] ||
3216 inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
3217 sc->may_deactivate |= DEACTIVATE_ANON;
3218 else
3219 sc->may_deactivate &= ~DEACTIVATE_ANON;
3220
3221 /*
3222 * When refaults are being observed, it means a new
3223 * workingset is being established. Deactivate to get
3224 * rid of any stale active pages quickly.
3225 */
3226 refaults = lruvec_page_state(target_lruvec,
3227 WORKINGSET_ACTIVATE_FILE);
3228 if (refaults != target_lruvec->refaults[1] ||
3229 inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
3230 sc->may_deactivate |= DEACTIVATE_FILE;
3231 else
3232 sc->may_deactivate &= ~DEACTIVATE_FILE;
3233 } else
3234 sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
3235
3236 /*
3237 * If we have plenty of inactive file pages that aren't
3238 * thrashing, try to reclaim those first before touching
3239 * anonymous pages.
3240 */
3241 file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
3242 if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
3243 sc->cache_trim_mode = 1;
3244 else
3245 sc->cache_trim_mode = 0;
3246
3247 /*
3248 * Prevent the reclaimer from falling into the cache trap: as
3249 * cache pages start out inactive, every cache fault will tip
3250 * the scan balance towards the file LRU. And as the file LRU
3251 * shrinks, so does the window for rotation from references.
3252 * This means we have a runaway feedback loop where a tiny
3253 * thrashing file LRU becomes infinitely more attractive than
3254 * anon pages. Try to detect this based on file LRU size.
3255 */
3256 if (!cgroup_reclaim(sc)) {
3257 unsigned long total_high_wmark = 0;
3258 unsigned long free, anon;
3259 int z;
3260
3261 free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
3262 file = node_page_state(pgdat, NR_ACTIVE_FILE) +
3263 node_page_state(pgdat, NR_INACTIVE_FILE);
3264
3265 for (z = 0; z < MAX_NR_ZONES; z++) {
3266 struct zone *zone = &pgdat->node_zones[z];
3267 if (!managed_zone(zone))
3268 continue;
3269
3270 total_high_wmark += high_wmark_pages(zone);
3271 }
3272
3273 /*
3274 * Consider anon: if that's low too, this isn't a
3275 * runaway file reclaim problem, but rather just
3276 * extreme pressure. Reclaim as per usual then.
3277 */
3278 anon = node_page_state(pgdat, NR_INACTIVE_ANON);
3279
3280 sc->file_is_tiny =
3281 file + free <= total_high_wmark &&
3282 !(sc->may_deactivate & DEACTIVATE_ANON) &&
3283 anon >> sc->priority;
3284 }
3285
3286 shrink_node_memcgs(pgdat, sc);
3287
3288 if (reclaim_state) {
3289 sc->nr_reclaimed += reclaim_state->reclaimed_slab;
3290 reclaim_state->reclaimed_slab = 0;
3291 }
3292
3293 /* Record the subtree's reclaim efficiency */
3294 vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
3295 sc->nr_scanned - nr_scanned,
3296 sc->nr_reclaimed - nr_reclaimed);
3297
3298 if (sc->nr_reclaimed - nr_reclaimed)
3299 reclaimable = true;
3300
3301 if (current_is_kswapd()) {
3302 /*
3303 * If reclaim is isolating dirty pages under writeback,
3304 * it implies that the long-lived page allocation rate
3305 * is exceeding the page laundering rate. Either the
3306 * global limits are not being effective at throttling
3307 * processes due to the page distribution throughout
3308 * zones or there is heavy usage of a slow backing
3309 * device. The only option is to throttle from reclaim
3310 * context which is not ideal as there is no guarantee
3311 * the dirtying process is throttled in the same way
3312 * balance_dirty_pages() manages.
3313 *
3314 * Once a node is flagged PGDAT_WRITEBACK, kswapd will
3315 * count the number of pages under pages flagged for
3316 * immediate reclaim and stall if any are encountered
3317 * in the nr_immediate check below.
3318 */
3319 if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
3320 set_bit(PGDAT_WRITEBACK, &pgdat->flags);
3321
3322 /* Allow kswapd to start writing pages during reclaim.*/
3323 if (sc->nr.unqueued_dirty == sc->nr.file_taken)
3324 set_bit(PGDAT_DIRTY, &pgdat->flags);
3325
3326 /*
3327 * If kswapd scans pages marked for immediate
3328 * reclaim and under writeback (nr_immediate), it
3329 * implies that pages are cycling through the LRU
3330 * faster than they are written so forcibly stall
3331 * until some pages complete writeback.
3332 */
3333 if (sc->nr.immediate)
3334 reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
3335 }
3336
3337 /*
3338 * Tag a node/memcg as congested if all the dirty pages were marked
3339 * for writeback and immediate reclaim (counted in nr.congested).
3340 *
3341 * Legacy memcg will stall in page writeback so avoid forcibly
3342 * stalling in reclaim_throttle().
3343 */
3344 if ((current_is_kswapd() ||
3345 (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
3346 sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
3347 set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
3348
3349 /*
3350 * Stall direct reclaim for IO completions if the lruvec is
3351 * node is congested. Allow kswapd to continue until it
3352 * starts encountering unqueued dirty pages or cycling through
3353 * the LRU too quickly.
3354 */
3355 if (!current_is_kswapd() && current_may_throttle() &&
3356 !sc->hibernation_mode &&
3357 test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
3358 reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
3359
3360 if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
3361 sc))
3362 goto again;
3363
3364 /*
3365 * Kswapd gives up on balancing particular nodes after too
3366 * many failures to reclaim anything from them and goes to
3367 * sleep. On reclaim progress, reset the failure counter. A
3368 * successful direct reclaim run will revive a dormant kswapd.
3369 */
3370 if (reclaimable)
3371 pgdat->kswapd_failures = 0;
3372 }
3373
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
On Fri, Feb 25, 2022 at 4:58 PM Andrew Morton <[email protected]> wrote:
>
> On Fri, 25 Feb 2022 16:24:12 -0800 Shakeel Butt <[email protected]> wrote:
>
> > Daniel Dao has reported [1] a regression on workloads that may trigger
> > a lot of refaults (anon and file). The underlying issue is that flushing
> > rstat is expensive. Although rstat flushes are batched with (nr_cpus *
> > MEMCG_BATCH) stat updates, it seems like there are workloads which
> > genuinely do stat updates larger than the batch value within a short
> > amount of time. Since the rstat flush can happen in performance critical
> > codepaths like page faults, such workloads can suffer greatly.
> >
> > The easiest fix for now is to let performance critical codepaths trigger
> > the rstat flush asynchronously. This patch converts the refault codepath
> > to use the async rstat flush. In addition, this patch has preemptively
> > converted mem_cgroup_wb_stats and shrink_node to also use the async
> > rstat flush as they may also see similar performance regressions.
>
> Gee we do this trick a lot and gee I don't like it :(
>
> a) if we're doing too much work then we're doing too much work.
> Punting that work over to a different CPU or thread doesn't alter
> that - it in fact adds more work.
>
Please note that we already have the async worker running every 2
seconds. Normally no consumer would need to flush the stats, but if
there are too many stat updates from producers in a short amount of
time, then one of the consumers has to pay the price of the flush.

We have two types of consumers, i.e. performance critical (e.g.
refault) and non-performance critical (e.g. memory.stat or
memory.numa_stat readers). I think we can let the performance critical
consumers skip the synchronous flushing and let the async worker do the
work, for performance reasons.
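
For reference, the periodic flusher being referred to is the delayed
work visible in the patch context above; roughly (simplified sketch,
details may differ), it re-arms itself every 2 seconds:

static void flush_memcg_stats_dwork(struct work_struct *w);
static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);

static void flush_memcg_stats_dwork(struct work_struct *w)
{
	__mem_cgroup_flush_stats();
	/* Re-arm the periodic flush for roughly 2 seconds later. */
	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL * HZ);
}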
> b) there's an assumption here that the flusher is able to keep up
> with the producer. What happens if that isn't the case? Do we
> simply wind up the deferred items until the system goes oom?
>
Without a consumer, nothing bad can happen even if the flusher is slow
(or has too much work) or too many stats are being updated by many
producers.

With a couple of consumers, in the current kernel, one of them may have
to pay the cost of a sync flush. With this patch, we will have two
types of consumers: first, those who are ok to pay the price of a sync
flush to get accurate stats, and second, those who are ok with stale
stats bounded by 2 seconds (assuming the flusher runs every 2 seconds).

BTW there is no concern of the system going into oom due to reading
slightly older stats.
> What happens if there's a producer running on each CPU? Can the
> flushers keep up?
>
> Pathologically, what happens if the producer is running
> task_is_realtime() on a single-CPU system? Or if there's a
> task_is_realtime() producer running on every CPU? The flusher never
> gets to run and we're dead?
>
I think it would have to be a mix of (stat) producers and (stat)
consumers hogging the CPU forever, and no, we will not be dead. At
worst the consumers might make some wrong decisions due to consuming
older stats.

One can argue that since one consumer is the reclaim code, some reclaim
heuristics can get messed up due to older stats. Yes, that can happen.
>
> An obvious fix is to limit the permissible amount of windup (to what?)
> and at some point, do the flushing synchronously anyway.
>
That is what we are currently doing; the limit is nr_cpus * MEMCG_BATCH.
> Or we just don't do any of this at all and put up with the cost of the
> current code. I mean, this "fix" is kind of fake anyway, isn't it?
> Pushing the 4-10ms delay onto a different CPU will just disrupt
> something else which wanted to run on that CPU. The overall effect is
> to hide the impact from one particular testcase, but is the benefit
> really a real one?
>
Yes, the right fix would be to optimize the flushing code (but that
would require more work/time). However I still think letting
performance critical code paths skip the sync flush would be good in
general. So, if the current patch is not to your liking, we can remove
mem_cgroup_flush_stats() from workingset_refault().
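
For illustration only, the alternative mentioned here would amount to a
hunk like the following against the current tree (context approximate):

--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ ... @@ void workingset_refault(struct folio *folio, void *shadow)
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
-	mem_cgroup_flush_stats();
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't activate pages that couldn't stay resident even if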
On Fri, 25 Feb 2022 16:58:42 -0800 Andrew Morton <[email protected]> wrote:
> On Fri, 25 Feb 2022 16:24:12 -0800 Shakeel Butt <[email protected]> wrote:
>
> > Daniel Dao has reported [1] a regression on workloads that may trigger
> > a lot of refaults (anon and file). The underlying issue is that flushing
> > rstat is expensive. Although rstat flushes are batched with (nr_cpus *
> > MEMCG_BATCH) stat updates, it seems like there are workloads which
> > genuinely do stat updates larger than the batch value within a short
> > amount of time. Since the rstat flush can happen in performance critical
> > codepaths like page faults, such workloads can suffer greatly.
> >
> > The easiest fix for now is to let performance critical codepaths trigger
> > the rstat flush asynchronously. This patch converts the refault codepath
> > to use the async rstat flush. In addition, this patch has preemptively
> > converted mem_cgroup_wb_stats and shrink_node to also use the async
> > rstat flush as they may also see similar performance regressions.
>
> Gee we do this trick a lot and gee I don't like it :(
>
> a) if we're doing too much work then we're doing too much work.
> Punting that work over to a different CPU or thread doesn't alter
> that - it in fact adds more work.
>
> b) there's an assumption here that the flusher is able to keep up
> with the producer. What happens if that isn't the case? Do we
> simply wind up the deferred items until the system goes oom?
>
> What happens if there's a producer running on each CPU? Can the
> flushers keep up?
>
> Pathologically, what happens if the producer is running
> task_is_realtime() on a single-CPU system? Or if there's a
> task_is_realtime() producer running on every CPU? The flusher never
> gets to run and we're dead?
Not some theoretical thing, btw. See how __read_swap_cache_async()
just got its sins exposed by real-time tasks:
https://lkml.kernel.org/r/[email protected]
Hi Shakeel,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on hnaz-mm/master]
[also build test ERROR on linus/master v5.17-rc5 next-20220225]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Shakeel-Butt/memcg-async-flush-memcg-stats-from-perf-sensitive-codepaths/20220226-082444
base: https://github.com/hnaz/linux-mm master
config: hexagon-randconfig-r023-20220226 (https://download.01.org/0day-ci/archive/20220226/[email protected]/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project d271fc04d5b97b12e6b797c6067d3c96a8d7470e)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/5dffeb24975bc4cbe99af650d833eb0183a4882f
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Shakeel-Butt/memcg-async-flush-memcg-stats-from-perf-sensitive-codepaths/20220226-082444
git checkout 5dffeb24975bc4cbe99af650d833eb0183a4882f
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All errors (new ones prefixed by >>):
>> mm/vmscan.c:3191:2: error: implicit declaration of function 'mem_cgroup_flush_stats_async' [-Werror,-Wimplicit-function-declaration]
mem_cgroup_flush_stats_async();
^
mm/vmscan.c:3191:2: note: did you mean 'mem_cgroup_flush_stats'?
include/linux/memcontrol.h:1441:20: note: 'mem_cgroup_flush_stats' declared here
static inline void mem_cgroup_flush_stats(void)
^
1 error generated.
--
>> mm/workingset.c:358:2: error: implicit declaration of function 'mem_cgroup_flush_stats_async' [-Werror,-Wimplicit-function-declaration]
mem_cgroup_flush_stats_async();
^
mm/workingset.c:358:2: note: did you mean 'mem_cgroup_flush_stats'?
include/linux/memcontrol.h:1441:20: note: 'mem_cgroup_flush_stats' declared here
static inline void mem_cgroup_flush_stats(void)
^
1 error generated.
vim +/mem_cgroup_flush_stats_async +3191 mm/vmscan.c
3175
3176 static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
3177 {
3178 struct reclaim_state *reclaim_state = current->reclaim_state;
3179 unsigned long nr_reclaimed, nr_scanned;
3180 struct lruvec *target_lruvec;
3181 bool reclaimable = false;
3182 unsigned long file;
3183
3184 target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
3185
3186 again:
3187 /*
3188 * Flush the memory cgroup stats, so that we read accurate per-memcg
3189 * lruvec stats for heuristics.
3190 */
> 3191 mem_cgroup_flush_stats_async();
3192
3193 memset(&sc->nr, 0, sizeof(sc->nr));
3194
3195 nr_reclaimed = sc->nr_reclaimed;
3196 nr_scanned = sc->nr_scanned;
3197
3198 /*
3199 * Determine the scan balance between anon and file LRUs.
3200 */
3201 spin_lock_irq(&target_lruvec->lru_lock);
3202 sc->anon_cost = target_lruvec->anon_cost;
3203 sc->file_cost = target_lruvec->file_cost;
3204 spin_unlock_irq(&target_lruvec->lru_lock);
3205
3206 /*
3207 * Target desirable inactive:active list ratios for the anon
3208 * and file LRU lists.
3209 */
3210 if (!sc->force_deactivate) {
3211 unsigned long refaults;
3212
3213 refaults = lruvec_page_state(target_lruvec,
3214 WORKINGSET_ACTIVATE_ANON);
3215 if (refaults != target_lruvec->refaults[0] ||
3216 inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
3217 sc->may_deactivate |= DEACTIVATE_ANON;
3218 else
3219 sc->may_deactivate &= ~DEACTIVATE_ANON;
3220
3221 /*
3222 * When refaults are being observed, it means a new
3223 * workingset is being established. Deactivate to get
3224 * rid of any stale active pages quickly.
3225 */
3226 refaults = lruvec_page_state(target_lruvec,
3227 WORKINGSET_ACTIVATE_FILE);
3228 if (refaults != target_lruvec->refaults[1] ||
3229 inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
3230 sc->may_deactivate |= DEACTIVATE_FILE;
3231 else
3232 sc->may_deactivate &= ~DEACTIVATE_FILE;
3233 } else
3234 sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
3235
3236 /*
3237 * If we have plenty of inactive file pages that aren't
3238 * thrashing, try to reclaim those first before touching
3239 * anonymous pages.
3240 */
3241 file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
3242 if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
3243 sc->cache_trim_mode = 1;
3244 else
3245 sc->cache_trim_mode = 0;
3246
3247 /*
3248 * Prevent the reclaimer from falling into the cache trap: as
3249 * cache pages start out inactive, every cache fault will tip
3250 * the scan balance towards the file LRU. And as the file LRU
3251 * shrinks, so does the window for rotation from references.
3252 * This means we have a runaway feedback loop where a tiny
3253 * thrashing file LRU becomes infinitely more attractive than
3254 * anon pages. Try to detect this based on file LRU size.
3255 */
3256 if (!cgroup_reclaim(sc)) {
3257 unsigned long total_high_wmark = 0;
3258 unsigned long free, anon;
3259 int z;
3260
3261 free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
3262 file = node_page_state(pgdat, NR_ACTIVE_FILE) +
3263 node_page_state(pgdat, NR_INACTIVE_FILE);
3264
3265 for (z = 0; z < MAX_NR_ZONES; z++) {
3266 struct zone *zone = &pgdat->node_zones[z];
3267 if (!managed_zone(zone))
3268 continue;
3269
3270 total_high_wmark += high_wmark_pages(zone);
3271 }
3272
3273 /*
3274 * Consider anon: if that's low too, this isn't a
3275 * runaway file reclaim problem, but rather just
3276 * extreme pressure. Reclaim as per usual then.
3277 */
3278 anon = node_page_state(pgdat, NR_INACTIVE_ANON);
3279
3280 sc->file_is_tiny =
3281 file + free <= total_high_wmark &&
3282 !(sc->may_deactivate & DEACTIVATE_ANON) &&
3283 anon >> sc->priority;
3284 }
3285
3286 shrink_node_memcgs(pgdat, sc);
3287
3288 if (reclaim_state) {
3289 sc->nr_reclaimed += reclaim_state->reclaimed_slab;
3290 reclaim_state->reclaimed_slab = 0;
3291 }
3292
3293 /* Record the subtree's reclaim efficiency */
3294 vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
3295 sc->nr_scanned - nr_scanned,
3296 sc->nr_reclaimed - nr_reclaimed);
3297
3298 if (sc->nr_reclaimed - nr_reclaimed)
3299 reclaimable = true;
3300
3301 if (current_is_kswapd()) {
3302 /*
3303 * If reclaim is isolating dirty pages under writeback,
3304 * it implies that the long-lived page allocation rate
3305 * is exceeding the page laundering rate. Either the
3306 * global limits are not being effective at throttling
3307 * processes due to the page distribution throughout
3308 * zones or there is heavy usage of a slow backing
3309 * device. The only option is to throttle from reclaim
3310 * context which is not ideal as there is no guarantee
3311 * the dirtying process is throttled in the same way
3312 * balance_dirty_pages() manages.
3313 *
3314 * Once a node is flagged PGDAT_WRITEBACK, kswapd will
3315 * count the number of pages under pages flagged for
3316 * immediate reclaim and stall if any are encountered
3317 * in the nr_immediate check below.
3318 */
3319 if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
3320 set_bit(PGDAT_WRITEBACK, &pgdat->flags);
3321
3322 /* Allow kswapd to start writing pages during reclaim.*/
3323 if (sc->nr.unqueued_dirty == sc->nr.file_taken)
3324 set_bit(PGDAT_DIRTY, &pgdat->flags);
3325
3326 /*
3327 * If kswapd scans pages marked for immediate
3328 * reclaim and under writeback (nr_immediate), it
3329 * implies that pages are cycling through the LRU
3330 * faster than they are written so forcibly stall
3331 * until some pages complete writeback.
3332 */
3333 if (sc->nr.immediate)
3334 reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
3335 }
3336
3337 /*
3338 * Tag a node/memcg as congested if all the dirty pages were marked
3339 * for writeback and immediate reclaim (counted in nr.congested).
3340 *
3341 * Legacy memcg will stall in page writeback so avoid forcibly
3342 * stalling in reclaim_throttle().
3343 */
3344 if ((current_is_kswapd() ||
3345 (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
3346 sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
3347 set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
3348
3349 /*
3350 * Stall direct reclaim for IO completions if the lruvec is
3351 * node is congested. Allow kswapd to continue until it
3352 * starts encountering unqueued dirty pages or cycling through
3353 * the LRU too quickly.
3354 */
3355 if (!current_is_kswapd() && current_may_throttle() &&
3356 !sc->hibernation_mode &&
3357 test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
3358 reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
3359
3360 if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
3361 sc))
3362 goto again;
3363
3364 /*
3365 * Kswapd gives up on balancing particular nodes after too
3366 * many failures to reclaim anything from them and goes to
3367 * sleep. On reclaim progress, reset the failure counter. A
3368 * successful direct reclaim run will revive a dormant kswapd.
3369 */
3370 if (reclaimable)
3371 pgdat->kswapd_failures = 0;
3372 }
3373
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
On Fri, Feb 25, 2022 at 05:42:57PM -0800, Shakeel Butt <[email protected]> wrote:
> Yes, the right fix would be to optimize the flushing code (but that
> would require more work/time). However I still think letting
> performance critical code paths skip the sync flush would be good in
> general. So, if the current patch is not to your liking, we can remove
> mem_cgroup_flush_stats() from workingset_refault().
What about flushing just the subtree of the memcg where the refault
happens?

It doesn't reduce the overall work and there's still the full-tree
cgroup_rstat_lock, but it should make the chunks of work smaller and
the durations more regular.
Michal
On Mon, Feb 28, 2022 at 10:46 AM Michal Koutný <[email protected]> wrote:
>
> On Fri, Feb 25, 2022 at 05:42:57PM -0800, Shakeel Butt <[email protected]> wrote:
> > Yes, the right fix would be to optimize the flushing code (but that
> > would require more work/time). However I still think letting
> > performance critical code paths skip the sync flush would be good in
> > general. So, if the current patch is not to your liking, we can remove
> > mem_cgroup_flush_stats() from workingset_refault().
>
> What about flushing just the subtree of the memcg where the refault
> happens?
>
> It doesn't reduce the overall work and there's still the full-tree
> cgroup_rstat_lock, but it should make the chunks of work smaller and
> the durations more regular.
>
We can try that and I will send a patch to Ivan and Daniel to try on
their workload to see the real impact of targeted memcg flushing.
However I am not very optimistic about it.
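
A rough sketch of what such targeted flushing could look like (the
helper name is hypothetical; note that cgroup_rstat_flush_irqsafe() on a
subtree still takes the global cgroup_rstat_lock, so this only shrinks
the per-call work rather than the total):

static void mem_cgroup_flush_stats_subtree(struct mem_cgroup *memcg)
{
	/* Flush only this memcg's subtree instead of the whole hierarchy. */
	cgroup_rstat_flush_irqsafe(memcg->css.cgroup);
}

The refault path would then flush just the memcg the folio was evicted
from, instead of calling the global mem_cgroup_flush_stats().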
On Fri 25-02-22 16:24:12, Shakeel Butt wrote:
> Daniel Dao has reported [1] a regression on workloads that may trigger
> a lot of refaults (anon and file). The underlying issue is that flushing
> rstat is expensive. Although rstat flushes are batched with (nr_cpus *
> MEMCG_BATCH) stat updates, it seems like there are workloads which
> genuinely do stat updates larger than the batch value within a short
> amount of time. Since the rstat flush can happen in performance critical
> codepaths like page faults, such workloads can suffer greatly.
>
> The easiest fix for now is to let performance critical codepaths trigger
> the rstat flush asynchronously. This patch converts the refault codepath
> to use the async rstat flush. In addition, this patch has preemptively
> converted mem_cgroup_wb_stats and shrink_node to also use the async
> rstat flush as they may also see similar performance regressions.
Why do we need to trigger flushing in the first place from those paths?
Later in the thread you are saying there is regular flushing done
every 2 seconds. What would happen if these paths didn't flush at all?
Also please note that WQ context can be overwhelmed by other work, so
these flushes can happen much, much later.

So in other words, why does async work (that can happen at any time
without any control) make more sense than no flushing?
--
Michal Hocko
SUSE Labs
On Tue, Mar 1, 2022 at 1:05 AM Michal Hocko <[email protected]> wrote:
>
> On Fri 25-02-22 16:24:12, Shakeel Butt wrote:
> > Daniel Dao has reported [1] a regression on workloads that may trigger
> > a lot of refaults (anon and file). The underlying issue is that flushing
> > rstat is expensive. Although rstat flushes are batched with (nr_cpus *
> > MEMCG_BATCH) stat updates, it seems like there are workloads which
> > genuinely do stat updates larger than the batch value within a short
> > amount of time. Since the rstat flush can happen in performance critical
> > codepaths like page faults, such workloads can suffer greatly.
> >
> > The easiest fix for now is to let performance critical codepaths trigger
> > the rstat flush asynchronously. This patch converts the refault codepath
> > to use the async rstat flush. In addition, this patch has preemptively
> > converted mem_cgroup_wb_stats and shrink_node to also use the async
> > rstat flush as they may also see similar performance regressions.
>
> Why do we need to trigger flushing in the first place from those paths?
> Later in the thread you are saying there is regular flushing done
> every 2 seconds. What would happen if these paths didn't flush at all?
> Also please note that WQ context can be overwhelmed by other work, so
> these flushes can happen much, much later.
>
> So in other words, why does async work (that can happen at any time
> without any control) make more sense than no flushing?
> --
Without flushing, the worst that can happen in the refault path is
false (or missed) activations of the refaulted page. For the reclaim
code, some heuristics (like deactivating the active LRU or cache-trim)
may act on old information.

However I don't think these are too concerning, as the kernel can
already miss or do false activations on refault. For the reclaim code,
the kernel does force deactivation if it has skipped it in the initial
iterations, so there is not much to worry about.

Now, coming to your question: yes, we can remove the flushing from
these performance critical codepaths as the stats will be at most 2
seconds old due to the periodic flush. For the worst case scenario
where that periodic flush (WQ) is not getting any CPU time, I think it
is reasonable to do a sync flush if the periodic flush has not happened
for, let's say, 10 seconds.
Making decisions based on up to 2 s old information.
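
A hypothetical sketch of that fallback (the flush_last_time bookkeeping
and the 10 second bound are illustrative, not from an existing tree;
__mem_cgroup_flush_stats() would record the time of its last successful
flush):

static unsigned long flush_last_time;

void mem_cgroup_flush_stats_async(void)
{
	/* Periodic worker looks starved: fall back to a sync flush. */
	if (time_after(jiffies, READ_ONCE(flush_last_time) + 10 * HZ)) {
		__mem_cgroup_flush_stats();
		return;
	}

	if (atomic_read(&stats_flush_threshold) > num_online_cpus()) {
		atomic_set(&stats_flush_threshold, 0);
		mod_delayed_work(system_unbound_wq, &stats_flush_dwork, 0);
	}
}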
On Tue, Mar 01, 2022 at 09:21:12AM -0800, Shakeel Butt <[email protected]> wrote:
> Without flushing the worst that can happen in the refault path is
> false (or missed) activations of the refaulted page.
Yeah, this may under- or overestimate the workingset size (when it's
changing); the result is likely only less efficient reclaim.
> For reclaim code, some heuristics (like deactivating active LRU or
> cache-trim) may act on old information.
Here, I'd be more careful about whether such a delay could introduce
some unstable behavior (permanent oscillation in the worst case).
> Now, coming to your question: yes, we can remove the flushing from
> these performance critical codepaths as the stats will be at most 2
> seconds old due to the periodic flush.
Another aspect is that people will notice and report such a narrowly
located performance regression more easily than reduced or less
predictable reclaim behavior. (IMO the former is better; OTOH, it can
also be interpreted that no one notices (is able to notice).)
Michal