2023-09-21 22:29:01

by Yosry Ahmed

Subject: [PATCH 4/5] mm: workingset: move the stats flush into workingset_test_recent()

The workingset code flushes the stats in workingset_refault() to get
accurate stats of the eviction memcg. In preparation for more scoped
flushes, and for passing the eviction memcg to the flush call, move the
flush call to workingset_test_recent(), where we have a pointer to the
eviction memcg.

The flush call can sleep and hence cannot be made inside an RCU read
section. Hence, minimize the RCU read section by moving it into
workingset_test_recent() as well. Furthermore, instead of holding the
RCU read lock throughout workingset_test_recent(), hold it only briefly
to get a reference on the eviction memcg. This allows the flush call to
be made after the RCU read lock is dropped.

As for workingset_refault(), nothing else there appears to be protected
by RCU. The memcg of the faulted folio (which is not necessarily the
same as the eviction memcg) is protected by the folio lock, which is
held by all callsites. Add a VM_BUG_ON_FOLIO() to make sure this
doesn't change from under us.

No functional change intended.

Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/workingset.c | 31 ++++++++++++++++++++-----------
1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index b192e44a0e7c..79b338996088 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -425,8 +425,15 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
struct pglist_data *pgdat;
unsigned long eviction;

- if (lru_gen_enabled())
- return lru_gen_test_recent(shadow, file, &eviction_lruvec, &eviction, workingset);
+ rcu_read_lock();
+
+ if (lru_gen_enabled()) {
+ bool recent = lru_gen_test_recent(shadow, file,
+ &eviction_lruvec, &eviction,
+ workingset);
+ rcu_read_unlock();
+ return recent;
+ }

unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
eviction <<= bucket_order;
@@ -451,6 +458,12 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
if (!mem_cgroup_disabled() && !eviction_memcg)
return false;

+ css_get(&eviction_memcg->css);
+ rcu_read_unlock();
+
+ /* Flush stats (and potentially sleep) outside the RCU read section */
+ mem_cgroup_flush_stats_ratelimited();
+
eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
refault = atomic_long_read(&eviction_lruvec->nonresident_age);

@@ -493,6 +506,7 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
}
}

+ mem_cgroup_put(eviction_memcg);
return refault_distance <= workingset_size;
}

@@ -519,19 +533,16 @@ void workingset_refault(struct folio *folio, void *shadow)
return;
}

- /* Flush stats (and potentially sleep) before holding RCU read lock */
- mem_cgroup_flush_stats_ratelimited();
-
- rcu_read_lock();
-
/*
* The activation decision for this folio is made at the level
* where the eviction occurred, as that is where the LRU order
* during folio reclaim is being determined.
*
* However, the cgroup that will own the folio is the one that
- * is actually experiencing the refault event.
+ * is actually experiencing the refault event. Make sure the folio is
+ * locked to guarantee folio_memcg() stability throughout.
*/
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
nr = folio_nr_pages(folio);
memcg = folio_memcg(folio);
pgdat = folio_pgdat(folio);
@@ -540,7 +551,7 @@ void workingset_refault(struct folio *folio, void *shadow)
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);

if (!workingset_test_recent(shadow, file, &workingset))
- goto out;
+ return;

folio_set_active(folio);
workingset_age_nonresident(lruvec, nr);
@@ -556,8 +567,6 @@ void workingset_refault(struct folio *folio, void *shadow)
lru_note_cost_refault(folio);
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
}
-out:
- rcu_read_unlock();
}

/**
--
2.42.0.459.ge4e396fd5e-goog


2023-09-22 01:15:22

by kernel test robot

Subject: Re: [PATCH 4/5] mm: workingset: move the stats flush into workingset_test_recent()

Hi Yosry,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20230921-161246
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20230921081057.3440885-5-yosryahmed%40google.com
patch subject: [PATCH 4/5] mm: workingset: move the stats flush into workingset_test_recent()
config: powerpc-allnoconfig (https://download.01.org/0day-ci/archive/20230921/[email protected]/config)
compiler: powerpc-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230921/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

mm/workingset.c: In function 'workingset_test_recent':
>> mm/workingset.c:461:32: error: invalid use of undefined type 'struct mem_cgroup'
461 | css_get(&eviction_memcg->css);
| ^~


vim +461 mm/workingset.c

405
406 /**
407 * workingset_test_recent - tests if the shadow entry is for a folio that was
408 * recently evicted. Also fills in @workingset with the value unpacked from
409 * shadow.
410 * @shadow: the shadow entry to be tested.
411 * @file: whether the corresponding folio is from the file lru.
412 * @workingset: where the workingset value unpacked from shadow should
413 * be stored.
414 *
415 * Return: true if the shadow is for a recently evicted folio; false otherwise.
416 */
417 bool workingset_test_recent(void *shadow, bool file, bool *workingset)
418 {
419 struct mem_cgroup *eviction_memcg;
420 struct lruvec *eviction_lruvec;
421 unsigned long refault_distance;
422 unsigned long workingset_size;
423 unsigned long refault;
424 int memcgid;
425 struct pglist_data *pgdat;
426 unsigned long eviction;
427
428 rcu_read_lock();
429
430 if (lru_gen_enabled()) {
431 bool recent = lru_gen_test_recent(shadow, file,
432 &eviction_lruvec, &eviction,
433 workingset);
434 rcu_read_unlock();
435 return recent;
436 }
437
438 unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
439 eviction <<= bucket_order;
440
441 /*
442 * Look up the memcg associated with the stored ID. It might
443 * have been deleted since the folio's eviction.
444 *
445 * Note that in rare events the ID could have been recycled
446 * for a new cgroup that refaults a shared folio. This is
447 * impossible to tell from the available data. However, this
448 * should be a rare and limited disturbance, and activations
449 * are always speculative anyway. Ultimately, it's the aging
450 * algorithm's job to shake out the minimum access frequency
451 * for the active cache.
452 *
453 * XXX: On !CONFIG_MEMCG, this will always return NULL; it
454 * would be better if the root_mem_cgroup existed in all
455 * configurations instead.
456 */
457 eviction_memcg = mem_cgroup_from_id(memcgid);
458 if (!mem_cgroup_disabled() && !eviction_memcg)
459 return false;
460
> 461 css_get(&eviction_memcg->css);
462 rcu_read_unlock();
463
464 /* Flush stats (and potentially sleep) outside the RCU read section */
465 mem_cgroup_flush_stats_ratelimited();
466
467 eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
468 refault = atomic_long_read(&eviction_lruvec->nonresident_age);
469
470 /*
471 * Calculate the refault distance
472 *
473 * The unsigned subtraction here gives an accurate distance
474 * across nonresident_age overflows in most cases. There is a
475 * special case: usually, shadow entries have a short lifetime
476 * and are either refaulted or reclaimed along with the inode
477 * before they get too old. But it is not impossible for the
478 * nonresident_age to lap a shadow entry in the field, which
479 * can then result in a false small refault distance, leading
480 * to a false activation should this old entry actually
481 * refault again. However, earlier kernels used to deactivate
482 * unconditionally with *every* reclaim invocation for the
483 * longest time, so the occasional inappropriate activation
484 * leading to pressure on the active list is not a problem.
485 */
486 refault_distance = (refault - eviction) & EVICTION_MASK;
487
488 /*
489 * Compare the distance to the existing workingset size. We
490 * don't activate pages that couldn't stay resident even if
491 * all the memory was available to the workingset. Whether
492 * workingset competition needs to consider anon or not depends
493 * on having free swap space.
494 */
495 workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
496 if (!file) {
497 workingset_size += lruvec_page_state(eviction_lruvec,
498 NR_INACTIVE_FILE);
499 }
500 if (mem_cgroup_get_nr_swap_pages(eviction_memcg) > 0) {
501 workingset_size += lruvec_page_state(eviction_lruvec,
502 NR_ACTIVE_ANON);
503 if (file) {
504 workingset_size += lruvec_page_state(eviction_lruvec,
505 NR_INACTIVE_ANON);
506 }
507 }
508
509 mem_cgroup_put(eviction_memcg);
510 return refault_distance <= workingset_size;
511 }
512

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-09-22 02:15:41

by kernel test robot

Subject: Re: [PATCH 4/5] mm: workingset: move the stats flush into workingset_test_recent()

Hi Yosry,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20230921-161246
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20230921081057.3440885-5-yosryahmed%40google.com
patch subject: [PATCH 4/5] mm: workingset: move the stats flush into workingset_test_recent()
config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20230922/[email protected]/config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230922/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

mm/workingset.c: In function 'workingset_test_recent':
>> mm/workingset.c:461:25: error: dereferencing pointer to incomplete type 'struct mem_cgroup'
css_get(&eviction_memcg->css);
^~


vim +461 mm/workingset.c

405
406 /**
407 * workingset_test_recent - tests if the shadow entry is for a folio that was
408 * recently evicted. Also fills in @workingset with the value unpacked from
409 * shadow.
410 * @shadow: the shadow entry to be tested.
411 * @file: whether the corresponding folio is from the file lru.
412 * @workingset: where the workingset value unpacked from shadow should
413 * be stored.
414 *
415 * Return: true if the shadow is for a recently evicted folio; false otherwise.
416 */
417 bool workingset_test_recent(void *shadow, bool file, bool *workingset)
418 {
419 struct mem_cgroup *eviction_memcg;
420 struct lruvec *eviction_lruvec;
421 unsigned long refault_distance;
422 unsigned long workingset_size;
423 unsigned long refault;
424 int memcgid;
425 struct pglist_data *pgdat;
426 unsigned long eviction;
427
428 rcu_read_lock();
429
430 if (lru_gen_enabled()) {
431 bool recent = lru_gen_test_recent(shadow, file,
432 &eviction_lruvec, &eviction,
433 workingset);
434 rcu_read_unlock();
435 return recent;
436 }
437
438 unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
439 eviction <<= bucket_order;
440
441 /*
442 * Look up the memcg associated with the stored ID. It might
443 * have been deleted since the folio's eviction.
444 *
445 * Note that in rare events the ID could have been recycled
446 * for a new cgroup that refaults a shared folio. This is
447 * impossible to tell from the available data. However, this
448 * should be a rare and limited disturbance, and activations
449 * are always speculative anyway. Ultimately, it's the aging
450 * algorithm's job to shake out the minimum access frequency
451 * for the active cache.
452 *
453 * XXX: On !CONFIG_MEMCG, this will always return NULL; it
454 * would be better if the root_mem_cgroup existed in all
455 * configurations instead.
456 */
457 eviction_memcg = mem_cgroup_from_id(memcgid);
458 if (!mem_cgroup_disabled() && !eviction_memcg)
459 return false;
460
> 461 css_get(&eviction_memcg->css);
462 rcu_read_unlock();
463
464 /* Flush stats (and potentially sleep) outside the RCU read section */
465 mem_cgroup_flush_stats_ratelimited();
466
467 eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
468 refault = atomic_long_read(&eviction_lruvec->nonresident_age);
469
470 /*
471 * Calculate the refault distance
472 *
473 * The unsigned subtraction here gives an accurate distance
474 * across nonresident_age overflows in most cases. There is a
475 * special case: usually, shadow entries have a short lifetime
476 * and are either refaulted or reclaimed along with the inode
477 * before they get too old. But it is not impossible for the
478 * nonresident_age to lap a shadow entry in the field, which
479 * can then result in a false small refault distance, leading
480 * to a false activation should this old entry actually
481 * refault again. However, earlier kernels used to deactivate
482 * unconditionally with *every* reclaim invocation for the
483 * longest time, so the occasional inappropriate activation
484 * leading to pressure on the active list is not a problem.
485 */
486 refault_distance = (refault - eviction) & EVICTION_MASK;
487
488 /*
489 * Compare the distance to the existing workingset size. We
490 * don't activate pages that couldn't stay resident even if
491 * all the memory was available to the workingset. Whether
492 * workingset competition needs to consider anon or not depends
493 * on having free swap space.
494 */
495 workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
496 if (!file) {
497 workingset_size += lruvec_page_state(eviction_lruvec,
498 NR_INACTIVE_FILE);
499 }
500 if (mem_cgroup_get_nr_swap_pages(eviction_memcg) > 0) {
501 workingset_size += lruvec_page_state(eviction_lruvec,
502 NR_ACTIVE_ANON);
503 if (file) {
504 workingset_size += lruvec_page_state(eviction_lruvec,
505 NR_INACTIVE_ANON);
506 }
507 }
508
509 mem_cgroup_put(eviction_memcg);
510 return refault_distance <= workingset_size;
511 }
512

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-09-22 09:52:47

by Yosry Ahmed

Subject: Re: [PATCH 4/5] mm: workingset: move the stats flush into workingset_test_recent()

On Thu, Sep 21, 2023 at 4:06 AM kernel test robot <[email protected]> wrote:
>
> Hi Yosry,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20230921-161246
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/20230921081057.3440885-5-yosryahmed%40google.com
> patch subject: [PATCH 4/5] mm: workingset: move the stats flush into workingset_test_recent()
> config: powerpc-allnoconfig (https://download.01.org/0day-ci/archive/20230921/[email protected]/config)
> compiler: powerpc-linux-gcc (GCC) 13.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230921/[email protected]/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> All errors (new ones prefixed by >>):
>
> mm/workingset.c: In function 'workingset_test_recent':
> >> mm/workingset.c:461:32: error: invalid use of undefined type 'struct mem_cgroup'
> 461 | css_get(&eviction_memcg->css);
> | ^~
>

Ah yes, I cannot dereference the memcg pointer here when !CONFIG_MEMCG.
I think we want either mem_cgroup_get_from_id() (a getter version of
mem_cgroup_from_id()) or mem_cgroup_get() (a counterpart to
mem_cgroup_put()), with stubs for !CONFIG_MEMCG. I will make this
change in the next version, but I'll wait for feedback on this version
first.
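
A rough sketch of the mem_cgroup_get() variant I have in mind, living
next to mem_cgroup_put() (name and exact placement are illustrative,
not final):

#ifdef CONFIG_MEMCG
/* Take a reference on the memcg's css; tolerate NULL. */
static inline void mem_cgroup_get(struct mem_cgroup *memcg)
{
	if (memcg)
		css_get(&memcg->css);
}
#else
/*
 * struct mem_cgroup is an incomplete type without CONFIG_MEMCG, so the
 * stub must not touch it.
 */
static inline void mem_cgroup_get(struct mem_cgroup *memcg)
{
}
#endif

workingset_test_recent() would then call mem_cgroup_get(eviction_memcg)
instead of css_get(&eviction_memcg->css), so !CONFIG_MEMCG builds no
longer dereference the incomplete type.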

>
> vim +461 mm/workingset.c
>
> 405
> 406 /**
> 407 * workingset_test_recent - tests if the shadow entry is for a folio that was
> 408 * recently evicted. Also fills in @workingset with the value unpacked from
> 409 * shadow.
> 410 * @shadow: the shadow entry to be tested.
> 411 * @file: whether the corresponding folio is from the file lru.
> 412 * @workingset: where the workingset value unpacked from shadow should
> 413 * be stored.
> 414 *
> 415 * Return: true if the shadow is for a recently evicted folio; false otherwise.
> 416 */
> 417 bool workingset_test_recent(void *shadow, bool file, bool *workingset)
> 418 {
> 419 struct mem_cgroup *eviction_memcg;
> 420 struct lruvec *eviction_lruvec;
> 421 unsigned long refault_distance;
> 422 unsigned long workingset_size;
> 423 unsigned long refault;
> 424 int memcgid;
> 425 struct pglist_data *pgdat;
> 426 unsigned long eviction;
> 427
> 428 rcu_read_lock();
> 429
> 430 if (lru_gen_enabled()) {
> 431 bool recent = lru_gen_test_recent(shadow, file,
> 432 &eviction_lruvec, &eviction,
> 433 workingset);
> 434 rcu_read_unlock();
> 435 return recent;
> 436 }
> 437
> 438 unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
> 439 eviction <<= bucket_order;
> 440
> 441 /*
> 442 * Look up the memcg associated with the stored ID. It might
> 443 * have been deleted since the folio's eviction.
> 444 *
> 445 * Note that in rare events the ID could have been recycled
> 446 * for a new cgroup that refaults a shared folio. This is
> 447 * impossible to tell from the available data. However, this
> 448 * should be a rare and limited disturbance, and activations
> 449 * are always speculative anyway. Ultimately, it's the aging
> 450 * algorithm's job to shake out the minimum access frequency
> 451 * for the active cache.
> 452 *
> 453 * XXX: On !CONFIG_MEMCG, this will always return NULL; it
> 454 * would be better if the root_mem_cgroup existed in all
> 455 * configurations instead.
> 456 */
> 457 eviction_memcg = mem_cgroup_from_id(memcgid);
> 458 if (!mem_cgroup_disabled() && !eviction_memcg)
> 459 return false;
> 460
> > 461 css_get(&eviction_memcg->css);
> 462 rcu_read_unlock();
> 463
> 464 /* Flush stats (and potentially sleep) outside the RCU read section */
> 465 mem_cgroup_flush_stats_ratelimited();
> 466
> 467 eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
> 468 refault = atomic_long_read(&eviction_lruvec->nonresident_age);
> 469
> 470 /*
> 471 * Calculate the refault distance
> 472 *
> 473 * The unsigned subtraction here gives an accurate distance
> 474 * across nonresident_age overflows in most cases. There is a
> 475 * special case: usually, shadow entries have a short lifetime
> 476 * and are either refaulted or reclaimed along with the inode
> 477 * before they get too old. But it is not impossible for the
> 478 * nonresident_age to lap a shadow entry in the field, which
> 479 * can then result in a false small refault distance, leading
> 480 * to a false activation should this old entry actually
> 481 * refault again. However, earlier kernels used to deactivate
> 482 * unconditionally with *every* reclaim invocation for the
> 483 * longest time, so the occasional inappropriate activation
> 484 * leading to pressure on the active list is not a problem.
> 485 */
> 486 refault_distance = (refault - eviction) & EVICTION_MASK;
> 487
> 488 /*
> 489 * Compare the distance to the existing workingset size. We
> 490 * don't activate pages that couldn't stay resident even if
> 491 * all the memory was available to the workingset. Whether
> 492 * workingset competition needs to consider anon or not depends
> 493 * on having free swap space.
> 494 */
> 495 workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
> 496 if (!file) {
> 497 workingset_size += lruvec_page_state(eviction_lruvec,
> 498 NR_INACTIVE_FILE);
> 499 }
> 500 if (mem_cgroup_get_nr_swap_pages(eviction_memcg) > 0) {
> 501 workingset_size += lruvec_page_state(eviction_lruvec,
> 502 NR_ACTIVE_ANON);
> 503 if (file) {
> 504 workingset_size += lruvec_page_state(eviction_lruvec,
> 505 NR_INACTIVE_ANON);
> 506 }
> 507 }
> 508
> 509 mem_cgroup_put(eviction_memcg);
> 510 return refault_distance <= workingset_size;
> 511 }
> 512
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki