On Fri, Mar 31, 2023 at 12:30:11AM -0700, Yosry Ahmed wrote:
> On Fri, Mar 31, 2023 at 12:25 AM Yu Zhao <[email protected]> wrote:
> > On Fri, Mar 31, 2023 at 1:08 AM Yosry Ahmed <[email protected]> wrote:
> >
> > ...
> >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index a3e38851b34ac..bf9d8e175e92a 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -533,7 +533,35 @@ EXPORT_SYMBOL(mm_account_reclaimed_pages);
> > > static void flush_reclaim_state(struct scan_control *sc,
> > > struct reclaim_state *rs)
> > > {
> > > - if (rs) {
> > > + /*
> > > + * Currently, reclaim_state->reclaimed includes three types of pages
> > > + * freed outside of vmscan:
> > > + * (1) Slab pages.
> > > + * (2) Clean file pages from pruned inodes.
> > > + * (3) XFS freed buffer pages.
> > > + *
> > > + * For all of these cases, we have no way of finding out whether these
> > > + * pages were related to the memcg under reclaim. For example, a freed
> > > + * slab page could have had only a single object charged to the memcg
> > > + * under reclaim. Also, populated inodes are not on shrinker LRUs
> > > + * anymore except on highmem systems.
> > > + *
> > > + * Instead of over-reporting the reclaimed pages in a memcg reclaim,
> > > + * only count such pages in system-wide reclaim. This prevents
> > > + * unnecessary retries during memcg charging and false positives from
> > > + * proactive reclaim (memory.reclaim).
> >
> > What happens when writing to the root memory.reclaim?
> >
> > > + *
> > > + * For uncommon cases where the freed pages were actually significantly
> > > + * charged to the memcg under reclaim, and we end up under-reporting, it
> > > + * should be fine. The freed pages will be uncharged anyway, even if
> > > + * they are not reported properly, and we will be able to make forward
> > > + * progress in charging (which is usually in a retry loop).
> > > + *
> > > + * We can go one step further, and report the uncharged objcg pages in
> > > + * memcg reclaim, to make reporting more accurate and reduce
> > > + * under-reporting, but it's probably not worth the complexity for now.
> > > + */
> > > + if (rs && !cgroup_reclaim(sc)) {
> >
> > To answer the question above, global_reclaim() would be preferred.
>
> Great point, global_reclaim() is fairly recent. I didn't see it
> before. Thanks for pointing it out. I will change it for v4 -- will
> wait for more feedback before respinning.
I didn't realize it came back; I deleted it a while ago:
commit b5ead35e7e1d3434ce436dfcb2af32820ce54589
Author: Johannes Weiner <[email protected]>
Date: Sat Nov 30 17:55:40 2019 -0800
mm: vmscan: naming fixes: global_reclaim() and sane_reclaim()
Seven years after introducing the global_reclaim() function, I still have
to double take when reading a callsite. I don't know how others do it,
this is a terrible name.
Invert the meaning and rename it to cgroup_reclaim().
Could you shed some light on why it was brought back? It's not clear
to me from the changelog in a579086c99ed70cc4bfc104348dbe3dd8f2787e6.
We also now have this:
static bool cgroup_reclaim(struct scan_control *sc)
{
return sc->target_mem_cgroup;
}
static bool global_reclaim(struct scan_control *sc)
{
return !sc->target_mem_cgroup || mem_cgroup_is_root(sc->target_mem_cgroup);
}
The name suggests it's the same thing twice, with opposite
polarity. But of course they're subtly different, and not documented.
When do you use which?
On Wed, Apr 5, 2023 at 2:01 PM Johannes Weiner <[email protected]> wrote:
>
> On Fri, Mar 31, 2023 at 12:30:11AM -0700, Yosry Ahmed wrote:
> > On Fri, Mar 31, 2023 at 12:25 AM Yu Zhao <[email protected]> wrote:
> > > On Fri, Mar 31, 2023 at 1:08 AM Yosry Ahmed <[email protected]> wrote:
> > >
> > > ...
> > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index a3e38851b34ac..bf9d8e175e92a 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -533,7 +533,35 @@ EXPORT_SYMBOL(mm_account_reclaimed_pages);
> > > > static void flush_reclaim_state(struct scan_control *sc,
> > > > struct reclaim_state *rs)
> > > > {
> > > > - if (rs) {
> > > > + /*
> > > > + * Currently, reclaim_state->reclaimed includes three types of pages
> > > > + * freed outside of vmscan:
> > > > + * (1) Slab pages.
> > > > + * (2) Clean file pages from pruned inodes.
> > > > + * (3) XFS freed buffer pages.
> > > > + *
> > > > + * For all of these cases, we have no way of finding out whether these
> > > > + * pages were related to the memcg under reclaim. For example, a freed
> > > > + * slab page could have had only a single object charged to the memcg
> > > > + * under reclaim. Also, populated inodes are not on shrinker LRUs
> > > > + * anymore except on highmem systems.
> > > > + *
> > > > + * Instead of over-reporting the reclaimed pages in a memcg reclaim,
> > > > + * only count such pages in system-wide reclaim. This prevents
> > > > + * unnecessary retries during memcg charging and false positives from
> > > > + * proactive reclaim (memory.reclaim).
> > >
> > > What happens when writing to the root memory.reclaim?
> > >
> > > > + *
> > > > + * For uncommon cases where the freed pages were actually significantly
> > > > + * charged to the memcg under reclaim, and we end up under-reporting, it
> > > > + * should be fine. The freed pages will be uncharged anyway, even if
> > > > + * they are not reported properly, and we will be able to make forward
> > > > + * progress in charging (which is usually in a retry loop).
> > > > + *
> > > > + * We can go one step further, and report the uncharged objcg pages in
> > > > + * memcg reclaim, to make reporting more accurate and reduce
> > > > + * under-reporting, but it's probably not worth the complexity for now.
> > > > + */
> > > > + if (rs && !cgroup_reclaim(sc)) {
> > >
> > > To answer the question above, global_reclaim() would be preferred.
> >
> > Great point, global_reclaim() is fairly recent. I didn't see it
> > before. Thanks for pointing it out. I will change it for v4 -- will
> > wait for more feedback before respinning.
>
> I didn't realize it came back, I deleted it a while ago:
>
> commit b5ead35e7e1d3434ce436dfcb2af32820ce54589
> Author: Johannes Weiner <[email protected]>
> Date: Sat Nov 30 17:55:40 2019 -0800
>
> mm: vmscan: naming fixes: global_reclaim() and sane_reclaim()
>
> Seven years after introducing the global_reclaim() function, I still have
> to double take when reading a callsite. I don't know how others do it,
> this is a terrible name.
>
> Invert the meaning and rename it to cgroup_reclaim().
>
> Could you shed some light on why it was brought back? It's not clear
> to me from the changelog in a579086c99ed70cc4bfc104348dbe3dd8f2787e6.
Sorry about the confusion. I was asking Yosry to post an RFC to clear that up.
> We also now have this:
>
> static bool cgroup_reclaim(struct scan_control *sc)
> {
> return sc->target_mem_cgroup;
> }
>
> static bool global_reclaim(struct scan_control *sc)
> {
> return !sc->target_mem_cgroup || mem_cgroup_is_root(sc->target_mem_cgroup);
> }
>
> The name suggests it's the same thing twice, with opposite
> polarity. But of course they're subtly different, and not documented.
>
> When do you use which?
The problem I saw is that target_mem_cgroup is set when writing to the
root memory.reclaim. And for this case, a few places might prefer
global_reclaim(), e.g., in shrink_lruvec(), in addition to where it's
being used.
If this makes sense, we could 1) document it (or rename it) and apply
it to those places, or 2) just unset target_mem_cgroup for root and
delete global_reclaim(). Option 2 might break ABI but still be
acceptable.
If we don't want to decide right now, I can rename global_reclaim() to
root_reclaim() and move it within #ifdef CONFIG_LRU_GEN and probably
come back and revisit later.
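I.e., something along these lines (untested sketch; the check itself
stays the same, only the name and placement change):

#ifdef CONFIG_LRU_GEN
/* sketch: same check as today's global_reclaim(), under the proposed name */
static bool root_reclaim(struct scan_control *sc)
{
        return !sc->target_mem_cgroup ||
               mem_cgroup_is_root(sc->target_mem_cgroup);
}
#endif /* CONFIG_LRU_GEN */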
On Wed, Apr 05, 2023 at 03:09:27PM -0600, Yu Zhao wrote:
> On Wed, Apr 5, 2023 at 2:01 PM Johannes Weiner <[email protected]> wrote:
> > static bool cgroup_reclaim(struct scan_control *sc)
> > {
> > return sc->target_mem_cgroup;
> > }
> >
> > static bool global_reclaim(struct scan_control *sc)
> > {
> > return !sc->target_mem_cgroup || mem_cgroup_is_root(sc->target_mem_cgroup);
> > }
> >
> > The name suggests it's the same thing twice, with opposite
> > polarity. But of course they're subtly different, and not documented.
> >
> > When do you use which?
>
> The problem I saw is that target_mem_cgroup is set when writing to the
> root memory.reclaim. And for this case, a few places might prefer
> global_reclaim(), e.g., in shrink_lruvec(), in addition to where it's
> being used.
>
> If this makes sense, we could 1) document it (or rename it) and apply
> it to those places, or 2) just unset target_mem_cgroup for root and
> delete global_reclaim(). Option 2 might break ABI but still be
> acceptable.
Ah, cgroup_reclaim() tests whether it's limit/proactive reclaim or
allocator reclaim. global_reclaim() tests whether it's root reclaim
(which, now that memory.reclaim exists, could come from either of those).
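If we end up keeping both, documenting the distinction might look
something like this (rough wording, sketch only):

/* Was reclaim triggered by a cgroup: limit or proactive (memory.reclaim)? */
static bool cgroup_reclaim(struct scan_control *sc)
{
        return sc->target_mem_cgroup;
}

/* Is the whole node fair game: allocator reclaim or root memory.reclaim? */
static bool global_reclaim(struct scan_control *sc)
{
        return !sc->target_mem_cgroup ||
               mem_cgroup_is_root(sc->target_mem_cgroup);
}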
I suppose we didn't clarify when introducing memory.reclaim what the
semantics should be on the root cgroup:
- We currently exclude PGSCAN and PGSTEAL stats from /proc/vmstat for
limit reclaim to tell cgroup constraints from physical pressure. We
currently exclude root memory.reclaim activity as well. Seems okay.
- The file_is_tiny heuristic is for allocator reclaim near OOM. It's
currently excluded for root memory.reclaim, which seems okay too.
- Like limit reclaim, root memory.reclaim currently NEVER swaps when
global swappiness is 0. The whole cgroup-specific swappiness=0
semantic is kind of quirky. But I suppose we can keep it as-is.
- Proportional reclaim is disabled for everybody but direct reclaim
from the allocator at initial priority. Effectively disabled for all
situations where it might matter, including root memory.reclaim. We
should also keep this as-is.
- Writeback throttling is interesting. Historically, kswapd sets the
congestion state when it runs into lots of PG_reclaim pages, and
clears it when the node is balanced. This throttles direct reclaim.
Cgroup limit reclaim would set and clear congestion on non-root only
to do local limit-reclaim throttling. But now root memory.reclaim
might clear a bit set by kswapd before the node is balanced, and
release direct reclaimers from throttling prematurely. This seems
wrong. However, the alternative is throttling memory.reclaim on
subgroup congestion but not root, which also seems wrong.
- Root memory.reclaim is exempted from the compaction_ready() bail
condition as well as soft limit reclaim. But they'd both be no-ops
anyway if we changed cgroup_reclaim() semantics.
IMO we should figure out what we want to do in those cases
above, at least for writeback throttling.
Are you guys using root-level proactive reclaim?
> If we don't want to decide right now, I can rename global_reclaim() to
> root_reclaim() and move it within #ifdef CONFIG_LRU_GEN and probably
> come back and revisit later.
So conventional vmscan treats root-level memory.reclaim the same as
any other cgroup reclaim. And the cgroup_reclaim() checks are mostly
reserved for (questionable) allocator reclaim-specific heuristics.
Is there a chance lrugen could do the same, and you'd be fine with
using cgroup_reclaim()? Or is it a data structure problem?
If so, I agree it could be better to move it to CONFIG_LRU_GEN and
rename it for the time being. root_reclaim() SGTM.
On Thu, Apr 6, 2023 at 1:02 PM Johannes Weiner <[email protected]> wrote:
>
> On Wed, Apr 05, 2023 at 03:09:27PM -0600, Yu Zhao wrote:
> > On Wed, Apr 5, 2023 at 2:01 PM Johannes Weiner <[email protected]> wrote:
> > > static bool cgroup_reclaim(struct scan_control *sc)
> > > {
> > > return sc->target_mem_cgroup;
> > > }
> > >
> > > static bool global_reclaim(struct scan_control *sc)
> > > {
> > > return !sc->target_mem_cgroup || mem_cgroup_is_root(sc->target_mem_cgroup);
> > > }
> > >
> > > The name suggests it's the same thing twice, with opposite
> > > polarity. But of course they're subtly different, and not documented.
> > >
> > > When do you use which?
> >
> > The problem I saw is that target_mem_cgroup is set when writing to the
> > root memory.reclaim. And for this case, a few places might prefer
> > global_reclaim(), e.g., in shrink_lruvec(), in addition to where it's
> > being used.
> >
> > If this makes sense, we could 1) document it (or rename it) and apply
> > it to those places, or 2) just unset target_mem_cgroup for root and
> > delete global_reclaim(). Option 2 might break ABI but still be
> > acceptable.
>
> Ah, cgroup_reclaim() tests whether it's limit/proactive reclaim or
> allocator reclaim. global_reclaim() tests whether it's root reclaim
> (which could be from either after memory.reclaim).
>
> I suppose we didn't clarify when introducing memory.reclaim what the
> semantics should be on the root cgroup:
Thanks for the great summary, Johannes!
>
> - We currently exclude PGSCAN and PGSTEAL stats from /proc/vmstat for
> limit reclaim to tell cgroup constraints from physical pressure. We
> currently exclude root memory.reclaim activity as well. Seems okay.
>
> - The file_is_tiny heuristic is for allocator reclaim near OOM. It's
> currently excluded for root memory.reclaim, which seems okay too.
>
> - Like limit reclaim, root memory.reclaim currently NEVER swaps when
> global swappiness is 0. The whole cgroup-specific swappiness=0
> semantic is kind of quirky. But I suppose we can keep it as-is.
>
> - Proportional reclaim is disabled for everybody but direct reclaim
> from the allocator at initial priority. Effectively disabled for all
> situations where it might matter, including root memory.reclaim. We
> should also keep this as-is.
Agree with the above.
>
> - Writeback throttling is interesting. Historically, kswapd set the
> congestion state when running into lots of PG_reclaim pages, and
> clear it when the node is balanced. This throttles direct reclaim.
>
> Cgroup limit reclaim would set and clear congestion on non-root only
> to do local limit-reclaim throttling. But now root memory.reclaim
> might clear a bit set by kswapd before the node is balanced, and
> release direct reclaimers from throttling prematurely. This seems
> wrong. However, the alternative is throttling memory.reclaim on
> subgroup congestion but not root, which seems also wrong.
Ah yes, that is a problem.
It seems like the behavior of the congested bit is different on the
root's lruvec: it throttles direct reclaimers until the node is
balanced, not just until reclaim completes. Is my understanding
correct?
If yes, would it be a solution to introduce a dedicated bit for this
behavior, LRUVEC_UNBALANCED or so?
We can set this bit in kswapd only for root, and only clear it when
the node is balanced.
This would be separate from LRUVEC_CONGESTED, which always gets
cleared when reclaim completes.
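Very roughly, something like this (the name and the comments are made
up, just to show the shape of the idea):

/* include/linux/mmzone.h -- rough sketch only */
enum lruvec_flags {
        LRUVEC_CONGESTED,       /* as today: cleared when reclaim completes */
        LRUVEC_UNBALANCED,      /* set by kswapd on the root lruvec only,
                                 * cleared once the node is balanced
                                 */
};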
It might be confusing that we'd set both LRUVEC_CONGESTED and
LRUVEC_UNBALANCED for root; perhaps it would be better to have a more
explicit condition for setting LRUVEC_UNBALANCED in the kswapd logic
-- but that may be a change of behavior.
I understand the special casing might not be pretty, but it seems like
it may already be a special case, so perhaps a separate bit would make
this clearer.
Just thinking out loud here, I am not familiar with this part of
reclaim code so please excuse any stupid statements I might have made.
>
> - Root memory.reclaim is exempted from the compaction_ready() bail
> condition as well as soft limit reclaim. But they'd both be no-ops
> anyway if we changed cgroup_reclaim() semantics.
>
> IMO we should either figure out what we want to do in those cases
> above, at least for writeback throttling.
>
> Are you guys using root-level proactive reclaim?
We do, but not in its current upstream form. We have some special
flags to only iterate the root memcg and zombie memcgs, and we
periodically use proactive reclaim on the root memcg with these flags.
The purpose is to reclaim any unaccounted memory or memory charged to
zombie memcgs if possible -- potentially freeing zombie memcgs as
well.
>
> > If we don't want to decide right now, I can rename global_reclaim() to
> > root_reclaim() and move it within #ifdef CONFIG_LRU_GEN and probably
> > come back and revisit later.
>
> So conventional vmscan treats root-level memory.reclaim the same as
> any other cgroup reclaim. And the cgroup_reclaim() checks are mostly
> reserved for (questionable) allocator reclaim-specific heuristics.
>
> Is there a chance lrugen could do the same, and you'd be fine with
> using cgroup_reclaim()? Or is it a data structure problem?
I will let Yu answer this question, but I will take a stab at it just
for my own education :)
It seems like MGLRU uses global_reclaim() mainly for two things
(sketched below):
(1) For global_reclaim(), we use the memcg fairness/scalability logic
introduced by [1]: instead of mem_cgroup_iter(), we walk an LRU of
memcgs, so we don't have to iterate all memcgs every time and
parallel reclaimers can iterate different memcgs.
(2) For !global_reclaim(), we do not abort early before traversing all
memcgs, to be fair to all children.
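Very roughly, the split looks like this (pseudocode; the two helper
names below are made up, not the actual functions):

        if (global_reclaim(sc)) {
                /* (1): walk a per-node LRU of memcgs; parallel reclaimers
                 * pick different memcgs and can stop once enough memory
                 * has been reclaimed
                 */
                walk_memcg_lru(pgdat, sc);      /* made-up name */
        } else {
                /* (2): mem_cgroup_iter() over every child memcg, without
                 * aborting early, so all children are treated fairly
                 */
                iterate_all_memcgs(sc);         /* made-up name */
        }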
If we use cgroup_reclaim() instead of global_reclaim(), we move
proactive reclaim on root from (1) to (2) above.
My gut feeling is that it's fine, because proactive reclaim is more
latency tolerant, and other direct reclaimers will be using the LRU of
memcgs anyway, so proactive reclaim should not affect their
fairness/scalability.
[1]https://lkml.kernel.org/r/[email protected]
>
> If so, I agree it could be better to move it to CONFIG_LRU_GEN and
> rename it for the time being. root_reclaim() SGTM.
On Thu, Apr 6, 2023 at 1:02 PM Johannes Weiner <[email protected]> wrote:
>
> On Wed, Apr 05, 2023 at 03:09:27PM -0600, Yu Zhao wrote:
> > On Wed, Apr 5, 2023 at 2:01 PM Johannes Weiner <[email protected]> wrote:
> > > static bool cgroup_reclaim(struct scan_control *sc)
> > > {
> > > return sc->target_mem_cgroup;
> > > }
> > >
> > > static bool global_reclaim(struct scan_control *sc)
> > > {
> > > return !sc->target_mem_cgroup || mem_cgroup_is_root(sc->target_mem_cgroup);
> > > }
> > >
> > > The name suggests it's the same thing twice, with opposite
> > > polarity. But of course they're subtly different, and not documented.
> > >
> > > When do you use which?
> >
> > The problem I saw is that target_mem_cgroup is set when writing to the
> > root memory.reclaim. And for this case, a few places might prefer
> > global_reclaim(), e.g., in shrink_lruvec(), in addition to where it's
> > being used.
> >
> > If this makes sense, we could 1) document it (or rename it) and apply
> > it to those places, or 2) just unset target_mem_cgroup for root and
> > delete global_reclaim(). Option 2 might break ABI but still be
> > acceptable.
>
> Ah, cgroup_reclaim() tests whether it's limit/proactive reclaim or
> allocator reclaim. global_reclaim() tests whether it's root reclaim
> (which could be from either after memory.reclaim).
>
> I suppose we didn't clarify when introducing memory.reclaim what the
> semantics should be on the root cgroup:
>
> - We currently exclude PGSCAN and PGSTEAL stats from /proc/vmstat for
> limit reclaim to tell cgroup constraints from physical pressure. We
> currently exclude root memory.reclaim activity as well. Seems okay.
>
> - The file_is_tiny heuristic is for allocator reclaim near OOM. It's
> currently excluded for root memory.reclaim, which seems okay too.
>
> - Like limit reclaim, root memory.reclaim currently NEVER swaps when
> global swappiness is 0. The whole cgroup-specific swappiness=0
> semantic is kind of quirky. But I suppose we can keep it as-is.
>
> - Proportional reclaim is disabled for everybody but direct reclaim
> from the allocator at initial priority. Effectively disabled for all
> situations where it might matter, including root memory.reclaim. We
> should also keep this as-is.
>
> - Writeback throttling is interesting. Historically, kswapd set the
> congestion state when running into lots of PG_reclaim pages, and
> clear it when the node is balanced. This throttles direct reclaim.
>
> Cgroup limit reclaim would set and clear congestion on non-root only
> to do local limit-reclaim throttling. But now root memory.reclaim
> might clear a bit set by kswapd before the node is balanced, and
> release direct reclaimers from throttling prematurely. This seems
> wrong. However, the alternative is throttling memory.reclaim on
> subgroup congestion but not root, which seems also wrong.
>
> - Root memory.reclaim is exempted from the compaction_ready() bail
> condition as well as soft limit reclaim. But they'd both be no-ops
> anyway if we changed cgroup_reclaim() semantics.
>
> IMO we should either figure out what we want to do in those cases
> above, at least for writeback throttling.
>
> Are you guys using root-level proactive reclaim?
>
> > If we don't want to decide right now, I can rename global_reclaim() to
> > root_reclaim() and move it within #ifdef CONFIG_LRU_GEN and probably
> > come back and revisit later.
>
> So conventional vmscan treats root-level memory.reclaim the same as
> any other cgroup reclaim. And the cgroup_reclaim() checks are mostly
> reserved for (questionable) allocator reclaim-specific heuristics.
>
> Is there a chance lrugen could do the same, and you'd be fine with
> using cgroup_reclaim()? Or is it a data structure problem?
>
> If so, I agree it could be better to move it to CONFIG_LRU_GEN and
> rename it for the time being. root_reclaim() SGTM.
Oh, and if we decide to keep the helper as root_reclaim(), I would
prefer it to be outside CONFIG_LRU_GEN so that I can still use it in
the patch series that this thread was originally part of (ignoring
non-LRU freed pages in memcg reclaim).
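I.e., the check in this series would then just become (sketch, assuming
the helper keeps today's global_reclaim() test):

-	if (rs && !cgroup_reclaim(sc)) {
+	if (rs && root_reclaim(sc)) {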
On Fri, Apr 7, 2023 at 12:03 AM Yosry Ahmed <[email protected]> wrote:
>
> On Thu, Apr 6, 2023 at 1:02 PM Johannes Weiner <[email protected]> wrote:
> >
> > On Wed, Apr 05, 2023 at 03:09:27PM -0600, Yu Zhao wrote:
> > > On Wed, Apr 5, 2023 at 2:01 PM Johannes Weiner <[email protected]> wrote:
> > > > static bool cgroup_reclaim(struct scan_control *sc)
> > > > {
> > > > return sc->target_mem_cgroup;
> > > > }
> > > >
> > > > static bool global_reclaim(struct scan_control *sc)
> > > > {
> > > > return !sc->target_mem_cgroup || mem_cgroup_is_root(sc->target_mem_cgroup);
> > > > }
> > > >
> > > > The name suggests it's the same thing twice, with opposite
> > > > polarity. But of course they're subtly different, and not documented.
> > > >
> > > > When do you use which?
> > >
> > > The problem I saw is that target_mem_cgroup is set when writing to the
> > > root memory.reclaim. And for this case, a few places might prefer
> > > global_reclaim(), e.g., in shrink_lruvec(), in addition to where it's
> > > being used.
> > >
> > > If this makes sense, we could 1) document it (or rename it) and apply
> > > it to those places, or 2) just unset target_mem_cgroup for root and
> > > delete global_reclaim(). Option 2 might break ABI but still be
> > > acceptable.
> >
> > Ah, cgroup_reclaim() tests whether it's limit/proactive reclaim or
> > allocator reclaim. global_reclaim() tests whether it's root reclaim
> > (which could be from either after memory.reclaim).
> >
> > I suppose we didn't clarify when introducing memory.reclaim what the
> > semantics should be on the root cgroup:
> >
> > - We currently exclude PGSCAN and PGSTEAL stats from /proc/vmstat for
> > limit reclaim to tell cgroup constraints from physical pressure. We
> > currently exclude root memory.reclaim activity as well. Seems okay.
> >
> > - The file_is_tiny heuristic is for allocator reclaim near OOM. It's
> > currently excluded for root memory.reclaim, which seems okay too.
> >
> > - Like limit reclaim, root memory.reclaim currently NEVER swaps when
> > global swappiness is 0. The whole cgroup-specific swappiness=0
> > semantic is kind of quirky. But I suppose we can keep it as-is.
> >
> > - Proportional reclaim is disabled for everybody but direct reclaim
> > from the allocator at initial priority. Effectively disabled for all
> > situations where it might matter, including root memory.reclaim. We
> > should also keep this as-is.
> >
> > - Writeback throttling is interesting. Historically, kswapd set the
> > congestion state when running into lots of PG_reclaim pages, and
> > clear it when the node is balanced. This throttles direct reclaim.
> >
> > Cgroup limit reclaim would set and clear congestion on non-root only
> > to do local limit-reclaim throttling. But now root memory.reclaim
> > might clear a bit set by kswapd before the node is balanced, and
> > release direct reclaimers from throttling prematurely. This seems
> > wrong. However, the alternative is throttling memory.reclaim on
> > subgroup congestion but not root, which seems also wrong.
> >
> > - Root memory.reclaim is exempted from the compaction_ready() bail
> > condition as well as soft limit reclaim. But they'd both be no-ops
> > anyway if we changed cgroup_reclaim() semantics.
> >
> > IMO we should either figure out what we want to do in those cases
> > above, at least for writeback throttling.
> >
> > Are you guys using root-level proactive reclaim?
> >
> > > If we don't want to decide right now, I can rename global_reclaim() to
> > > root_reclaim() and move it within #ifdef CONFIG_LRU_GEN and probably
> > > come back and revisit later.
> >
> > So conventional vmscan treats root-level memory.reclaim the same as
> > any other cgroup reclaim. And the cgroup_reclaim() checks are mostly
> > reserved for (questionable) allocator reclaim-specific heuristics.
> >
> > Is there a chance lrugen could do the same, and you'd be fine with
> > using cgroup_reclaim()? Or is it a data structure problem?
> >
> > If so, I agree it could be better to move it to CONFIG_LRU_GEN and
> > rename it for the time being. root_reclaim() SGTM.
>
> Oh and if we decide to keep the helper as root_reclaim I would prefer
> it to be outside CONFIG_LRU_GEN so that I can still use it in the
> patch series that this thread was originally a part of (ignoring
> non-LRU freed pages in memcg reclaim).
Thanks for the summaries. I think you both covered all the important details.
Just to recap -- it all comes down to how we want to define the
semantics of root memory.reclaim for *future* use cases, since
currently it has no known users (I personally use it for testing
purposes, which is not important at all). So I'm fine with whatever
you two see fit (delete, refactor or rename).
On Thu, Apr 6, 2023 at 7:12 PM Yosry Ahmed <[email protected]> wrote:
>
> On Thu, Apr 6, 2023 at 1:02 PM Johannes Weiner <[email protected]> wrote:
> >
> > On Wed, Apr 05, 2023 at 03:09:27PM -0600, Yu Zhao wrote:
> > > On Wed, Apr 5, 2023 at 2:01 PM Johannes Weiner <[email protected]> wrote:
> > > > static bool cgroup_reclaim(struct scan_control *sc)
> > > > {
> > > > return sc->target_mem_cgroup;
> > > > }
> > > >
> > > > static bool global_reclaim(struct scan_control *sc)
> > > > {
> > > > return !sc->target_mem_cgroup || mem_cgroup_is_root(sc->target_mem_cgroup);
> > > > }
> > > >
> > > > The name suggests it's the same thing twice, with opposite
> > > > polarity. But of course they're subtly different, and not documented.
> > > >
> > > > When do you use which?
> > >
> > > The problem I saw is that target_mem_cgroup is set when writing to the
> > > root memory.reclaim. And for this case, a few places might prefer
> > > global_reclaim(), e.g., in shrink_lruvec(), in addition to where it's
> > > being used.
> > >
> > > If this makes sense, we could 1) document it (or rename it) and apply
> > > it to those places, or 2) just unset target_mem_cgroup for root and
> > > delete global_reclaim(). Option 2 might break ABI but still be
> > > acceptable.
> >
> > Ah, cgroup_reclaim() tests whether it's limit/proactive reclaim or
> > allocator reclaim. global_reclaim() tests whether it's root reclaim
> > (which could be from either after memory.reclaim).
> >
> > I suppose we didn't clarify when introducing memory.reclaim what the
> > semantics should be on the root cgroup:
>
> Thanks for the great summary, Johannes!
>
> >
> > - We currently exclude PGSCAN and PGSTEAL stats from /proc/vmstat for
> > limit reclaim to tell cgroup constraints from physical pressure. We
> > currently exclude root memory.reclaim activity as well. Seems okay.
> >
> > - The file_is_tiny heuristic is for allocator reclaim near OOM. It's
> > currently excluded for root memory.reclaim, which seems okay too.
> >
> > - Like limit reclaim, root memory.reclaim currently NEVER swaps when
> > global swappiness is 0. The whole cgroup-specific swappiness=0
> > semantic is kind of quirky. But I suppose we can keep it as-is.
> >
> > - Proportional reclaim is disabled for everybody but direct reclaim
> > from the allocator at initial priority. Effectively disabled for all
> > situations where it might matter, including root memory.reclaim. We
> > should also keep this as-is.
>
> Agree with the above.
>
> >
> > - Writeback throttling is interesting. Historically, kswapd set the
> > congestion state when running into lots of PG_reclaim pages, and
> > clear it when the node is balanced. This throttles direct reclaim.
> >
> > Cgroup limit reclaim would set and clear congestion on non-root only
> > to do local limit-reclaim throttling. But now root memory.reclaim
> > might clear a bit set by kswapd before the node is balanced, and
> > release direct reclaimers from throttling prematurely. This seems
> > wrong. However, the alternative is throttling memory.reclaim on
> > subgroup congestion but not root, which seems also wrong.
>
> Ah yes, that is a problem.
>
> It seems like the behavior of the congested bit is different on the
> root's lruvec, it would throttle direct reclaimers until the node is
> balanced, not just until reclaim completes. Is my understanding
> correct?
>
> If yes, would it be a solution to introduce a dedicated bit for this
> behavior, LRUVEC_UNBALANCED or so?
> We can set this bit in kswapd only for root, and only clear it when
> the node is balanced.
> This would be separate from LRUVEC_CONGESTED that always gets cleared
> when reclaim completes.
>
> Although it might be confusing that we set both LRUVEC_CONGESTED and
> LRUVEC_UNBALANCED for root, perhaps it would be better to have a more
> explicit condition to set LRUVEC_UNBALANCED in kswapd logic -- but
> this may be a change of behavior.
>
> I understand the special casing might not be pretty, but it seems like
> it may already be a special case, so perhaps a separate bit will make
> this more clear.
>
> Just thinking out loud here, I am not familiar with this part of
> reclaim code so please excuse any stupid statements I might have made.
>
> >
> > - Root memory.reclaim is exempted from the compaction_ready() bail
> > condition as well as soft limit reclaim. But they'd both be no-ops
> > anyway if we changed cgroup_reclaim() semantics.
> >
> > IMO we should either figure out what we want to do in those cases
> > above, at least for writeback throttling.
> >
> > Are you guys using root-level proactive reclaim?
>
> We do, but not in its current upstream form. We have some special
> flags to only iterate the root memcg and zombie memcgs, and we
> periodically use proactive reclaim on the root memcg with these flags.
> The purpose is to reclaim any unaccounted memory or memory charged to
> zombie memcgs if possible -- potentially freeing zombie memcgs as
> well.
>
> >
> > > If we don't want to decide right now, I can rename global_reclaim() to
> > > root_reclaim() and move it within #ifdef CONFIG_LRU_GEN and probably
> > > come back and revisit later.
> >
> > So conventional vmscan treats root-level memory.reclaim the same as
> > any other cgroup reclaim. And the cgroup_reclaim() checks are mostly
> > reserved for (questionable) allocator reclaim-specific heuristics.
> >
> > Is there a chance lrugen could do the same, and you'd be fine with
> > using cgroup_reclaim()? Or is it a data structure problem?
>
> I will let Yu answer this question, but I will take a stab at it just
> for my own education :)
>
> It seems like we use global_reclaim() mainly for two things for MGLRU:
> (1) For global_reclaim(), we use the memcg fairness/scalability logic
> that was introduced by [1], where for global_reclaim() we don't use
> mem_cgroup_iter(), but we use an LRU of memcgs for fairness and
> scalability purposes (we don't have to iterate all memcgs every time,
> and parallel reclaimers can iterate different memcgs).
>
> (2) For !global_reclaim(), we do not abort early before traversing all
> memcgs to be fair to all children.
>
> If we use cgroup_reclaim() instead of global_reclaim(), we move
> proactive reclaim on root from (1) to (2) above.
> My gut feeling is that it's fine, because proactive reclaim is more
> latency tolerant, and other direct reclaimers will be using the LRU of
> memcgs anyway, so proactive reclaim should not affect their
> fairness/scalability.
>
> [1]https://lkml.kernel.org/r/[email protected]
>
> >
> > If so, I agree it could be better to move it to CONFIG_LRU_GEN and
> > rename it for the time being. root_reclaim() SGTM.
Hey Johannes, any thoughts on this?