2022-12-13 15:59:27

by Michal Hocko

Subject: memcg reclaim demotion wrt. isolation

Hi,
I have just noticed that pages allocated for demotion targets
include __GFP_KSWAPD_RECLAIM (through GFP_NOWAIT). This is the case
since the code has been introduced by 26aa2d199d6f ("mm/migrate: demote
pages during reclaim"). I suspect the intention is to trigger the aging
on the fallback node and either drop or further demote oldest pages.
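
For reference, this is the allocation context I mean, paraphrased from
demote_folio_list() in mm/vmscan.c around v6.1 (exact flags may differ
between kernel versions):

        struct migration_target_control mtc = {
                /*
                 * Allocate from 'node', or fail quickly and quietly.
                 * GFP_NOWAIT is __GFP_KSWAPD_RECLAIM, so even a failing
                 * allocation on the demotion target still wakes that
                 * node's kswapd.
                 */
                .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
                        __GFP_NOMEMALLOC | GFP_NOWAIT,
                .nid = target_nid,
                .nmask = &allowed_mask
        };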

This makes sense, but I suspect this wasn't intended for
memcg-triggered reclaim as well. This would mean that memory pressure in one
hierarchy could trigger paging out pages of a different hierarchy if the
demotion target is close to full.

I haven't really checked the current kswapd wake-up conditions but I
suspect that kswapd would back off in most cases so this shouldn't
really cause any big problems. But I guess it would be better to simply
not wake kswapd up for the memcg reclaim. What do you think?
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8fcc5fa768c0..1f3161173b85 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1568,7 +1568,7 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
  * Folios which are not demoted are left on @demote_folios.
  */
 static unsigned int demote_folio_list(struct list_head *demote_folios,
-                                      struct pglist_data *pgdat)
+                                      struct pglist_data *pgdat, bool cgroup_reclaim)
 {
         int target_nid = next_demotion_node(pgdat->node_id);
         unsigned int nr_succeeded;
@@ -1589,6 +1589,10 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
         if (list_empty(demote_folios))
                 return 0;
 
+        /* local memcg reclaim shouldn't directly reclaim from other memcgs */
+        if (cgroup_reclaim)
+                mtc.gfp_mask &= ~__GFP_RECLAIM;
+
         if (target_nid == NUMA_NO_NODE)
                 return 0;
 
@@ -2066,7 +2070,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
         /* 'folio_list' is always empty here */
 
         /* Migrate folios selected for demotion */
-        nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
+        nr_reclaimed += demote_folio_list(&demote_folios, pgdat, cgroup_reclaim(sc));
         /* Folios that could not be demoted are still in @demote_folios */
         if (!list_empty(&demote_folios)) {
                 /* Folios which weren't demoted go back on @folio_list for retry: */
--
Michal Hocko
SUSE Labs


2022-12-13 16:44:47

by Johannes Weiner

Subject: Re: memcg reclaim demotion wrt. isolation

On Tue, Dec 13, 2022 at 04:41:10PM +0100, Michal Hocko wrote:
> Hi,
> I have just noticed that pages allocated for demotion targets
> include __GFP_KSWAPD_RECLAIM (through GFP_NOWAIT). This is the case
> since the code has been introduced by 26aa2d199d6f ("mm/migrate: demote
> pages during reclaim"). I suspect the intention is to trigger the aging
> on the fallback node and either drop or further demote oldest pages.
>
> This makes sense, but I suspect this wasn't intended for
> memcg-triggered reclaim as well. This would mean that memory pressure in one
> hierarchy could trigger paging out pages of a different hierarchy if the
> demotion target is close to full.

This is also true if you don't do demotion. If a cgroup tries to
allocate memory on a full node (i.e. mbind()), it may wake kswapd or
enter global reclaim directly which may push out the memory of other
cgroups, regardless of the respective cgroup limits.

The demotion allocations don't strike me as any different. They're
just allocations on behalf of a cgroup. I would expect them to wake
kswapd and reclaim physical memory as needed.
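
(For context, the mechanism is roughly this, paraphrased from the page
allocator slowpath in mm/page_alloc.c; details vary between versions:

        /* gfp_to_alloc_flags(): __GFP_KSWAPD_RECLAIM translates to ALLOC_KSWAPD */
        if (gfp_mask & __GFP_KSWAPD_RECLAIM)
                alloc_flags |= ALLOC_KSWAPD;

        /* __alloc_pages_slowpath(): kswapd gets woken for the targeted node(s) */
        if (alloc_flags & ALLOC_KSWAPD)
                wake_all_kswapds(order, gfp_mask, ac);

Any allocation carrying that flag pokes kswapd on the node it targets,
whether it is a regular fault, an mbind() or a demotion.)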

2022-12-13 22:44:42

by Dave Hansen

Subject: Re: memcg reclaim demotion wrt. isolation

On 12/13/22 07:41, Michal Hocko wrote:
> This makes sense, but I suspect this wasn't intended for
> memcg-triggered reclaim as well. This would mean that memory pressure in one
> hierarchy could trigger paging out pages of a different hierarchy if the
> demotion target is close to full.
>
> I haven't really checked the current kswapd wake-up conditions but I
> suspect that kswapd would back off in most cases so this shouldn't
> really cause any big problems. But I guess it would be better to simply
> not wake kswapd up for the memcg reclaim. What do you think?

You're right that this wasn't really considering memcg-based reclaim.
The entire original idea was that demotion allocations should fail fast,
but it would be nice if they could kick kswapd so they would
*eventually* succeed and not just fail fast forever.

Before we go trying to patch anything, I'd be really interested in what it
does in practice. How much does it actually wake up kswapd? Does
kswapd cause any collateral damage?

I don't have any fundamental objections to the patch, though.

2022-12-14 03:09:34

by Huang, Ying

Subject: Re: memcg reclaim demotion wrt. isolation

Michal Hocko <[email protected]> writes:

> Hi,
> I have just noticed that pages allocated for demotion targets
> include __GFP_KSWAPD_RECLAIM (through GFP_NOWAIT). This is the case
> since the code has been introduced by 26aa2d199d6f ("mm/migrate: demote
> pages during reclaim").

IIUC, the issue was introduced by commit 3f1509c57b1b ("Revert
"mm/vmscan: never demote for memcg reclaim""). Before that, we will not
demote for memcg reclaim.

> I suspect the intention is to trigger the aging on the fallback node
> and either drop or further demote oldest pages.
>
> This makes sense, but I suspect this wasn't intended for
> memcg-triggered reclaim as well. This would mean that memory pressure in one
> hierarchy could trigger paging out pages of a different hierarchy if the
> demotion target is close to full.

It seems unnecessary to wake up kswapd on the demotion target node
in most cases, because we will try to reclaim on the demotion target
nodes in the loop of do_try_to_free_pages(). It may be better to walk
the zonelist in reverse order, because the demotion targets are
usually located toward the end of the zonelist.
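
Something like the following, purely as an illustration of the idea
(shrink_zones_reverse() is a hypothetical helper, not existing code, and
it skips the usual cpuset/compaction checks done by shrink_zones()):

static void shrink_zones_reverse(struct zonelist *zonelist,
                                 struct scan_control *sc)
{
        struct zoneref *zrefs = zonelist->_zonerefs;
        pg_data_t *last_pgdat = NULL;
        int nr = 0, i;

        /* zonelists are terminated by an entry with a NULL zone */
        while (zrefs[nr].zone)
                nr++;

        /* walk back to front: demotion targets sit at the tail */
        for (i = nr - 1; i >= 0; i--) {
                pg_data_t *pgdat = zrefs[i].zone->zone_pgdat;

                if (pgdat == last_pgdat)        /* one pass per node */
                        continue;
                last_pgdat = pgdat;
                shrink_node(pgdat, sc);
        }
}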

Best Regards,
Huang, Ying

2022-12-14 09:51:54

by Michal Hocko

Subject: Re: memcg reclaim demotion wrt. isolation

On Tue 13-12-22 14:26:42, Dave Hansen wrote:
> On 12/13/22 07:41, Michal Hocko wrote:
> > This makes sense, but I suspect this wasn't intended for
> > memcg-triggered reclaim as well. This would mean that memory pressure in one
> > hierarchy could trigger paging out pages of a different hierarchy if the
> > demotion target is close to full.
> >
> > I haven't really checked the current kswapd wake-up conditions but I
> > suspect that kswapd would back off in most cases so this shouldn't
> > really cause any big problems. But I guess it would be better to simply
> > not wake kswapd up for the memcg reclaim. What do you think?
>
> You're right that this wasn't really considering memcg-based reclaim.
> The entire original idea was that demotion allocations should fail fast,
> but it would be nice if they could kick kswapd so they would
> *eventually* succeed and not just fail fast forever.
>
> Before we go trying to patch anything, I'd be really interested in what it
> does in practice. How much does it actually wake up kswapd? Does
> kswapd cause any collateral damage?

I haven't seen any real problem so far. I was just trying to wrap my
head around the consequences of the discussed memory.demote memcg interface
[1]. See my reply to Johannes about specific concerns.

[1] http://lkml.kernel.org/r/[email protected]
--
Michal Hocko
SUSE Labs

2022-12-14 10:08:08

by Michal Hocko

Subject: Re: memcg reclaim demotion wrt. isolation

On Wed 14-12-22 10:57:52, Huang, Ying wrote:
> Michal Hocko <[email protected]> writes:
[...]
> > This makes sense, but I suspect this wasn't intended for
> > memcg-triggered reclaim as well. This would mean that memory pressure in one
> > hierarchy could trigger paging out pages of a different hierarchy if the
> > demotion target is close to full.
>
> It seems unnecessary to wake up kswapd on the demotion target node
> in most cases, because we will try to reclaim on the demotion target
> nodes in the loop of do_try_to_free_pages(). It may be better to walk
> the zonelist in reverse order, because the demotion targets are
> usually located toward the end of the zonelist.

Reclaiming from demotion targets first would deal with that as well.
Thanks! Let's establish whether this is something we really need/want
to fix first.
--
Michal Hocko
SUSE Labs

2022-12-14 10:11:16

by Michal Hocko

Subject: Re: memcg reclaim demotion wrt. isolation

On Tue 13-12-22 17:14:48, Johannes Weiner wrote:
> On Tue, Dec 13, 2022 at 04:41:10PM +0100, Michal Hocko wrote:
> > Hi,
> > I have just noticed that pages allocated for demotion targets
> > include __GFP_KSWAPD_RECLAIM (through GFP_NOWAIT). This is the case
> > since the code has been introduced by 26aa2d199d6f ("mm/migrate: demote
> > pages during reclaim"). I suspect the intention is to trigger the aging
> > on the fallback node and either drop or further demote oldest pages.
> >
> > This makes sense, but I suspect this wasn't intended for
> > memcg-triggered reclaim as well. This would mean that memory pressure in one
> > hierarchy could trigger paging out pages of a different hierarchy if the
> > demotion target is close to full.
>
> This is also true if you don't do demotion. If a cgroup tries to
> allocate memory on a full node (i.e. mbind()), it may wake kswapd or
> enter global reclaim directly which may push out the memory of other
> cgroups, regardless of the respective cgroup limits.

You are right on this. But this is describing a slightly different
situation IMO.

> The demotion allocations don't strike me as any different. They're
> just allocations on behalf of a cgroup. I would expect them to wake
> kswapd and reclaim physical memory as needed.

I am not sure this is an expected behavior. Consider the currently
discussed memory.demote interface where the userspace can trigger
(almost) arbitrary demotions. This can deplete fallback nodes without
over-committing the memory overall yet push out demoted memory from
other workloads. From the user POV it would look like reclaim while
the overall memory is far from depleted, so it would be considered
premature and warrant a bug report.

The reclaim behavior would make more sense to me if it was constrained
to the allocating memcg hierarchy so unrelated lruvecs wouldn't be
disrupted.

--
Michal Hocko
SUSE Labs

2022-12-14 13:02:11

by Johannes Weiner

Subject: Re: memcg reclaim demotion wrt. isolation

On Wed, Dec 14, 2022 at 10:42:56AM +0100, Michal Hocko wrote:
> On Tue 13-12-22 17:14:48, Johannes Weiner wrote:
> > On Tue, Dec 13, 2022 at 04:41:10PM +0100, Michal Hocko wrote:
> > > Hi,
> > > I have just noticed that pages allocated for demotion targets
> > > include __GFP_KSWAPD_RECLAIM (through GFP_NOWAIT). This is the case
> > > since the code has been introduced by 26aa2d199d6f ("mm/migrate: demote
> > > pages during reclaim"). I suspect the intention is to trigger the aging
> > > on the fallback node and either drop or further demote oldest pages.
> > >
> > > This makes sense, but I suspect this wasn't intended for
> > > memcg-triggered reclaim as well. This would mean that memory pressure in one
> > > hierarchy could trigger paging out pages of a different hierarchy if the
> > > demotion target is close to full.
> >
> > This is also true if you don't do demotion. If a cgroup tries to
> > allocate memory on a full node (i.e. mbind()), it may wake kswapd or
> > enter global reclaim directly which may push out the memory of other
> > cgroups, regardless of the respective cgroup limits.
>
> You are right on this. But this is describing a slightly different
> situation IMO.
>
> > The demotion allocations don't strike me as any different. They're
> > just allocations on behalf of a cgroup. I would expect them to wake
> > kswapd and reclaim physical memory as needed.
>
> I am not sure this is an expected behavior. Consider the currently
> discussed memory.demote interface where the userspace can trigger
> (almost) arbitrary demotions. This can deplete fallback nodes without
> over-committing the memory overall yet push out demoted memory from
> other workloads. From the user POV it would look like reclaim while
> the overall memory is far from depleted, so it would be considered
> premature and warrant a bug report.
>
> The reclaim behavior would make more sense to me if it was constrained
> to the allocating memcg hierarchy so unrelated lruvecs wouldn't be
> disrupted.

What if the second tier is full, and the memcg you're trying to demote
doesn't have any pages to vacate on that tier yet? Will it fail to
demote?

Does that mean that a shared second tier node is only usable for the
cgroup that demotes to it first? And demotion stops for everybody else
until that cgroup vacates the node voluntarily?

As you can see, these would be unprecedented and quite surprising
first-come-first-serve memory protection semantics.

The only way to prevent cgroups from disrupting each other on NUMA
nodes is NUMA constraints. Cgroup per-node limits. That shields not
only from demotion, but also from DoS-mbinding, or aggressive
promotion. All of these can result in some form of premature
reclaim/demotion, proactive demotion isn't special in that way.

The default behavior for cgroups is that without limits or
protections, resource access is unconstrained and competitive. Without
NUMA constraints, it's very much expected that cgroups compete over
nodes, and that the hottest pages win out. Per aging rules, freshly
demoted pages are hotter than anything else on the target node, so they
should displace accordingly.

Consider the case where you have two lower tier nodes and there is
cpuset isolation for the main workloads, but some maintenance thing
runs and pollutes one of the lower tier nodes. Or consider the case
where a shared lower tier node is divvied up between two cgroups using
protection settings to allow overcommit, i.e. per-node memory.low.

Demotions, proactive or not, MUST do global reclaim on a full node.

2022-12-14 15:48:47

by Michal Hocko

Subject: Re: memcg reclaim demotion wrt. isolation

On Wed 14-12-22 13:40:33, Johannes Weiner wrote:
> On Wed, Dec 14, 2022 at 10:42:56AM +0100, Michal Hocko wrote:
[...]
> > The reclaim behavior would make more sense to me if it was constrained
> > to the allocating memcg hierarchy so unrelated lruvecs wouldn't be
> > disrupted.
>
> What if the second tier is full, and the memcg you're trying to demote
> doesn't have any pages to vacate on that tier yet? Will it fail to
> demote?
>
> Does that mean that a shared second tier node is only usable for the
> cgroup that demotes to it first? And demotion stops for everybody else
> until that cgroup vacates the node voluntarily?
>
> As you can see, these would be unprecedented and quite surprising
> first-come-first-serve memory protection semantics.

This is a very good example!

> The only way to prevent cgroups from disrupting each other on NUMA
> nodes is NUMA constraints. Cgroup per-node limits. That shields not
> only from demotion, but also from DoS-mbinding, or aggressive
> promotion. All of these can result in some form of premature
> reclaim/demotion, proactive demotion isn't special in that way.

Any NUMA-based balancing is a real challenge with the memcg semantics. I do
not see per-NUMA-node memcg limits without a major overhaul of how we do
charging, though. I am not sure this is on the table even long term.
Unless I am really missing something here, we have to live with the
existing semantics for the foreseeable future.

> The default behavior for cgroups is that without limits or
> protections, resource access is unconstrained and competitive. Without
> NUMA constraints, it's very much expected that cgroups compete over
> nodes, and that the hottest pages win out. Per aging rules, freshly
> demoted pages are hotter than anything else on the target node, so they
> should displace accordingly.

That is certainly a way to look at it, but I would really emphasise
that this competition depends quite significantly on the higher-level
balancing on top. Memory allocations fall back to different nodes, so the
resource distribution should be roughly even in this case. If there is
competition then it most likely means our resources are overcommitted.

The picture is slightly different with demotion for memory tiering
IMHO, because it spills internal resource contention, or explicit
user space balancing (via pro-active reclaim/demotion), outside of the
hierarchy: it creates pressure on the demotion target, which is a shared
resource as you have mentioned above.

> Consider the case where you have two lower tier nodes and there is
> cpuset isolation for the main workloads, but some maintenance thing
> runs and pollutes one of the lower tier nodes.

Well, this is not really much different from a regular NUMA system where
node-aware and constrained workloads compete with NUMA-unconstrained
workloads. This has never worked.

> Or consider the case
> where a shared lower tier node is divvied up between two cgroups using
> protection settings to allow overcommit, i.e. per-node memory.low.

> Demotions, proactive or not, MUST do global reclaim on a full node.

OK, but my concern is how to implement any userspace policy around that
behavior. If you see demotion failures then you can trigger some
rebalancing explicitly. If those are silent then your only option left
is to check the capacity of the demotion target regularly and play a
catch-up game. Is this sufficient?
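
To be concrete, the catch-up game would be a userspace loop along these
lines, polling the demotion target's free memory and triggering some
rebalancing when it runs low (the node id and threshold are made up for
the illustration):

/* sketch only: monitor a demotion target node from userspace */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long node_free_kb(int nid)
{
        char path[64], line[256];
        long kb = -1;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/meminfo", nid);
        f = fopen(path, "r");
        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                char *p = strstr(line, "MemFree:");
                if (p && sscanf(p, "MemFree: %ld kB", &kb) == 1)
                        break;
        }
        fclose(f);
        return kb;
}

int main(void)
{
        for (;;) {
                long free_kb = node_free_kb(2);   /* node 2 = demotion target */

                if (free_kb >= 0 && free_kb < 512 * 1024)   /* 512MB floor */
                        fprintf(stderr, "demotion target low, rebalance\n");
                sleep(10);
        }
}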

All that being said, I can see that both approaches result in some
corner cases. I do agree that starvation is likely an easier scenario
than an actively evil container disrupting another container by pushing
its demoted pages out.

So scratch the patch.

Thanks
--
Michal Hocko
SUSE Labs

2022-12-14 17:55:58

by Johannes Weiner

Subject: Re: memcg reclaim demotion wrt. isolation

Hey Michal,

On Wed, Dec 14, 2022 at 04:29:06PM +0100, Michal Hocko wrote:
> On Wed 14-12-22 13:40:33, Johannes Weiner wrote:
> > The only way to prevent cgroups from disrupting each other on NUMA
> > nodes is NUMA constraints. Cgroup per-node limits. That shields not
> > only from demotion, but also from DoS-mbinding, or aggressive
> > promotion. All of these can result in some form of premature
> > reclaim/demotion, proactive demotion isn't special in that way.
>
> Any NUMA-based balancing is a real challenge with the memcg semantics. I do
> not see per-NUMA-node memcg limits without a major overhaul of how we do
> charging, though. I am not sure this is on the table even long term.
> Unless I am really missing something here, we have to live with the
> existing semantics for the foreseeable future.

Yes, I think you're quite right.

We've been mostly skirting the NUMA issue in cgroups (and to a degree
in MM code in general) with two possible answers:

a) The NUMA distances are close enough that we ignore it and pretend
all memory is (mostly) fungible.

b) The NUMA distances are big enough that it matters, in which case
the best option is to avoid sharing, and use bindings to keep
workloads/containers isolated to their own CPU+memory domains.

Tiered memory forces the issue by providing memory that must be shared
between workloads/containers, but is not fungible. At least not
without incurring priority inversions between containers, where a
lopri container promotes itself to the top and demotes the hipri
workload, while staying happily within its global memory allowance.

This applies to mbind() cases as much as it does to NUMA balancing.

If these setups proliferate, it seems inevitable to me that sooner or
later the full problem space of memory cgroups - dividing up a shared
resource while allowing overcommit - applies not just to "RAM as a
whole", but to each memory tier individually.

Whether we need the full memcg interface per tier or per node, I'm not
sure. It might be enough to automatically apportion global allowances
to nodes; so if you have 32G toptier and 16G lowtier, and a cgroup has
a 20G allowance, it gets 13G on top and 7G on low.
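
Spelled out, that would just be a pro-rata share of each tier; purely
illustrative arithmetic:

        tier_allowance = cgroup_allowance * tier_size / total_size
        toptier: 20G * 32 / (32 + 16) ~= 13.3G
        lowtier: 20G * 16 / (32 + 16) ~=  6.7G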

(That, or we settle on multi-socket systems with private tiers, such
that memory continues to be unshared :-)

Either way, I expect this issue will keep coming up as we try to use
containers on such systems.

2022-12-15 06:40:30

by Huang, Ying

Subject: Re: memcg reclaim demotion wrt. isolation

Michal Hocko <[email protected]> writes:

> On Tue 13-12-22 17:14:48, Johannes Weiner wrote:
>> On Tue, Dec 13, 2022 at 04:41:10PM +0100, Michal Hocko wrote:
>> > Hi,
>> > I have just noticed that pages allocated for demotion targets
>> > include __GFP_KSWAPD_RECLAIM (through GFP_NOWAIT). This is the case
>> > since the code has been introduced by 26aa2d199d6f ("mm/migrate: demote
>> > pages during reclaim"). I suspect the intention is to trigger the aging
>> > on the fallback node and either drop or further demote oldest pages.
>> >
>> > This makes sense, but I suspect this wasn't intended for
>> > memcg-triggered reclaim as well. This would mean that memory pressure in one
>> > hierarchy could trigger paging out pages of a different hierarchy if the
>> > demotion target is close to full.
>>
>> This is also true if you don't do demotion. If a cgroup tries to
>> allocate memory on a full node (i.e. mbind()), it may wake kswapd or
>> enter global reclaim directly which may push out the memory of other
>> cgroups, regardless of the respective cgroup limits.
>
> You are right on this. But this is describing a slightly different
> situation IMO.
>
>> The demotion allocations don't strike me as any different. They're
>> just allocations on behalf of a cgroup. I would expect them to wake
>> kswapd and reclaim physical memory as needed.
>
> I am not sure this is an expected behavior. Consider the currently
> discussed memory.demote interface where the userspace can trigger
> (almost) arbitrary demotions. This can deplete fallback nodes without
> over-committing the memory overall yet push out demoted memory from
> other workloads. From the user POV it would look like reclaim while
> the overall memory is far from depleted, so it would be considered
> premature and warrant a bug report.
>
> The reclaim behavior would make more sense to me if it was constrained
> to the allocating memcg hierarchy so unrelated lruvecs wouldn't be
> disrupted.

When we reclaim/demote some pages from a memcg proactively, what is our
goal? To free up some memory in this memcg for other memcgs to use? If
so, it sounds reasonable to keep as many pages of other memcgs as
possible.

Best Regards,
Huang, Ying

2022-12-15 08:31:43

by Johannes Weiner

Subject: Re: memcg reclaim demotion wrt. isolation

On Thu, Dec 15, 2022 at 02:17:13PM +0800, Huang, Ying wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Tue 13-12-22 17:14:48, Johannes Weiner wrote:
> >> On Tue, Dec 13, 2022 at 04:41:10PM +0100, Michal Hocko wrote:
> >> > Hi,
> >> > I have just noticed that pages allocated for demotion targets
> >> > include __GFP_KSWAPD_RECLAIM (through GFP_NOWAIT). This is the case
> >> > since the code has been introduced by 26aa2d199d6f ("mm/migrate: demote
> >> > pages during reclaim"). I suspect the intention is to trigger the aging
> >> > on the fallback node and either drop or further demote oldest pages.
> >> >
> >> > This makes sense, but I suspect this wasn't intended for
> >> > memcg-triggered reclaim as well. This would mean that memory pressure in one
> >> > hierarchy could trigger paging out pages of a different hierarchy if the
> >> > demotion target is close to full.
> >>
> >> This is also true if you don't do demotion. If a cgroup tries to
> >> allocate memory on a full node (i.e. mbind()), it may wake kswapd or
> >> enter global reclaim directly which may push out the memory of other
> >> cgroups, regardless of the respective cgroup limits.
> >
> > You are right on this. But this is describing a slightly different
> > situation IMO.
> >
> >> The demotion allocations don't strike me as any different. They're
> >> just allocations on behalf of a cgroup. I would expect them to wake
> >> kswapd and reclaim physical memory as needed.
> >
> > I am not sure this is an expected behavior. Consider the currently
> > discussed memory.demote interface where the userspace can trigger
> > (almost) arbitrary demotions. This can deplete fallback nodes without
> > over-committing the memory overall yet push out demoted memory from
> > other workloads. From the user POV it would look like reclaim while
> > the overall memory is far from depleted, so it would be considered
> > premature and warrant a bug report.
> >
> > The reclaim behavior would make more sense to me if it was constrained
> > to the allocating memcg hierarchy so unrelated lruvecs wouldn't be
> > disrupted.
>
> When we reclaim/demote some pages from a memcg proactively, what is our
> goal? To free up some memory in this memcg for other memcgs to use? If
> so, it sounds reasonable to keep as many pages of other memcgs as
> possible.

The goal of proactive aging is to free up any resources that aren't
needed to meet the SLAs (e.g. end-to-end response time of webserver).
Meaning, to run things as leanly as possible within spec. Into that
free space, another container can then be co-located.

This means that the goal is to free up as many resources as possible,
starting with the coveted hightier. If a container has been using
all-hightier memory but is able to demote to lowtier, there are 3 options
for existing memory in the lower tier:

1) Colder/stale memory - should be displaced

2) Memory that can be promoted once the hightier is free -
reclaim/demotion of the coldest pages needs to happen at least
temporarily, or the tier swap is in a stalemate.

3) Equally hot memory - if this exceeds capacity of the lower tier,
the hottest overall pages should stay, the excess demoted/reclaimed.

You can't know what scenario you're in until you put the demoted pages
in direct LRU competition with what's already there. And in all three
scenarios, direct LRU competition also produces the optimal outcome.

2022-12-16 04:32:25

by Huang, Ying

Subject: Re: memcg reclaim demotion wrt. isolation

Johannes Weiner <[email protected]> writes:

> On Thu, Dec 15, 2022 at 02:17:13PM +0800, Huang, Ying wrote:
>> Michal Hocko <[email protected]> writes:
>>
>> > On Tue 13-12-22 17:14:48, Johannes Weiner wrote:
>> >> On Tue, Dec 13, 2022 at 04:41:10PM +0100, Michal Hocko wrote:
>> >> > Hi,
>> >> > I have just noticed that pages allocated for demotion targets
>> >> > include __GFP_KSWAPD_RECLAIM (through GFP_NOWAIT). This is the case
>> >> > since the code has been introduced by 26aa2d199d6f ("mm/migrate: demote
>> >> > pages during reclaim"). I suspect the intention is to trigger the aging
>> >> > on the fallback node and either drop or further demote oldest pages.
>> >> >
>> >> > This makes sense, but I suspect this wasn't intended for
>> >> > memcg-triggered reclaim as well. This would mean that memory pressure in one
>> >> > hierarchy could trigger paging out pages of a different hierarchy if the
>> >> > demotion target is close to full.
>> >>
>> >> This is also true if you don't do demotion. If a cgroup tries to
>> >> allocate memory on a full node (i.e. mbind()), it may wake kswapd or
>> >> enter global reclaim directly which may push out the memory of other
>> >> cgroups, regardless of the respective cgroup limits.
>> >
>> > You are right on this. But this is describing a slightly different
>> > situation IMO.
>> >
>> >> The demotion allocations don't strike me as any different. They're
>> >> just allocations on behalf of a cgroup. I would expect them to wake
>> >> kswapd and reclaim physical memory as needed.
>> >
>> > I am not sure this is an expected behavior. Consider the currently
>> > discussed memory.demote interface where the userspace can trigger
>> > (almost) arbitrary demotions. This can deplete fallback nodes without
>> > over-committing the memory overall yet push out demoted memory from
>> > other workloads. From the user POV it would look like reclaim while
>> > the overall memory is far from depleted, so it would be considered
>> > premature and warrant a bug report.
>> >
>> > The reclaim behavior would make more sense to me if it was constrained
>> > to the allocating memcg hierarchy so unrelated lruvecs wouldn't be
>> > disrupted.
>>
>> When we reclaim/demote some pages from a memcg proactively, what is our
>> goal? To free up some memory in this memcg for other memcgs to use? If
>> so, it sounds reasonable to keep as many pages of other memcgs as
>> possible.
>
> The goal of proactive aging is to free up any resources that aren't
> needed to meet the SLAs (e.g. end-to-end response time of webserver).
> Meaning, to run things as leanly as possible within spec. Into that
> free space, another container can then be co-located.
>
> This means that the goal is to free up as many resources as possible,
> starting with the coveted hightier. If a container has been using
> all-hightier memory but is able to demote to lowtier, there are 3 options
> for existing memory in the lower tier:
>
> 1) Colder/stale memory - should be displaced
>
> 2) Memory that can be promoted once the hightier is free -
> reclaim/demotion of the coldest pages needs to happen at least
> temporarily, or the tier swap is in a stalemate.
>
> 3) Equally hot memory - if this exceeds capacity of the lower tier,
> the hottest overall pages should stay, the excess demoted/reclaimed.
>
> You can't know what scenario you're in until you put the demoted pages
> in direct LRU competition with what's already there. And in all three
> scenarios, direct LRU competition also produces the optimal outcome.

If my understanding is correct, your preferred semantics is to be
memcg-specific in the higher tier, and global in the lower tier.

Another choice is to add another global "memory.reclaim" knob, for
example as /sys/devices/virtual/memory_tiering/memory_tier<N>/memory.reclaim.
Then we can first trigger global memory reclaim in the lower tiers, and then
trigger memcg-specific memory reclaim in the higher tier for the specified
memcg.

The con of this choice is that you need 2 steps to finish the work.
The pro is that you don't need to combine memcg-specific and global
behavior in one interface.

Best Regards,
Huang, Ying