2022-03-24 03:41:28

by Michal Koutný

[permalink] [raw]
Subject: [RFC PATCH] mm: memcg: Do not count memory.low reclaim if it does not happen

This was observed with memcontrol selftest/new LTP test but can be also
reproduced in simplified setup of two siblings:

`parent .low=50M
` s1 .low=50M .current=50M+ε
` s2 .low=0M .current=50M

The expectation is that s2/memory.events:low will be zero under outer
reclaimer since no protection should be given to cgroup s2 (even with
memory_recursiveprot).

However, this does not happen. The apparent reason is that when s1 is
considered for (proportional) reclaim the scanned proportion is rounded
up to SWAP_CLUSTER_MAX and slightly over-proportional amount is
reclaimed. Consequently, when the effective low value of s2 is
calculated, it observes unclaimed parent's protection from s1
(ε-SWAP_CLUSTER_MAX in theory) and effectively appropriates it.
The effect is slightly regularized protection (workload dependent)
between siblings and misreported MEMCG_LOW event when reclaiming s2 with
this protection.

Fix the behavior by not reporting breached memory.low in such
situations. (This affects also setups where all siblings have
memory.low=0, parent's memory.events:low will still be non-zero when
parent's memory.low is breached but it will be reduced by the events
originated in children.)

Fixes: 8a931f801340 ("mm: memcontrol: recursive memory.low protection")
Reported-by: Richard Palethorpe <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Michal Koutný <[email protected]>
---
include/linux/memcontrol.h | 8 ++++----
mm/vmscan.c | 5 +++--
2 files changed, 7 insertions(+), 6 deletions(-)

Why is this RFC?

1) It changes number of events observed on parent/memory.events:low (especially
for truly recursive configs where all children specify memory.low=0).
IIUC past discussions about equality of all-zeros and all-infinities, those
eagerly reported MEMCG_LOW events (in latter case) were deemed skewing the
stats [1].
2) The observed behavior slightly impacts distribution of parent's memory.low.
Constructed example is a passive protected workload in s1 and active in s2
(active ~ counteracts the reclaim with allocations). It could strip
protection from s1 one by one (one:=SWAP_CLUSTER_MAX/2^sc.priority).
That may be considered both wrong (s1 should have been more protected) or
correct s2 deserves protection due to its activity.
I don't have (didn't collect) data for this, so I think just masking the
false events is sufficient (or independent).

[1] https://lore.kernel.org/r/[email protected]

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0abbd685703b..99ac72e00bff 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -626,13 +626,13 @@ static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)

}

-static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg, bool effective)
{
if (!mem_cgroup_supports_protection(memcg))
return false;

- return READ_ONCE(memcg->memory.elow) >=
- page_counter_read(&memcg->memory);
+ return page_counter_read(&memcg->memory) <= (effective ?
+ READ_ONCE(memcg->memory.elow) : READ_ONCE(memcg->memory.low));
}

static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
@@ -1177,7 +1177,7 @@ static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
{
}

-static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg, bool effective)
{
return false;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 59b14e0d696c..3bdb35d6bee6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3152,7 +3152,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
* If there is no reclaimable memory, OOM.
*/
continue;
- } else if (mem_cgroup_below_low(memcg)) {
+ } else if (mem_cgroup_below_low(memcg, true)) {
/*
* Soft protection.
* Respect the protection only as long as
@@ -3163,7 +3163,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
sc->memcg_low_skipped = 1;
continue;
}
- memcg_memory_event(memcg, MEMCG_LOW);
+ if (mem_cgroup_below_low(memcg, false))
+ memcg_memory_event(memcg, MEMCG_LOW);
}

reclaimed = sc->nr_reclaimed;
--
2.35.1


2022-03-24 03:41:28

by Roman Gushchin

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: memcg: Do not count memory.low reclaim if it does not happen

On Tue, Mar 22, 2022 at 07:22:48PM +0100, Michal Koutny wrote:
> This was observed with memcontrol selftest/new LTP test but can be also
> reproduced in simplified setup of two siblings:
>
> `parent .low=50M
> ` s1 .low=50M .current=50M+ε
> ` s2 .low=0M .current=50M
>
> The expectation is that s2/memory.events:low will be zero under outer
> reclaimer since no protection should be given to cgroup s2 (even with
> memory_recursiveprot).
>
> However, this does not happen. The apparent reason is that when s1 is
> considered for (proportional) reclaim the scanned proportion is rounded
> up to SWAP_CLUSTER_MAX and slightly over-proportional amount is
> reclaimed. Consequently, when the effective low value of s2 is
> calculated, it observes unclaimed parent's protection from s1
> (ε-SWAP_CLUSTER_MAX in theory) and effectively appropriates it.
> The effect is slightly regularized protection (workload dependent)
> between siblings and misreported MEMCG_LOW event when reclaiming s2 with
> this protection.
>
> Fix the behavior by not reporting breached memory.low in such
> situations. (This affects also setups where all siblings have
> memory.low=0, parent's memory.events:low will still be non-zero when
> parent's memory.low is breached but it will be reduced by the events
> originated in children.)
>
> Fixes: 8a931f801340 ("mm: memcontrol: recursive memory.low protection")
> Reported-by: Richard Palethorpe <[email protected]>
> Link: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Michal Koutný <[email protected]>

Hi Michal!

Does it mean that in the following configuration:
`parent .low=50M
` s1 .low=0M .current=50M
` s2 .low=0M .current=50M
there will be no memory.events::low at all? (assuming the recursive thing is on)

Thanks!

2022-03-25 15:07:13

by Michal Koutný

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: memcg: Do not count memory.low reclaim if it does not happen

On Wed, Mar 23, 2022 at 02:44:24PM -0700, Roman Gushchin <[email protected]> wrote:
> Does it mean that in the following configuration:
> `parent .low=50M
> ` s1 .low=0M .current=50M
> ` s2 .low=0M .current=50M
> there will be no memory.events::low at all? (assuming the recursive thing is on)

True, no memory.events:low among siblings.
Number of memory.events:low in the parent depends on how much has to be
reclaimed (>50M means carving into parent's protection, hence it'll be
counted).

This is a quantitative change in the events reporting (point 1 of
RFCness), my understanding is that the potential events due to recursive
surplus protection carry no new information regarding configured
memory.low.


Michal

2022-03-25 19:03:34

by Michal Koutný

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: memcg: Do not count memory.low reclaim if it does not happen

On Thu, Mar 24, 2022 at 11:17:14AM -0700, Roman Gushchin <[email protected]> wrote:
> Ok, so it’s not really about the implementation details of the reclaim
> mechanism (I mean rounding up to the batch size etc),

Actually, that was what I deemed more serious first.
It's the point 2 of RFCness:

| 2) The observed behavior slightly impacts distribution of parent's memory.low.
| Constructed example is a passive protected workload in s1 and active in s2
| (active ~ counteracts the reclaim with allocations). It could strip
| protection from s1 one by one (one:=SWAP_CLUSTER_MAX/2^sc.priority).
| That may be considered both wrong (s1 should have been more protected) or
| correct s2 deserves protection due to its activity.
| I don't have (didn't collect) data for this, so I think just masking the
| false events is sufficient (or independent).

> Idk, I don’t have a strong argument against this change (except that
> it changes the existing behavior), but I also don’t see why such
> events are harmful. Do you mind elaborating a bit more?

So I've collected some demo data now.

systemd-run \
-u precious.service --slice=test-protected.slice \
-p MemoryLow=50M \
/root/memeater 50 # allocates 50M anon, doesn't use it

systemd-run \
-u victim.service --slice=test-protected.slice \
-p MemoryLow=0M \
/root/memeater -m 50 50 # allocates 50M anon, uses it

echo "Started workloads"

systemctl set-property --runtime test.slice MemoryMax=200M
systemctl set-property --runtime test-protected.slice MemoryLow=50M

sleep 5

systemd-run \
-u pressure.service --slice=test.slice \
-p MemorySwapMax=0M \ # to push test-protected.slice to swap
/root/memeater -m 170 170

sleep 5
systemd-cgtop -b -1 -m test.slice

Result with memory_recursiveprot

> Control Group Tasks %CPU Memory Input/s Output/s
> test.slice 3 - 199.9M - -
> test.slice/pressure.service 1 - 170.5M - -
> test.slice/test-protected.slice 2 - 29.4M - -
> test.slice/test-protected.slice/victim.service 1 - 29.1M - -
> test.slice/test-protected.slice/precious.service 1 - 292.0K - -

Result without memory_recursiveprot

> Control Group Tasks %CPU Memory Input/s Output/s
> test.slice 3 - 199.8M - -
> test.slice/pressure.service 1 - 170.5M - -
> test.slice/test-protected.slice 2 - 29.3M - -
> test.slice/test-protected.slice/precious.service 1 - 28.7M - -
> test.slice/test-protected.slice/victim.service 1 - 560.0K - -

(kernel 5.17.0, systemd 249.10)

So with this result, I'd say the event reporting is an independent change
(admiteddly, thanks to the current implementation (not the proposal of
mine) I noticed this issue).
/me scratches head, let me review my other approaches...


Michal

2022-03-25 19:58:54

by Roman Gushchin

[permalink] [raw]
Subject: Re: [RFC PATCH] mm: memcg: Do not count memory.low reclaim if it does not happen


> On Mar 24, 2022, at 2:52 AM, Michal Koutný <[email protected]> wrote:
>
> On Wed, Mar 23, 2022 at 02:44:24PM -0700, Roman Gushchin <[email protected]> wrote:
>> Does it mean that in the following configuration:
>> `parent .low=50M
>> ` s1 .low=0M .current=50M
>> ` s2 .low=0M .current=50M
>> there will be no memory.events::low at all? (assuming the recursive thing is on)
>
> True, no memory.events:low among siblings.
> Number of memory.events:low in the parent depends on how much has to be
> reclaimed (>50M means carving into parent's protection, hence it'll be
> counted).

Ok, so it’s not really about the implementation details of the reclaim mechanism (I mean rounding up to the batch size etc), it’s a more generic change: do not generate low events for cgroups not explicitly protected by a non-zero memory.low value.

Idk, I don’t have a strong argument against this change (except that it changes the existing behavior), but I also don’t see why such events are harmful. Do you mind elaborating a bit more?

Thank you!