2021-02-17 21:44:35

by Tim Chen

Subject: [PATCH v2 0/3] Soft limit memory management bug fixes

During testing of tiered memory management based on the memory soft limit,
I found three issues with memory management using cgroup-based soft
limits in the mainline code. The three patches in this series fix these
issues. Patch 3 has also been updated per Johannes' comments on the first
version of that patch.

Thanks.

Tim

Changelog:
v2
1. Do the soft limit tree uncharge update in batches of the same node only
for v1 cgroups that have a soft limit. Per-node batching is only
relevant for cgroup v1, which has a per-node soft limit tree.

Tim Chen (3):
mm: Fix dropped memcg from mem cgroup soft limit tree
mm: Force update of mem cgroup soft limit tree on usage excess
mm: Fix missing mem cgroup soft limit tree updates

mm/memcontrol.c | 27 +++++++++++++++++++++++----
1 file changed, 23 insertions(+), 4 deletions(-)

--
2.20.1


2021-02-17 21:46:28

by Tim Chen

Subject: [PATCH v2 2/3] mm: Force update of mem cgroup soft limit tree on usage excess

To rate limit updates to the mem cgroup soft limit tree, we only perform
updates every SOFTLIMIT_EVENTS_TARGET (defined as 1024) memory events.

However, these sampling based updates may miss a critical update: i.e. when
the mem cgroup first exceeds its limit but is not yet on the soft limit tree.
It should be on the tree at that point so it can be subjected to soft
limit page reclaim. If the mem cgroup has few memory events compared with
other mem cgroups, we may not update it and place it on the mem cgroup
soft limit tree for many memory events. Its excess usage can then creep
up while the mem cgroup stays hidden from soft limit page reclaim for a
long time.

Fix this issue by forcing an update to the mem cgroup soft limit tree if a
mem cgroup has exceeded its memory soft limit but it is not on the mem
cgroup soft limit tree.

Reviewed-by: Ying Huang <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
---
mm/memcontrol.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a51bf90732cb..d72449eeb85a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -985,15 +985,22 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
  */
 static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 {
+	struct mem_cgroup_per_node *mz;
+	bool force_update = false;
+
+	mz = mem_cgroup_nodeinfo(memcg, page_to_nid(page));
+	if (mz && !mz->on_tree && soft_limit_excess(mz->memcg) > 0)
+		force_update = true;
+
 	/* threshold event is triggered in finer grain than soft limit */
-	if (unlikely(mem_cgroup_event_ratelimit(memcg,
+	if (unlikely((force_update) || mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_THRESH))) {
 		bool do_softlimit;
 
 		do_softlimit = mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_SOFTLIMIT);
 		mem_cgroup_threshold(memcg);
-		if (unlikely(do_softlimit))
+		if (unlikely((force_update) || do_softlimit))
 			mem_cgroup_update_tree(memcg, page);
 	}
 }
--
2.20.1

2021-02-17 21:46:48

by Tim Chen

Subject: [PATCH v2 1/3] mm: Fix dropped memcg from mem cgroup soft limit tree

During soft limit memory reclaim, we will temporarily remove the target
mem cgroup from the cgroup soft limit tree. We then perform memory
reclaim, update the memory usage excess count and re-insert the mem
cgroup back into the mem cgroup soft limit tree according to the new
memory usage excess count.

However, when memory reclaim fails for the maximum number of attempts
and we bail out of the reclaim loop, we forget to put the target mem
cgroup chosen for the next reclaim back on the soft limit tree. This prevents
pages in the mem cgroup from being reclaimed in the future even though
the mem cgroup exceeds its soft limit. Fix the logic and put the mem
cgroup back on the tree when page reclaim fails for the mem cgroup.

Reviewed-by: Ying Huang <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
---
mm/memcontrol.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ed5cc78a8dbf..a51bf90732cb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3505,8 +3505,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 				    loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
 			break;
 	} while (!nr_reclaimed);
-	if (next_mz)
+	if (next_mz) {
+		spin_lock_irq(&mctz->lock);
+		__mem_cgroup_insert_exceeded(next_mz, mctz, excess);
+		spin_unlock_irq(&mctz->lock);
 		css_put(&next_mz->memcg->css);
+	}
 	return nr_reclaimed;
 }

--
2.20.1

2021-02-17 21:49:05

by Tim Chen

Subject: [PATCH v2 3/3] mm: Fix missing mem cgroup soft limit tree updates

On a per node basis, the mem cgroup soft limit tree on each node tracks
how much a cgroup has exceeded its memory soft limit and sorts
the cgroups by their excess usage. On page release, the trees are not
updated right away; we wait until we have gathered a batch of pages belonging to
the same cgroup. This reduces the frequency of updating the soft limit tree
and of locking the tree and the associated cgroup.

However, the batch of pages could contain pages from multiple nodes while
only the soft limit tree of one node would get updated. Change the
logic so that we update the tree in batches of pages, with each batch of
pages all from the same mem cgroup and memory node. An update is issued for
the batch of pages of a node collected so far whenever we encounter
a page belonging to a different node. Note that this same-node batching
logic is only relevant for v1 cgroups that have a memory soft limit.

Reviewed-by: Ying Huang <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
---
mm/memcontrol.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d72449eeb85a..8bddee75f5cb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6804,6 +6804,7 @@ struct uncharge_gather {
 	unsigned long pgpgout;
 	unsigned long nr_kmem;
 	struct page *dummy_page;
+	int nid;
 };
 
 static inline void uncharge_gather_clear(struct uncharge_gather *ug)
@@ -6849,7 +6850,13 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 	 * exclusive access to the page.
 	 */
 
-	if (ug->memcg != page_memcg(page)) {
+	if (ug->memcg != page_memcg(page) ||
+	    /*
+	     * Update soft limit tree used in v1 cgroup in page batch for
+	     * the same node. Relevant only to v1 cgroup with a soft limit.
+	     */
+	    (ug->dummy_page && ug->nid != page_to_nid(page) &&
+	     ug->memcg->soft_limit != PAGE_COUNTER_MAX)) {
 		if (ug->memcg) {
 			uncharge_batch(ug);
 			uncharge_gather_clear(ug);
@@ -6869,6 +6876,7 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 	ug->pgpgout++;
 
 	ug->dummy_page = page;
+	ug->nid = page_to_nid(page);
 	page->memcg_data = 0;
 	css_put(&ug->memcg->css);
 }
--
2.20.1

2021-02-18 06:20:42

by Johannes Weiner

Subject: Re: [PATCH v2 3/3] mm: Fix missing mem cgroup soft limit tree updates

On Wed, Feb 17, 2021 at 12:41:36PM -0800, Tim Chen wrote:
> On a per node basis, the mem cgroup soft limit tree on each node tracks
> how much a cgroup has exceeded its soft limit memory limit and sorts
> the cgroup by its excess usage. On page release, the trees are not
> updated right away, until we have gathered a batch of pages belonging to
> the same cgroup. This reduces the frequency of updating the soft limit tree
> and locking of the tree and associated cgroup.
>
> However, the batch of pages could contain pages from multiple nodes but
> only the soft limit tree from one node would get updated. Change the
> logic so that we update the tree in batch of pages, with each batch of
> pages all in the same mem cgroup and memory node. An update is issued for
> the batch of pages of a node collected till now whenever we encounter
> a page belonging to a different node. Note that this batching for
> the same node logic is only relevant for v1 cgroup that has a memory
> soft limit.
>
> Reviewed-by: Ying Huang <[email protected]>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> mm/memcontrol.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d72449eeb85a..8bddee75f5cb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6804,6 +6804,7 @@ struct uncharge_gather {
> unsigned long pgpgout;
> unsigned long nr_kmem;
> struct page *dummy_page;
> + int nid;
> };
>
> static inline void uncharge_gather_clear(struct uncharge_gather *ug)
> @@ -6849,7 +6850,13 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> * exclusive access to the page.
> */
>
> - if (ug->memcg != page_memcg(page)) {
> + if (ug->memcg != page_memcg(page) ||
> + /*
> + * Update soft limit tree used in v1 cgroup in page batch for
> + * the same node. Relevant only to v1 cgroup with a soft limit.
> + */
> + (ug->dummy_page && ug->nid != page_to_nid(page) &&
> + ug->memcg->soft_limit != PAGE_COUNTER_MAX)) {

Sorry, I used weird phrasing in my last email.

Can you please preface the checks you're adding with a
!cgroup_subsys_on_dfl(memory_cgrp_subsys) to static branch for
cgroup1? The uncharge path is pretty hot, and this would avoid the
runtime overhead on cgroup2 at least, which doesn't have the SL.

Also, do we need the ug->dummy_page check? It's only NULL on the first
loop - where ug->memcg is NULL as well and the branch is taken anyway.

The soft limit check is also slightly cheaper than the nid check, as
page_to_nid() might be out-of-line, so we should do it first. This?

	/*
	 * Batch-uncharge all pages of the same memcg.
	 *
	 * Unless we're looking at a cgroup1 with a softlimit
	 * set: the soft limit trees are maintained per-node
	 * and updated on uncharge (via dummy_page), so keep
	 * batches confined to a single node as well.
	 */
	if (ug->memcg != page_memcg(page) ||
	    (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
	     ug->memcg->soft_limit != PAGE_COUNTER_MAX &&
	     ug->nid != page_to_nid(page)))

2021-02-18 10:01:09

by Michal Hocko

Subject: Re: [PATCH v2 1/3] mm: Fix dropped memcg from mem cgroup soft limit tree

On Wed 17-02-21 12:41:34, Tim Chen wrote:
> During soft limit memory reclaim, we will temporarily remove the target
> mem cgroup from the cgroup soft limit tree. We then perform memory
> reclaim, update the memory usage excess count and re-insert the mem
> cgroup back into the mem cgroup soft limit tree according to the new
> memory usage excess count.
>
> However, when memory reclaim failed for a maximum number of attempts
> and we bail out of the reclaim loop, we forgot to put the target mem
> cgroup chosen for next reclaim back to the soft limit tree. This prevented
> pages in the mem cgroup from being reclaimed in the future even though
> the mem cgroup exceeded its soft limit. Fix the logic and put the mem
> cgroup back on the tree when page reclaim failed for the mem cgroup.
>
> Reviewed-by: Ying Huang <[email protected]>
> Signed-off-by: Tim Chen <[email protected]>

I have already acked this patch in the previous version along with Fixes
tag. It seems that my review feedback has been completely ignored also
for other patches in this series.

> ---
> mm/memcontrol.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ed5cc78a8dbf..a51bf90732cb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3505,8 +3505,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> break;
> } while (!nr_reclaimed);
> - if (next_mz)
> + if (next_mz) {
> + spin_lock_irq(&mctz->lock);
> + __mem_cgroup_insert_exceeded(next_mz, mctz, excess);
> + spin_unlock_irq(&mctz->lock);
> css_put(&next_mz->memcg->css);
> + }
> return nr_reclaimed;
> }
>
> --
> 2.20.1

--
Michal Hocko
SUSE Labs

2021-02-18 19:29:14

by Tim Chen

Subject: Re: [PATCH v2 1/3] mm: Fix dropped memcg from mem cgroup soft limit tree



On 2/18/21 12:24 AM, Michal Hocko wrote:

>
> I have already acked this patch in the previous version along with Fixes
> tag. It seems that my review feedback has been completely ignored also
> for other patches in this series.

Michal,

My apologies. Our mail system screwed up and some mail went missing, so I
completely missed your mail. I only saw it now after looking at
lore.kernel.org.

Responding to your comment:

>Have you observed this happening in the real life? I do agree that the
>threshold based updates of the tree is not ideal but the whole soft
>reclaim code is far from optimal. So why do we care only now? The
>feature is essentially dead and fine tuning it sounds like a step back
>to me.

Yes, I did see the issue mentioned in patch 2 breaking soft limit
reclaim for cgroup v1. There are still some of our customers using
cgroup v1 so we would like to fix this if possible.

For patch 3 regarding the uncharge_batch, it
is more of an observation that we should uncharge in batches of the same
node, not something prompted by an actual workload.
Thinking more about this, the worst that could happen
is we could have some entries in the soft limit tree that overestimate
the memory used, leading to a soft page reclaim
on that cgroup. The overhead from the extra memcg event updates could
be more than a soft page reclaim pass. So let's drop patch 3
for now.

Let me know if you would like me to resend patch 1 with the Fixes tag
for commit 4e41695356fb ("memory controller: soft limit reclaim on contention")
and if there are any changes I should make for patch 2.

Thanks.

Tim

2021-02-18 19:46:10

by Michal Hocko

Subject: Re: [PATCH v2 1/3] mm: Fix dropped memcg from mem cgroup soft limit tree

On Thu 18-02-21 10:30:20, Tim Chen wrote:
>
>
> On 2/18/21 12:24 AM, Michal Hocko wrote:
>
> >
> > I have already acked this patch in the previous version along with Fixes
> > tag. It seems that my review feedback has been completely ignored also
> > for other patches in this series.
>
> Michal,
>
> My apology. Our mail system screwed up and there are some mail missing
> from our mail system that I completely missed your mail.
> Only saw them now after I looked into the lore.kernel.org.

I see. My apologies for suspecting you of ignoring my review.

> Responding to your comment:
>
> >Have you observed this happening in the real life? I do agree that the
> >threshold based updates of the tree is not ideal but the whole soft
> >reclaim code is far from optimal. So why do we care only now? The
> >feature is essentially dead and fine tuning it sounds like a step back
> >to me.
>
> Yes, I did see the issue mentioned in patch 2 breaking soft limit
> reclaim for cgroup v1. There are still some of our customers using
> cgroup v1 so we will like to fix this if possible.

It would be great to see more details.

> For patch 3 regarding the uncharge_batch, it
> is more of an observation that we should uncharge in batch of same node
> and not prompted by actual workload.
> Thinking more about this, the worst that could happen
> is we could have some entries in the soft limit tree that overestimate
> the memory used. The worst that could happen is a soft page reclaim
> on that cgroup. The overhead from extra memcg event update could
> be more than a soft page reclaim pass. So let's drop patch 3
> for now.

I would still prefer to handle that in the soft limit reclaim path and
check each memcg for the soft limit reclaim excess before the reclaim.

> Let me know if you will like me to resend patch 1 with the fixes tag
> for commit 4e41695356fb ("memory controller: soft limit reclaim on contention")
> and if there are any changes I should make for patch 2.

I will ack and suggest Fixes.

>
> Thanks.
>
> Tim

--
Michal Hocko
SUSE Labs

2021-02-18 19:48:50

by Michal Hocko

Subject: Re: [PATCH v2 1/3] mm: Fix dropped memcg from mem cgroup soft limit tree

On Wed 17-02-21 12:41:34, Tim Chen wrote:
> During soft limit memory reclaim, we will temporarily remove the target
> mem cgroup from the cgroup soft limit tree. We then perform memory
> reclaim, update the memory usage excess count and re-insert the mem
> cgroup back into the mem cgroup soft limit tree according to the new
> memory usage excess count.
>
> However, when memory reclaim failed for a maximum number of attempts
> and we bail out of the reclaim loop, we forgot to put the target mem
> cgroup chosen for next reclaim back to the soft limit tree. This prevented
> pages in the mem cgroup from being reclaimed in the future even though
> the mem cgroup exceeded its soft limit. Fix the logic and put the mem
> cgroup back on the tree when page reclaim failed for the mem cgroup.
>
> Reviewed-by: Ying Huang <[email protected]>
> Signed-off-by: Tim Chen <[email protected]>

Fixes: 4e41695356fb ("memory controller: soft limit reclaim on contention")
Acked-by: Michal Hocko <[email protected]>

Thanks!
> ---
> mm/memcontrol.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ed5cc78a8dbf..a51bf90732cb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3505,8 +3505,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> break;
> } while (!nr_reclaimed);
> - if (next_mz)
> + if (next_mz) {
> + spin_lock_irq(&mctz->lock);
> + __mem_cgroup_insert_exceeded(next_mz, mctz, excess);
> + spin_unlock_irq(&mctz->lock);
> css_put(&next_mz->memcg->css);
> + }
> return nr_reclaimed;
> }
>
> --
> 2.20.1

--
Michal Hocko
SUSE Labs

2021-02-18 20:04:54

by Tim Chen

Subject: Re: [PATCH v2 1/3] mm: Fix dropped memcg from mem cgroup soft limit tree



On 2/18/21 11:13 AM, Michal Hocko wrote:
> On Thu 18-02-21 10:30:20, Tim Chen wrote:
>>
>>
>> On 2/18/21 12:24 AM, Michal Hocko wrote:
>>
>>>
>>> I have already acked this patch in the previous version along with Fixes
>>> tag. It seems that my review feedback has been completely ignored also
>>> for other patches in this series.
>>
>> Michal,
>>
>> My apology. Our mail system screwed up and there are some mail missing
>> from our mail system that I completely missed your mail.
>> Only saw them now after I looked into the lore.kernel.org.
>
> I see. My apology for suspecting you from ignoring my review.
>
>> Responding to your comment:
>>
>>> Have you observed this happening in the real life? I do agree that the
>>> threshold based updates of the tree is not ideal but the whole soft
>>> reclaim code is far from optimal. So why do we care only now? The
>>> feature is essentially dead and fine tuning it sounds like a step back
>>> to me.
>>
>> Yes, I did see the issue mentioned in patch 2 breaking soft limit
>> reclaim for cgroup v1. There are still some of our customers using
>> cgroup v1 so we will like to fix this if possible.
>
> It would be great to see more details.
>

The scenario I saw was that we had multiple cgroups running pmbench.
One cgroup exceeded the soft limit and soft reclaim was active on
that cgroup, so there were a whole bunch of memcg events associated
with that cgroup. Then another cgroup started to exceed its
soft limit.

Memory was accessed at a much lower frequency
for the second cgroup. The memcg event update was not triggered for the
second cgroup as the memcg event update didn't happen on the 1024th sample.
The second cgroup was not placed on the soft limit tree and we didn't
try to reclaim the excess pages.

As time went on, we saw that the first cgroup was kept close to its
soft limit due to reclaim activities, while the second cgroup's memory
usage slowly crept up as it kept getting missed from the soft limit tree
update because the update didn't fall on the modulo-1024 sample. As a result,
the memory usage of the second cgroup kept growing over the soft limit
for a long time because its memory events were relatively rare.
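
To make the sampling effect above concrete, here is a small stand-alone
toy model (not kernel code; the struct and its fields are made up for
illustration) of how a cgroup that charges pages rarely can stay below
the 1024-event SOFTLIMIT_EVENTS_TARGET mark, and therefore off the soft
limit tree, while a busy cgroup keeps getting updated:

/*
 * Toy model only: two cgroups charge pages at very different rates.
 * The soft limit tree is only refreshed when a cgroup's own event
 * count crosses a multiple of SOFTLIMIT_EVENTS_TARGET, so the
 * low-rate cgroup can sit over its soft limit without ever being
 * placed on the tree.
 */
#include <stdio.h>

#define SOFTLIMIT_EVENTS_TARGET 1024

struct toy_memcg {
	const char *name;
	unsigned long events;		/* page charge events seen so far */
	unsigned long next_target;	/* next count that triggers a tree update */
	int on_tree;
};

static void charge_page(struct toy_memcg *m)
{
	m->events++;
	if (m->events >= m->next_target) {
		m->next_target = m->events + SOFTLIMIT_EVENTS_TARGET;
		m->on_tree = 1;		/* soft limit tree update happens here */
	}
}

int main(void)
{
	struct toy_memcg busy  = { "busy",  0, SOFTLIMIT_EVENTS_TARGET, 0 };
	struct toy_memcg quiet = { "quiet", 0, SOFTLIMIT_EVENTS_TARGET, 0 };
	int i;

	/* the busy cgroup charges 1000x more often than the quiet one */
	for (i = 0; i < 100000; i++) {
		charge_page(&busy);
		if (i % 1000 == 0)
			charge_page(&quiet);
	}
	printf("busy:  events=%lu on_tree=%d\n", busy.events, busy.on_tree);
	printf("quiet: events=%lu on_tree=%d\n", quiet.events, quiet.on_tree);
	return 0;
}

With these numbers the quiet cgroup only accumulates 100 events, never
reaching the 1024-event target, so its on_tree stays 0 for the whole run.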

>> For patch 3 regarding the uncharge_batch, it
>> is more of an observation that we should uncharge in batch of same node
>> and not prompted by actual workload.
>> Thinking more about this, the worst that could happen
>> is we could have some entries in the soft limit tree that overestimate
>> the memory used. The worst that could happen is a soft page reclaim
>> on that cgroup. The overhead from extra memcg event update could
>> be more than a soft page reclaim pass. So let's drop patch 3
>> for now.
>
> I would still prefer to handle that in the soft limit reclaim path and
> check each memcg for the soft limit reclaim excess before the reclaim.
>

Something like this?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8bddee75f5cb..b50cae3b2a1a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3472,6 +3472,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 		if (!mz)
 			break;
 
+		/*
+		 * Soft limit tree is updated based on memcg events sampling.
+		 * We could have missed some updates on page uncharge and
+		 * the cgroup is below soft limit. Skip useless soft reclaim.
+		 */
+		if (!soft_limit_excess(mz->memcg))
+			continue;
+
 		nr_scanned = 0;
 		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,

Tim

2021-02-19 09:15:03

by Michal Hocko

Subject: Re: [PATCH v2 2/3] mm: Force update of mem cgroup soft limit tree on usage excess

On Wed 17-02-21 12:41:35, Tim Chen wrote:
> To rate limit updates to the mem cgroup soft limit tree, we only perform
> updates every SOFTLIMIT_EVENTS_TARGET (defined as 1024) memory events.
>
> However, this sampling based updates may miss a critical update: i.e. when
> the mem cgroup first exceeded its limit but it was not on the soft limit tree.
> It should be on the tree at that point so it could be subjected to soft
> limit page reclaim. If the mem cgroup had few memory events compared with
> other mem cgroups, we may not update it and place in on the mem cgroup
> soft limit tree for many memory events. And this mem cgroup excess
> usage could creep up and the mem cgroup could be hidden from the soft
> limit page reclaim for a long time.
>
> Fix this issue by forcing an update to the mem cgroup soft limit tree if a
> mem cgroup has exceeded its memory soft limit but it is not on the mem
> cgroup soft limit tree.

Let me copy your clarification from the other reply (this should go to
the changelog btw.):
> The sceanrio I saw was we have multiple cgroups running pmbench.
> One cgroup exceeded the soft limit and soft reclaim is active on
> that cgroup. So there are a whole bunch of memcg events associated
> with that cgroup. Then another cgroup starts to exceed its
> soft limit.
>
> Memory is accessed at a much lower frequency
> for the second cgroup. The memcg event update was not triggered for the
> second cgroup as the memcg event update didn't happened on the 1024th sample.
> The second cgroup was not placed on the soft limit tree and we didn't
> try to reclaim the excess pages.
>
> As time goes on, we saw that the first cgroup was kept close to its
> soft limit due to reclaim activities, while the second cgroup's memory
> usage slowly creep up as it keeps getting missed from the soft limit tree
> update as the update didn't fall on the modulo 1024 sample. As a result,
> the memory usage of the second cgroup keeps growing over the soft limit
> for a long time due to its relatively rare occurrence.

Soft limit is evaluated every THRESHOLDS_EVENTS_TARGET * SOFTLIMIT_EVENTS_TARGET
events. If all events correspond to newly charged memory and the last event
was just about at the soft limit boundary, then we should be bound by 128k
pages (512M, and much more if these were huge pages), which is a lot!
I hadn't realized this was that much. Now I see the problem. This would
be useful information for the changelog.
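
Spelling out the arithmetic behind the bound above (a quick stand-alone
check, assuming 4 KiB base pages):

/* Worked example of the bound quoted above (4 KiB base pages assumed). */
#include <stdio.h>

#define THRESHOLDS_EVENTS_TARGET 128
#define SOFTLIMIT_EVENTS_TARGET  1024

int main(void)
{
	unsigned long events = THRESHOLDS_EVENTS_TARGET * SOFTLIMIT_EVENTS_TARGET;

	printf("events between soft limit evaluations: %lu\n", events);	/* 131072 */
	printf("worst-case growth: %lu MiB\n",
	       events * 4096UL / (1024 * 1024));			/* 512 */
	return 0;
}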

Your fix is focusing on the over-the-limit boundary, which will solve the
problem, but wouldn't that lead to updates happening too often in a
pathological situation when a memcg would get reclaimed immediately?

One way around that would be to lower the SOFTLIMIT_EVENTS_TARGET. Have
you tried that? Do we even need a separate threshold for the soft limit? Why
can't we simply update the tree every MEM_CGROUP_TARGET_THRESH?

> Reviewed-by: Ying Huang <[email protected]>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> mm/memcontrol.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a51bf90732cb..d72449eeb85a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -985,15 +985,22 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
> */
> static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
> {
> + struct mem_cgroup_per_node *mz;
> + bool force_update = false;
> +
> + mz = mem_cgroup_nodeinfo(memcg, page_to_nid(page));
> + if (mz && !mz->on_tree && soft_limit_excess(mz->memcg) > 0)
> + force_update = true;
> +
> /* threshold event is triggered in finer grain than soft limit */
> - if (unlikely(mem_cgroup_event_ratelimit(memcg,
> + if (unlikely((force_update) || mem_cgroup_event_ratelimit(memcg,
> MEM_CGROUP_TARGET_THRESH))) {
> bool do_softlimit;
>
> do_softlimit = mem_cgroup_event_ratelimit(memcg,
> MEM_CGROUP_TARGET_SOFTLIMIT);
> mem_cgroup_threshold(memcg);
> - if (unlikely(do_softlimit))
> + if (unlikely((force_update) || do_softlimit))
> mem_cgroup_update_tree(memcg, page);
> }
> }
> --
> 2.20.1

--
Michal Hocko
SUSE Labs

2021-02-19 09:19:06

by Michal Hocko

Subject: Re: [PATCH v2 3/3] mm: Fix missing mem cgroup soft limit tree updates

On Wed 17-02-21 12:41:36, Tim Chen wrote:
> On a per node basis, the mem cgroup soft limit tree on each node tracks
> how much a cgroup has exceeded its soft limit memory limit and sorts
> the cgroup by its excess usage. On page release, the trees are not
> updated right away, until we have gathered a batch of pages belonging to
> the same cgroup. This reduces the frequency of updating the soft limit tree
> and locking of the tree and associated cgroup.
>
> However, the batch of pages could contain pages from multiple nodes but
> only the soft limit tree from one node would get updated. Change the
> logic so that we update the tree in batch of pages, with each batch of
> pages all in the same mem cgroup and memory node. An update is issued for
> the batch of pages of a node collected till now whenever we encounter
> a page belonging to a different node. Note that this batching for
> the same node logic is only relevant for v1 cgroup that has a memory
> soft limit.

Let me paste the discussion related to this patch from other reply:
> >> For patch 3 regarding the uncharge_batch, it
> >> is more of an observation that we should uncharge in batch of same node
> >> and not prompted by actual workload.
> >> Thinking more about this, the worst that could happen
> >> is we could have some entries in the soft limit tree that overestimate
> >> the memory used. The worst that could happen is a soft page reclaim
> >> on that cgroup. The overhead from extra memcg event update could
> >> be more than a soft page reclaim pass. So let's drop patch 3
> >> for now.
> >
> > I would still prefer to handle that in the soft limit reclaim path and
> > check each memcg for the soft limit reclaim excess before the reclaim.
> >
>
> Something like this?
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8bddee75f5cb..b50cae3b2a1a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3472,6 +3472,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> if (!mz)
> break;
>
> + /*
> + * Soft limit tree is updated based on memcg events sampling.
> + * We could have missed some updates on page uncharge and
> + * the cgroup is below soft limit. Skip useless soft reclaim.
> + */
> + if (!soft_limit_excess(mz->memcg))
> + continue;
> +
> nr_scanned = 0;
> reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,

Yes I meant something like this but then I have looked more closely and
this shouldn't be needed after all. __mem_cgroup_largest_soft_limit_node
already does all the work:
	if (!soft_limit_excess(mz->memcg) ||
	    !css_tryget(&mz->memcg->css))
		goto retry;
so this shouldn't really happen.
--
Michal Hocko
SUSE Labs

2021-02-19 19:03:31

by Tim Chen

Subject: Re: [PATCH v2 2/3] mm: Force update of mem cgroup soft limit tree on usage excess



On 2/19/21 1:11 AM, Michal Hocko wrote:
> On Wed 17-02-21 12:41:35, Tim Chen wrote:

>> Memory is accessed at a much lower frequency
>> for the second cgroup. The memcg event update was not triggered for the
>> second cgroup as the memcg event update didn't happened on the 1024th sample.
>> The second cgroup was not placed on the soft limit tree and we didn't
>> try to reclaim the excess pages.
>>
>> As time goes on, we saw that the first cgroup was kept close to its
>> soft limit due to reclaim activities, while the second cgroup's memory
>> usage slowly creep up as it keeps getting missed from the soft limit tree
>> update as the update didn't fall on the modulo 1024 sample. As a result,
>> the memory usage of the second cgroup keeps growing over the soft limit
>> for a long time due to its relatively rare occurrence.
>
> Soft limit is evaluated every THRESHOLDS_EVENTS_TARGET * SOFTLIMIT_EVENTS_TARGET.
> If all events correspond with a newly charged memory and the last event
> was just about the soft limit boundary then we should be bound by 128k
> pages (512M and much more if this were huge pages) which is a lot!
> I haven't realized this was that much. Now I see the problem. This would
> be a useful information for the changelog.
>
> Your fix is focusing on the over-the-limit boundary which will solve the
> problem but wouldn't that lead to to updates happening too often in
> pathological situation when a memcg would get reclaimed immediatelly?

Not really immediately. The memcg that has the most soft limit excess will
be chosen for page reclaim, which is the way it should be.
It is less likely that a memcg that just exceeded
the soft limit becomes the worst offender immediately. With the fix, we make
sure that it is on the bad guys list, will not be ignored, and will eventually
be chosen for reclaim. It will not sneakily increase its memory usage
slowly.

>
> One way around that would be to lower the SOFTLIMIT_EVENTS_TARGET. Have
> you tried that? Do we even need a separate treshold for soft limit, why
> cannot we simply update the tree each MEM_CGROUP_TARGET_THRESH?
>

Lowering the threshold is a band-aid that really doesn't fix the problem.
I found that if the cgroup touches the memory infrequently enough, you
could still miss its update. And in the meantime, you are updating
things a lot more frequently with added overhead.

Tim

2021-02-19 19:32:30

by Tim Chen

Subject: Re: [PATCH v2 3/3] mm: Fix missing mem cgroup soft limit tree updates



On 2/19/21 1:16 AM, Michal Hocko wrote:

>>
>> Something like this?
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 8bddee75f5cb..b50cae3b2a1a 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3472,6 +3472,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>> if (!mz)
>> break;
>>
>> + /*
>> + * Soft limit tree is updated based on memcg events sampling.
>> + * We could have missed some updates on page uncharge and
>> + * the cgroup is below soft limit. Skip useless soft reclaim.
>> + */
>> + if (!soft_limit_excess(mz->memcg))
>> + continue;
>> +
>> nr_scanned = 0;
>> reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,
>
> Yes I meant something like this but then I have looked more closely and
> this shouldn't be needed afterall. __mem_cgroup_largest_soft_limit_node
> already does all the work
> if (!soft_limit_excess(mz->memcg) ||
> !css_tryget(&mz->memcg->css))
> goto retry;
> so this shouldn't really happen.
>

Ah, that's true. The added check for soft_limit_excess is not needed.

Do you think it is still a good idea to add patch 3 to
restrict the uncharge update to page batches of the same node and cgroup?

I am okay with dropping patch 3 and letting the inaccuracies in the ordering
of the soft limit tree be cleared out by an occasional soft reclaim.
These inaccuracies will still be there even with the patch 3 fix due
to the memcg event sampling. Patch 3 does help keep the soft reclaim
tree ordering more up to date.

Thanks.

Tim

2021-02-20 16:29:01

by Tim Chen

Subject: Re: [PATCH v2 2/3] mm: Force update of mem cgroup soft limit tree on usage excess



On 2/19/21 10:59 AM, Tim Chen wrote:
>
>
> On 2/19/21 1:11 AM, Michal Hocko wrote:

>>
>> Soft limit is evaluated every THRESHOLDS_EVENTS_TARGET * SOFTLIMIT_EVENTS_TARGET.
>> If all events correspond with a newly charged memory and the last event
>> was just about the soft limit boundary then we should be bound by 128k
>> pages (512M and much more if this were huge pages) which is a lot!
>> I haven't realized this was that much. Now I see the problem. This would
>> be a useful information for the changelog.
>>
>> Your fix is focusing on the over-the-limit boundary which will solve the
>> problem but wouldn't that lead to to updates happening too often in
>> pathological situation when a memcg would get reclaimed immediatelly?
>
> Not really immediately. The memcg that has the most soft limit excess will
> be chosen for page reclaim, which is the way it should be.
> It is less likely that a memcg that just exceeded
> the soft limit becomes the worst offender immediately. With the fix, we make
> sure that it is on the bad guys list and will not be ignored and be chosen
> eventually for reclaim. It will not sneakily increase its memory usage
> slowly.
>

I should also mention that the forced update is only performed once, when
the memcg first exceeds the soft limit, as we check whether the memcg
is already in the soft limit tree via the !mz->on_tree check:
+	if (mz && !mz->on_tree && soft_limit_excess(mz->memcg) > 0)
+		force_update = true;

So the update overhead is very low.

Tim

2021-02-22 08:44:56

by Michal Hocko

Subject: Re: [PATCH v2 2/3] mm: Force update of mem cgroup soft limit tree on usage excess

On Fri 19-02-21 10:59:05, Tim Chen wrote:
>
>
> On 2/19/21 1:11 AM, Michal Hocko wrote:
> > On Wed 17-02-21 12:41:35, Tim Chen wrote:
>
> >> Memory is accessed at a much lower frequency
> >> for the second cgroup. The memcg event update was not triggered for the
> >> second cgroup as the memcg event update didn't happened on the 1024th sample.
> >> The second cgroup was not placed on the soft limit tree and we didn't
> >> try to reclaim the excess pages.
> >>
> >> As time goes on, we saw that the first cgroup was kept close to its
> >> soft limit due to reclaim activities, while the second cgroup's memory
> >> usage slowly creep up as it keeps getting missed from the soft limit tree
> >> update as the update didn't fall on the modulo 1024 sample. As a result,
> >> the memory usage of the second cgroup keeps growing over the soft limit
> >> for a long time due to its relatively rare occurrence.
> >
> > Soft limit is evaluated every THRESHOLDS_EVENTS_TARGET * SOFTLIMIT_EVENTS_TARGET.
> > If all events correspond with a newly charged memory and the last event
> > was just about the soft limit boundary then we should be bound by 128k
> > pages (512M and much more if this were huge pages) which is a lot!
> > I haven't realized this was that much. Now I see the problem. This would
> > be a useful information for the changelog.
> >
> > Your fix is focusing on the over-the-limit boundary which will solve the
> > problem but wouldn't that lead to to updates happening too often in
> > pathological situation when a memcg would get reclaimed immediatelly?
>
> Not really immediately. The memcg that has the most soft limit excess will
> be chosen for page reclaim, which is the way it should be.
> It is less likely that a memcg that just exceeded
> the soft limit becomes the worst offender immediately.

Well this all depends on when the soft limit reclaim triggers. In
other words, how often you see the global memory reclaim. If we have a
memcg with a sufficient excess then this will work mostly fine. I was more
worried about a case when you have memcgs just slightly over the limit
and the global memory pressure is a regular event. You can easily end up
bouncing memcgs off and on the tree in a rapid fashion.

> With the fix, we make
> sure that it is on the bad guys list and will not be ignored and be chosen
> eventually for reclaim. It will not sneakily increase its memory usage
> slowly.
>
> >
> > One way around that would be to lower the SOFTLIMIT_EVENTS_TARGET. Have
> > you tried that? Do we even need a separate treshold for soft limit, why
> > cannot we simply update the tree each MEM_CGROUP_TARGET_THRESH?
> >
>
> Lowering the threshold is a band aid that really doesn't fix the problem.
> I found that if the cgroup touches the memory infrequently enough, you
> could still miss the update of it. And in the mean time, you are updating
> things a lot more frequently with added overhead.

Yes, I agree this is more of a workaround than a fix but I would rather
go and touch the threshold which is simply bad than play more tricks
which can lead to other potential problems. All that for a feature which
is rarely used and quite problematic in itself. Not sure what Johannes
thinks about that.
--
Michal Hocko
SUSE Labs

2021-02-22 08:45:13

by Michal Hocko

Subject: Re: [PATCH v2 3/3] mm: Fix missing mem cgroup soft limit tree updates

On Fri 19-02-21 11:28:47, Tim Chen wrote:
>
>
> On 2/19/21 1:16 AM, Michal Hocko wrote:
>
> >>
> >> Something like this?
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 8bddee75f5cb..b50cae3b2a1a 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -3472,6 +3472,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> >> if (!mz)
> >> break;
> >>
> >> + /*
> >> + * Soft limit tree is updated based on memcg events sampling.
> >> + * We could have missed some updates on page uncharge and
> >> + * the cgroup is below soft limit. Skip useless soft reclaim.
> >> + */
> >> + if (!soft_limit_excess(mz->memcg))
> >> + continue;
> >> +
> >> nr_scanned = 0;
> >> reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,
> >
> > Yes I meant something like this but then I have looked more closely and
> > this shouldn't be needed afterall. __mem_cgroup_largest_soft_limit_node
> > already does all the work
> > if (!soft_limit_excess(mz->memcg) ||
> > !css_tryget(&mz->memcg->css))
> > goto retry;
> > so this shouldn't really happen.
> >
>
> Ah, that's true. The added check for soft_limit_excess is not needed.
>
> Do you think it is still a good idea to add patch 3 to
> restrict the uncharge update in page batch of the same node and cgroup?

I would rather drop it. The less the soft limit reclaim code is spread
around the better.
--
Michal Hocko
SUSE Labs

2021-02-22 17:45:45

by Tim Chen

Subject: Re: [PATCH v2 2/3] mm: Force update of mem cgroup soft limit tree on usage excess



On 2/22/21 12:40 AM, Michal Hocko wrote:
> On Fri 19-02-21 10:59:05, Tim Chen wrote:
occurrence.
>>>
>>> Soft limit is evaluated every THRESHOLDS_EVENTS_TARGET * SOFTLIMIT_EVENTS_TARGET.
>>> If all events correspond with a newly charged memory and the last event
>>> was just about the soft limit boundary then we should be bound by 128k
>>> pages (512M and much more if this were huge pages) which is a lot!
>>> I haven't realized this was that much. Now I see the problem. This would
>>> be a useful information for the changelog.
>>>
>>> Your fix is focusing on the over-the-limit boundary which will solve the
>>> problem but wouldn't that lead to to updates happening too often in
>>> pathological situation when a memcg would get reclaimed immediatelly?
>>
>> Not really immediately. The memcg that has the most soft limit excess will
>> be chosen for page reclaim, which is the way it should be.
>> It is less likely that a memcg that just exceeded
>> the soft limit becomes the worst offender immediately.
>
> Well this all depends on when the the soft limit reclaim triggeres. In
> other words how often you see the global memory reclaim. If we have a
> memcg with a sufficient excess then this will work mostly fine. I was more
> worried about a case when you have memcgs just slightly over the limit
> and the global memory pressure is a regular event. You can easily end up
> bouncing memcgs off and on the tree in a rapid fashion.
>

If you are concerned about such a case, we can add an excess threshold,
say 4 MB (or 1024 4K pages), before we trigger a forced update. Do you think
that will cover this concern?

>>>
>>> One way around that would be to lower the SOFTLIMIT_EVENTS_TARGET. Have
>>> you tried that? Do we even need a separate treshold for soft limit, why
>>> cannot we simply update the tree each MEM_CGROUP_TARGET_THRESH?
>>>
>>
>> Lowering the threshold is a band aid that really doesn't fix the problem.
>> I found that if the cgroup touches the memory infrequently enough, you
>> could still miss the update of it. And in the mean time, you are updating
>> things a lot more frequently with added overhead.
>
> Yes, I agree this is more of a workaround than a fix but I would rather
> go and touch the threshold which is simply bad than play more tricks
> which can lead to other potential problems. All that for a feature which
> is rarely used and quite problematic in itself. Not sure what Johannes
> thinks about that.
>

I have actually tried adjusting the threshold but found that it doesn't work well for
the case with uneven memory access frequency between cgroups. The memory usage
of the low-memory-event cgroup could creep up quite a lot, exceeding
the soft limit by hundreds of MB, even
if I drop the SOFTLIMIT_EVENTS_TARGET from 1024 to something like 8.

Tim

2021-02-22 17:49:58

by Tim Chen

Subject: Re: [PATCH v2 3/3] mm: Fix missing mem cgroup soft limit tree updates



On 2/22/21 12:41 AM, Michal Hocko wrote:

>>
>>
>> Ah, that's true. The added check for soft_limit_excess is not needed.
>>
>> Do you think it is still a good idea to add patch 3 to
>> restrict the uncharge update in page batch of the same node and cgroup?
>
> I would rather drop it. The less the soft limit reclaim code is spread
> around the better.
>

Let's drop patch 3 then. I find patch 2 is the most critical one in this series.
Without that patch, some cgroups exceed their soft limit very badly.

Tim

2021-02-22 19:16:50

by Michal Hocko

Subject: Re: [PATCH v2 2/3] mm: Force update of mem cgroup soft limit tree on usage excess

On Mon 22-02-21 09:41:00, Tim Chen wrote:
>
>
> On 2/22/21 12:40 AM, Michal Hocko wrote:
> > On Fri 19-02-21 10:59:05, Tim Chen wrote:
> occurrence.
> >>>
> >>> Soft limit is evaluated every THRESHOLDS_EVENTS_TARGET * SOFTLIMIT_EVENTS_TARGET.
> >>> If all events correspond with a newly charged memory and the last event
> >>> was just about the soft limit boundary then we should be bound by 128k
> >>> pages (512M and much more if this were huge pages) which is a lot!
> >>> I haven't realized this was that much. Now I see the problem. This would
> >>> be a useful information for the changelog.
> >>>
> >>> Your fix is focusing on the over-the-limit boundary which will solve the
> >>> problem but wouldn't that lead to to updates happening too often in
> >>> pathological situation when a memcg would get reclaimed immediatelly?
> >>
> >> Not really immediately. The memcg that has the most soft limit excess will
> >> be chosen for page reclaim, which is the way it should be.
> >> It is less likely that a memcg that just exceeded
> >> the soft limit becomes the worst offender immediately.
> >
> > Well this all depends on when the the soft limit reclaim triggeres. In
> > other words how often you see the global memory reclaim. If we have a
> > memcg with a sufficient excess then this will work mostly fine. I was more
> > worried about a case when you have memcgs just slightly over the limit
> > and the global memory pressure is a regular event. You can easily end up
> > bouncing memcgs off and on the tree in a rapid fashion.
> >
>
> If you are concerned about such a case, we can add an excess threshold,
> say 4 MB (or 1024 4K pages), before we trigger a forced update. You think
> that will cover this concern?

Yes, some sort of rate limiting should help. My understanding has been
that this is the main purpose of the event counting threshold. The code
as we have it doesn't seem to work properly, so there are two ways: either
tune the existing threshold or replace it with something else. Having both
a forced update and a non-functional threshold is not a great outcome.

> >>> One way around that would be to lower the SOFTLIMIT_EVENTS_TARGET. Have
> >>> you tried that? Do we even need a separate treshold for soft limit, why
> >>> cannot we simply update the tree each MEM_CGROUP_TARGET_THRESH?
> >>>
> >>
> >> Lowering the threshold is a band aid that really doesn't fix the problem.
> >> I found that if the cgroup touches the memory infrequently enough, you
> >> could still miss the update of it. And in the mean time, you are updating
> >> things a lot more frequently with added overhead.
> >
> > Yes, I agree this is more of a workaround than a fix but I would rather
> > go and touch the threshold which is simply bad than play more tricks
> > which can lead to other potential problems. All that for a feature which
> > is rarely used and quite problematic in itself. Not sure what Johannes
> > thinks about that.
> >
>
> I actually have tried adjusting the threshold but found that it doesn't work well for
> the case with unenven memory access frequency between cgroups. The soft
> limit for the low memory event cgroup could creep up quite a lot, exceeding
> the soft limit by hundreds of MB, even
> if I drop the SOFTLIMIT_EVENTS_TARGET from 1024 to something like 8.

What was the underlying reason? Higher order allocations?

--
Michal Hocko
SUSE Labs

2021-02-22 19:30:03

by Tim Chen

Subject: Re: [PATCH v2 2/3] mm: Force update of mem cgroup soft limit tree on usage excess



On 2/22/21 11:09 AM, Michal Hocko wrote:
> On Mon 22-02-21 09:41:00, Tim Chen wrote:
>>
>>
>> On 2/22/21 12:40 AM, Michal Hocko wrote:
>>> On Fri 19-02-21 10:59:05, Tim Chen wrote:
>> occurrence.
>>>>>
>>>>> Soft limit is evaluated every THRESHOLDS_EVENTS_TARGET * SOFTLIMIT_EVENTS_TARGET.
>>>>> If all events correspond with a newly charged memory and the last event
>>>>> was just about the soft limit boundary then we should be bound by 128k
>>>>> pages (512M and much more if this were huge pages) which is a lot!
>>>>> I haven't realized this was that much. Now I see the problem. This would
>>>>> be a useful information for the changelog.
>>>>>
>>>>> Your fix is focusing on the over-the-limit boundary which will solve the
>>>>> problem but wouldn't that lead to to updates happening too often in
>>>>> pathological situation when a memcg would get reclaimed immediatelly?
>>>>
>>>> Not really immediately. The memcg that has the most soft limit excess will
>>>> be chosen for page reclaim, which is the way it should be.
>>>> It is less likely that a memcg that just exceeded
>>>> the soft limit becomes the worst offender immediately.
>>>
>>> Well this all depends on when the the soft limit reclaim triggeres. In
>>> other words how often you see the global memory reclaim. If we have a
>>> memcg with a sufficient excess then this will work mostly fine. I was more
>>> worried about a case when you have memcgs just slightly over the limit
>>> and the global memory pressure is a regular event. You can easily end up
>>> bouncing memcgs off and on the tree in a rapid fashion.
>>>
>>
>> If you are concerned about such a case, we can add an excess threshold,
>> say 4 MB (or 1024 4K pages), before we trigger a forced update. You think
>> that will cover this concern?
>
> Yes some sort of rate limiting should help. My understanding has been
> that this is the main purpose of the even counting threshold. The code
> as we have doesn't seem to work properly so there are two ways, either
> tune the existing threshold or replace it by something else. Having both
> a force update and non-functional threshold is not a great outcome.
>

The event counting threshold's purpose is to limit the *rate* of event updates. The
side effect is that we will miss some updates due to the sampling nature. However, we can't
afford to let the sampling get too frequent or the overhead will be too much.

The forced update makes sure that a needed update to put the memcg on
the tree happens, and we don't allow the
memcg to escape page reclaim even if updates happen relatively rarely. Lowering
the threshold does not guarantee that the update happens, resulting in a runaway
memcg. The forced update and the threshold serve different purposes. In my opinion
the forced update is necessary. We can keep the update overhead low with a
low update rate, and yet make sure that a needed update happens. I think both
are necessary for proper function. The overhead from the forced update
is negligible and I don't see a downside to adding it.

Tim

2021-02-22 19:54:10

by Tim Chen

Subject: Re: [PATCH v2 2/3] mm: Force update of mem cgroup soft limit tree on usage excess



On 2/22/21 11:09 AM, Michal Hocko wrote:

>>
>> I actually have tried adjusting the threshold but found that it doesn't work well for
>> the case with unenven memory access frequency between cgroups. The soft
>> limit for the low memory event cgroup could creep up quite a lot, exceeding
>> the soft limit by hundreds of MB, even
>> if I drop the SOFTLIMIT_EVENTS_TARGET from 1024 to something like 8.
>
> What was the underlying reason? Higher order allocations?
>

Not high order allocations.

The reason was that the runaway memcg asks for memory much less often, compared
to the other memcgs in the system. So it escapes the sampling update,
is not put onto the tree, and exceeds the soft limit
pretty badly. Even if it were put onto the tree and got page reclaimed below the
limit, it could escape the sampling the next time it exceeds the soft limit.

As long as we are doing sampling updates, this problem is baked in unless we
add a check to make sure that the memcg is subjected to page reclaim as long
as it exceeds the soft limit.

Tim

2021-02-22 20:25:20

by Tim Chen

Subject: Re: [PATCH v2 3/3] mm: Fix missing mem cgroup soft limit tree updates



On 2/17/21 9:56 PM, Johannes Weiner wrote:

>> static inline void uncharge_gather_clear(struct uncharge_gather *ug)
>> @@ -6849,7 +6850,13 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
>> * exclusive access to the page.
>> */
>>
>> - if (ug->memcg != page_memcg(page)) {
>> + if (ug->memcg != page_memcg(page) ||
>> + /*
>> + * Update soft limit tree used in v1 cgroup in page batch for
>> + * the same node. Relevant only to v1 cgroup with a soft limit.
>> + */
>> + (ug->dummy_page && ug->nid != page_to_nid(page) &&
>> + ug->memcg->soft_limit != PAGE_COUNTER_MAX)) {
>
> Sorry, I used weird phrasing in my last email.
>
> Can you please preface the checks you're adding with a
> !cgroup_subsys_on_dfl(memory_cgrp_subsys) to static branch for
> cgroup1? The uncharge path is pretty hot, and this would avoid the
> runtime overhead on cgroup2 at least, which doesn't have the SL.
>
> Also, do we need the ug->dummy_page check? It's only NULL on the first
> loop - where ug->memcg is NULL as well and the branch is taken anyway.
>
> The soft limit check is also slightly cheaper than the nid check, as
> page_to_nid() might be out-of-line, so we should do it first. This?
>
> /*
> * Batch-uncharge all pages of the same memcg.
> *
> * Unless we're looking at a cgroup1 with a softlimit
> * set: the soft limit trees are maintained per-node
> * and updated on uncharge (via dummy_page), so keep
> * batches confined to a single node as well.
> */
> if (ug->memcg != page_memcg(page) ||
> (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
> ug->memcg->soft_limit != PAGE_COUNTER_MAX &&
> ug->nid != page_to_nid(page)))
>

Johannes,

Thanks for your feedback. Since Michal has concerns about the overhead
this patch could incur, I think we'll hold the patch for now. If later
on Michal thinks that this patch is a good idea, I'll incorporate the
changes you suggested.

Tim

2021-02-23 15:22:30

by Johannes Weiner

Subject: Re: [PATCH v2 3/3] mm: Fix missing mem cgroup soft limit tree updates

On Mon, Feb 22, 2021 at 10:38:27AM -0800, Tim Chen wrote:
> Johannes,
>
> Thanks for your feedback. Since Michal has concerns about the overhead
> this patch could incur, I think we'll hold the patch for now. If later
> on Michal think that this patch is a good idea, I'll incorporate these
> changes you suggested.

That works for me.

Thanks!

2021-02-24 11:59:38

by Michal Hocko

Subject: Re: [PATCH v2 2/3] mm: Force update of mem cgroup soft limit tree on usage excess

On Mon 22-02-21 11:48:37, Tim Chen wrote:
>
>
> On 2/22/21 11:09 AM, Michal Hocko wrote:
>
> >>
> >> I actually have tried adjusting the threshold but found that it doesn't work well for
> >> the case with unenven memory access frequency between cgroups. The soft
> >> limit for the low memory event cgroup could creep up quite a lot, exceeding
> >> the soft limit by hundreds of MB, even
> >> if I drop the SOFTLIMIT_EVENTS_TARGET from 1024 to something like 8.
> >
> > What was the underlying reason? Higher order allocations?
> >
>
> Not high order allocation.
>
> The reason was because the run away memcg asks for memory much less often, compared
> to the other memcgs in the system. So it escapes the sampling update and
> was not put onto the tree and exceeds the soft limit
> pretty badly. Even if it was put onto the tree and gets page reclaimed below the
> limit, it could escape the sampling the next time it exceeds the soft limit.

I am sorry but I really do not follow. Maybe I am missing something
obvious, but the rate of events (charge/uncharge) shouldn't really be
important. There is no way to exceed the limit without charging memory
(either new memory or via task migration in v1 and immigrate_on_move). If you
have SOFTLIMIT_EVENTS_TARGET 8 then it should be 128 * 8 events to
re-evaluate. Huge pages can make the runaway much bigger, but how would it
be possible to run away outside of that bound?

Btw. do we really need SOFTLIMIT_EVENTS_TARGET at all? Why can't we
just stick with a single threshold? mem_cgroup_update_tree can be made
effectively a noop when there is no soft limit in place, so the overhead
shouldn't matter for the vast majority of workloads.
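
For illustration only, a minimal sketch of what such an early bail-out
could look like, assuming the existing soft_limit field and the
PAGE_COUNTER_MAX sentinel, and noting that the check has to cover
ancestors too since the real function walks up the hierarchy; this is a
sketch of the idea above, not a submitted patch:

	/* Sketch: skip the per-node rb-tree work entirely when no memcg
	 * in the hierarchy has a v1 soft limit configured. */
	static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
	{
		struct mem_cgroup *iter;

		for (iter = memcg; iter; iter = parent_mem_cgroup(iter))
			if (iter->soft_limit != PAGE_COUNTER_MAX)
				break;
		if (!iter)
			return;		/* nobody set a soft limit: nothing to do */

		/* ... existing per-node tree update would continue here ... */
	}
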
--
Michal Hocko
SUSE Labs

2021-02-25 22:31:12

by Tim Chen

Subject: Re: [PATCH v2 2/3] mm: Force update of mem cgroup soft limit tree on usage excess



On 2/22/21 9:41 AM, Tim Chen wrote:
>
>
> On 2/22/21 12:40 AM, Michal Hocko wrote:
>> On Fri 19-02-21 10:59:05, Tim Chen wrote:
> occurrence.
>>>>
>>>> Soft limit is evaluated every THRESHOLDS_EVENTS_TARGET * SOFTLIMIT_EVENTS_TARGET.
>>>> If all events correspond with a newly charged memory and the last event
>>>> was just about the soft limit boundary then we should be bound by 128k
>>>> pages (512M and much more if this were huge pages) which is a lot!
>>>> I haven't realized this was that much. Now I see the problem. This would
>>>> be a useful information for the changelog.
>>>>
>>>> Your fix is focusing on the over-the-limit boundary which will solve the
>>>> problem but wouldn't that lead to to updates happening too often in
>>>> pathological situation when a memcg would get reclaimed immediatelly?
>>>
>>> Not really immediately. The memcg that has the most soft limit excess will
>>> be chosen for page reclaim, which is the way it should be.
>>> It is less likely that a memcg that just exceeded
>>> the soft limit becomes the worst offender immediately.
>>
>> Well this all depends on when the the soft limit reclaim triggeres. In
>> other words how often you see the global memory reclaim. If we have a
>> memcg with a sufficient excess then this will work mostly fine. I was more
>> worried about a case when you have memcgs just slightly over the limit
>> and the global memory pressure is a regular event. You can easily end up
>> bouncing memcgs off and on the tree in a rapid fashion.
>>
>
> If you are concerned about such a case, we can add an excess threshold,
> say 4 MB (or 1024 4K pages), before we trigger a forced update. You think
> that will cover this concern?
>

Michal,

How about modifying this patch with a threshold, like the following?

Tim

---
From 5a78ab56e2e654290cacab2f5a1631e1da1d90d2 Mon Sep 17 00:00:00 2001
From: Tim Chen <[email protected]>
Date: Wed, 3 Feb 2021 14:08:48 -0800
Subject: [PATCH] mm: Force update of mem cgroup soft limit tree on usage
excess

To rate limit updates to the mem cgroup soft limit tree, we only perform
updates every SOFTLIMIT_EVENTS_TARGET (defined as 1024) memory events.

However, these sampling based updates may miss a critical update: i.e. when
the mem cgroup first exceeds its limit but is not yet on the soft limit tree.
It should be on the tree at that point so it could be subjected to soft
limit page reclaim. If the mem cgroup has few memory events compared with
other mem cgroups, we may not update it and place it on the mem cgroup
soft limit tree for many memory events. This mem cgroup's excess
usage could creep up and the mem cgroup could be hidden from the soft
limit page reclaim for a long time.

Fix this issue by forcing an update to the mem cgroup soft limit tree if a
mem cgroup has exceeded its memory soft limit but it is not on the mem
cgroup soft limit tree.

---
mm/memcontrol.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a51bf90732cb..e0f6948f8ea5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -104,6 +104,7 @@ static bool do_memsw_account(void)

#define THRESHOLDS_EVENTS_TARGET 128
#define SOFTLIMIT_EVENTS_TARGET 1024
+#define SOFTLIMIT_EXCESS_THRESHOLD 1024

/*
* Cgroups above their limits are maintained in a RB-Tree, independent of
@@ -985,15 +986,29 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
*/
static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
{
+ struct mem_cgroup_per_node *mz;
+ bool force_update = false;
+
+ mz = mem_cgroup_nodeinfo(memcg, page_to_nid(page));
+ /*
+ * mem_cgroup_update_tree may not be called for a memcg exceeding
+ * soft limit due to the sampling nature of update. Don't allow
+ * a memcg to be left out of the tree if it has too much usage
+ * excess.
+ */
+ if (mz && !mz->on_tree &&
+ soft_limit_excess(mz->memcg) > SOFTLIMIT_EXCESS_THRESHOLD)
+ force_update = true;
+
/* threshold event is triggered in finer grain than soft limit */
- if (unlikely(mem_cgroup_event_ratelimit(memcg,
+ if (unlikely((force_update) || mem_cgroup_event_ratelimit(memcg,
MEM_CGROUP_TARGET_THRESH))) {
bool do_softlimit;

do_softlimit = mem_cgroup_event_ratelimit(memcg,
MEM_CGROUP_TARGET_SOFTLIMIT);
mem_cgroup_threshold(memcg);
- if (unlikely(do_softlimit))
+ if (unlikely((force_update) || do_softlimit))
mem_cgroup_update_tree(memcg, page);
}
}
--
2.20.1


2021-02-25 22:51:47

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mm: Force update of mem cgroup soft limit tree on usage excess



On 2/24/21 3:53 AM, Michal Hocko wrote:
> On Mon 22-02-21 11:48:37, Tim Chen wrote:
>>
>>
>> On 2/22/21 11:09 AM, Michal Hocko wrote:
>>
>>>>
>>>> I actually have tried adjusting the threshold but found that it doesn't work well for
>>>> the case with uneven memory access frequency between cgroups. The soft
>>>> limit for the low memory event cgroup could creep up quite a lot, exceeding
>>>> the soft limit by hundreds of MB, even
>>>> if I drop the SOFTLIMIT_EVENTS_TARGET from 1024 to something like 8.
>>>
>>> What was the underlying reason? Higher order allocations?
>>>
>>
>> Not high order allocation.
>>
>> The reason was because the run away memcg asks for memory much less often, compared
>> to the other memcgs in the system. So it escapes the sampling update and
>> was not put onto the tree and exceeds the soft limit
>> pretty badly. Even if it was put onto the tree and gets page reclaimed below the
>> limit, it could escape the sampling the next time it exceeds the soft limit.
>
> I am sorry but I really do not follow. Maybe I am missing something
> obvious but the rate of events (charge/uncharge) shouldn't really be
> important. There is no way to exceed the limit without charging memory
> (either a new charge or via task migration in v1 with
> move_charge_at_immigrate). If you have SOFTLIMIT_EVENTS_TARGET 8 then it
> should take 128 * 8 events to re-evaluate. Huge pages can make the
> runaway much bigger, but how would it be possible to run away outside of
> that bound?


Michal,

Let's take an extreme case where memcg 1 always generates the
first event and memcg 2 generates the rest of the 128*8-1 events,
and the pattern repeats. The tree update happens on the 128*8th event,
so memcg 1 never triggers a tree update. In this case we will
keep missing memcg 1's events and never put memcg 1 on the tree.

Something like this pattern of memory events


cg1 cg2 cg2 cg2 ....cg2 cg1 cg2 cg2 cg2....cg2 cg1 cg2 .....
                     ^                      ^
                update tree            update tree

Of course, in real life the update events are random in nature.
However, because memcg 1's events occur so rarely, we can miss
updating it for a long time.
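
A toy model of that pattern (purely illustrative userspace C; it assumes
a single shared event counter and a fixed 1024-event sampling window,
not the real kernel data structures):

	#include <stdio.h>

	#define WINDOW	1024	/* stand-in for the sampling window, in events */

	int main(void)
	{
		long shared_events = 0, cg1_events = 0, cg1_tree_updates = 0;

		for (long i = 0; i < 100L * WINDOW; i++) {
			/* cg1 charges the first page of every window, cg2 the rest */
			int owner = (i % WINDOW == 0) ? 1 : 2;

			if (owner == 1)
				cg1_events++;
			/*
			 * The sampled tree update fires on the WINDOW-th event,
			 * which in this pattern always belongs to cg2.
			 */
			if (++shared_events % WINDOW == 0 && owner == 1)
				cg1_tree_updates++;
		}
		printf("cg1 events: %ld, tree updates triggered by cg1: %ld\n",
		       cg1_events, cg1_tree_updates);	/* prints 100 and 0 */
		return 0;
	}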

>
> Btw. do we really need SOFTLIMIT_EVENTS_TARGET at all? Why can't we
> just stick with a single threshold? mem_cgroup_update_tree can be made
> effectively a noop when there is no soft limit in place, so the overhead
> shouldn't matter for the vast majority of workloads.
>

I think there are two limits because the original code wants
mem_cgroup_threshold to be updated more frequently than the
soft_limit_tree. The soft limit tree update is more costly.

Tim

2021-02-26 08:56:00

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mm: Force update of mem cgroup soft limit tree on usage excess

On Thu 25-02-21 14:48:58, Tim Chen wrote:
>
>
> On 2/24/21 3:53 AM, Michal Hocko wrote:
> > On Mon 22-02-21 11:48:37, Tim Chen wrote:
> >>
> >>
> >> On 2/22/21 11:09 AM, Michal Hocko wrote:
> >>
> >>>>
> >>>> I actually have tried adjusting the threshold but found that it doesn't work well for
> >>>> the case with uneven memory access frequency between cgroups. The soft
> >>>> limit for the low memory event cgroup could creep up quite a lot, exceeding
> >>>> the soft limit by hundreds of MB, even
> >>>> if I drop the SOFTLIMIT_EVENTS_TARGET from 1024 to something like 8.
> >>>
> >>> What was the underlying reason? Higher order allocations?
> >>>
> >>
> >> Not high order allocation.
> >>
> >> The reason was because the run away memcg asks for memory much less often, compared
> >> to the other memcgs in the system. So it escapes the sampling update and
> >> was not put onto the tree and exceeds the soft limit
> >> pretty badly. Even if it was put onto the tree and gets page reclaimed below the
> >> limit, it could escape the sampling the next time it exceeds the soft limit.
> >
> > I am sorry but I really do not follow. Maybe I am missing something
> > obvious but the rate of events (charge/uncharge) shouldn't really be
> > important. There is no way to exceed the limit without charging memory
> > (either a new charge or via task migration in v1 with
> > move_charge_at_immigrate). If you have SOFTLIMIT_EVENTS_TARGET 8 then it
> > should take 128 * 8 events to re-evaluate. Huge pages can make the
> > runaway much bigger, but how would it be possible to run away outside of
> > that bound?
>
>
> Michal,
>
> Let's take an extreme case where memcg 1 always generates the
> first event and memcg 2 generates the rest of the 128*8-1 events,
> and the pattern repeats.

I do not follow. Events are per-memcg, aren't they?
__this_cpu_read(memcg->vmstats_percpu->targets[target]);
[...]
__this_cpu_write(memcg->vmstats_percpu->targets[target], next);
--
Michal Hocko
SUSE Labs

2021-02-27 01:02:26

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mm: Force update of mem cgroup soft limit tree on usage excess



On 2/26/21 12:52 AM, Michal Hocko wrote:

>>
>> Michal,
>>
>> Let's take an extreme case where memcg 1 always generates the
>> first event and memcg 2 generates the rest of the 128*8-1 events,
>> and the pattern repeats.
>
> I do not follow. Events are per-memcg, aren't they?
> __this_cpu_read(memcg->vmstats_percpu->targets[target]);
> [...]
> __this_cpu_write(memcg->vmstats_percpu->targets[target], next);
>

You are right. My previous reasoning is incorrect as the sampling is done per memcg.
I'll do some additional debugging on why memcg is not on the tree.

Tim

2021-03-02 20:03:09

by kernel test robot

[permalink] [raw]
Subject: [mm] 4f09feb8bf: vm-scalability.throughput -4.3% regression


Greeting,

FYI, we noticed a -4.3% regression of vm-scalability.throughput due to commit:


commit: 4f09feb8bf083be3834080ddf3782aee12a7c3f7 ("mm: Force update of mem cgroup soft limit tree on usage excess")
url: https://github.com/0day-ci/linux/commits/Tim-Chen/Soft-limit-memory-management-bug-fixes/20210218-054228


in testcase: vm-scalability
on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
with following parameters:

runtime: 300s
test: lru-file-readonce
cpufreq_governor: performance
ucode: 0x5003006

test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/



If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml
bin/lkp run compatible-job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/debian-10.4-x86_64-20200603.cgz/300s/lkp-csl-2ap4/lru-file-readonce/vm-scalability/0x5003006

commit:
f0812bba8b ("mm: Fix dropped memcg from mem cgroup soft limit tree")
4f09feb8bf ("mm: Force update of mem cgroup soft limit tree on usage excess")

f0812bba8bbd02bf 4f09feb8bf083be3834080ddf37
---------------- ---------------------------
%stddev %change %stddev
\ | \
131114 -4.2% 125626 vm-scalability.median
25345052 -4.3% 24265993 vm-scalability.throughput
224.30 +3.6% 232.43 vm-scalability.time.elapsed_time
224.30 +3.6% 232.43 vm-scalability.time.elapsed_time.max
31611 +4.5% 33046 vm-scalability.time.system_time
5225 -2.5% 5095 vmstat.system.cs
95143 ? 4% +20.7% 114842 ? 15% meminfo.Active
94109 ? 4% +20.9% 113806 ? 15% meminfo.Active(anon)
7347 ? 8% +17.9% 8665 ? 11% softirqs.CPU73.RCU
6628 ? 7% +21.3% 8040 ? 11% softirqs.CPU77.RCU
7378 ? 7% +35.9% 10027 ? 20% softirqs.CPU90.RCU
6976 ? 4% +31.5% 9173 ? 25% softirqs.CPU95.RCU
8983 ? 24% +19.5% 10737 ? 4% softirqs.CPU96.SCHED
220617 ? 33% -68.4% 69618 ?106% numa-meminfo.node0.Inactive(anon)
4442 ? 24% -31.2% 3055 ? 18% numa-meminfo.node0.PageTables
2964 ? 45% +153.1% 7504 ? 23% numa-meminfo.node1.Active
2620 ? 39% +179.8% 7332 ? 25% numa-meminfo.node1.Active(anon)
37702 ?130% +248.6% 131446 ? 60% numa-meminfo.node1.AnonPages
39649 ?131% +368.8% 185879 ? 43% numa-meminfo.node1.Inactive(anon)
3078 ? 11% +50.9% 4647 ? 21% numa-meminfo.node1.PageTables
4610 ? 59% +1241.2% 61840 ? 78% numa-meminfo.node1.Shmem
23809 ? 4% +20.6% 28704 ? 15% proc-vmstat.nr_active_anon
939.33 ? 4% -8.1% 863.33 ? 3% proc-vmstat.nr_isolated_file
56262 +9.7% 61735 ? 8% proc-vmstat.nr_shmem
23811 ? 4% +20.6% 28705 ? 15% proc-vmstat.nr_zone_active_anon
75883 ? 2% +23.1% 93446 ? 19% proc-vmstat.pgactivate
1038398 +5.0% 1089871 ? 2% proc-vmstat.pgfault
65866 +3.5% 68179 proc-vmstat.pgreuse
18338900 ? 6% -8.5% 16783260 ? 4% proc-vmstat.slabs_scanned
1216 ? 9% -8.2% 1116 ? 3% interrupts.CPU145.CAL:Function_call_interrupts
27.33 ? 16% +195.1% 80.67 ? 72% interrupts.CPU172.RES:Rescheduling_interrupts
162.00 ? 4% +21.0% 196.00 ? 17% interrupts.CPU33.RES:Rescheduling_interrupts
163.83 ? 3% +15.8% 189.67 ? 14% interrupts.CPU34.RES:Rescheduling_interrupts
129.83 ? 2% +19.8% 155.50 ? 9% interrupts.CPU56.RES:Rescheduling_interrupts
97.50 ? 10% +40.0% 136.50 ? 20% interrupts.CPU65.RES:Rescheduling_interrupts
261.17 ? 44% -44.2% 145.67 ? 18% interrupts.CPU73.RES:Rescheduling_interrupts
49.17 ? 48% +133.9% 115.00 ? 36% interrupts.CPU85.RES:Rescheduling_interrupts
41.83 ? 24% +144.2% 102.17 ? 52% interrupts.CPU88.RES:Rescheduling_interrupts
48.17 ? 38% +215.2% 151.83 ? 54% interrupts.CPU89.RES:Rescheduling_interrupts
38.17 ? 15% +106.1% 78.67 ? 39% interrupts.CPU90.RES:Rescheduling_interrupts
55160 ? 33% -68.5% 17396 ?106% numa-vmstat.node0.nr_inactive_anon
3614 ? 8% -17.7% 2974 ? 11% numa-vmstat.node0.nr_mapped
1105 ? 24% -31.0% 762.67 ? 17% numa-vmstat.node0.nr_page_table_pages
55168 ? 33% -68.5% 17402 ?106% numa-vmstat.node0.nr_zone_inactive_anon
663.00 ? 39% +179.7% 1854 ? 25% numa-vmstat.node1.nr_active_anon
9426 ?130% +248.2% 32821 ? 60% numa-vmstat.node1.nr_anon_pages
9914 ?131% +368.5% 46447 ? 43% numa-vmstat.node1.nr_inactive_anon
764.00 ? 11% +51.7% 1159 ? 20% numa-vmstat.node1.nr_page_table_pages
1162 ? 58% +1233.7% 15500 ? 77% numa-vmstat.node1.nr_shmem
663.17 ? 39% +179.6% 1854 ? 25% numa-vmstat.node1.nr_zone_active_anon
9920 ?131% +368.3% 46454 ? 43% numa-vmstat.node1.nr_zone_inactive_anon
9.08 ? 2% +22.4% 11.12 ? 12% perf-stat.i.MPKI
1.303e+10 -4.7% 1.242e+10 perf-stat.i.branch-instructions
0.39 +0.1 0.50 ? 28% perf-stat.i.branch-miss-rate%
1.873e+08 +8.2% 2.027e+08 perf-stat.i.cache-misses
5.924e+08 +7.4% 6.365e+08 perf-stat.i.cache-references
5087 -2.6% 4957 perf-stat.i.context-switches
6.02 +5.2% 6.33 perf-stat.i.cpi
2001 -6.8% 1865 perf-stat.i.cycles-between-cache-misses
0.05 ? 19% +0.0 0.07 ? 24% perf-stat.i.dTLB-load-miss-rate%
1.544e+10 -4.2% 1.479e+10 perf-stat.i.dTLB-loads
5.401e+09 -2.9% 5.247e+09 perf-stat.i.dTLB-stores
24692490 +6.3% 26255880 perf-stat.i.iTLB-load-misses
5.933e+10 -4.4% 5.67e+10 perf-stat.i.instructions
1.89 ? 3% -9.9% 1.70 ? 5% perf-stat.i.major-faults
179.72 -4.0% 172.60 perf-stat.i.metric.M/sec
21954317 +7.3% 23554489 perf-stat.i.node-load-misses
12269278 -3.4% 11855032 perf-stat.i.node-store-misses
14458012 -3.3% 13976704 perf-stat.i.node-stores
10.01 +12.0% 11.21 perf-stat.overall.MPKI
0.32 +0.0 0.34 ? 4% perf-stat.overall.branch-miss-rate%
7.28 +5.3% 7.67 perf-stat.overall.cpi
2307 -7.0% 2146 perf-stat.overall.cycles-between-cache-misses
2408 -10.0% 2167 perf-stat.overall.instructions-per-iTLB-miss
0.14 -5.0% 0.13 perf-stat.overall.ipc
36.46 +2.1 38.51 perf-stat.overall.node-load-miss-rate%
1.346e+10 -4.6% 1.285e+10 perf-stat.ps.branch-instructions
1.936e+08 +8.2% 2.095e+08 perf-stat.ps.cache-misses
6.14e+08 +7.2% 6.579e+08 perf-stat.ps.cache-references
5172 -2.6% 5038 perf-stat.ps.context-switches
1.597e+10 -4.1% 1.531e+10 perf-stat.ps.dTLB-loads
5.575e+09 -2.9% 5.411e+09 perf-stat.ps.dTLB-stores
25461461 +6.3% 27074171 perf-stat.ps.iTLB-load-misses
6.131e+10 -4.3% 5.867e+10 perf-stat.ps.instructions
1.76 ? 2% -5.0% 1.68 perf-stat.ps.major-faults
22688997 +7.3% 24343963 perf-stat.ps.node-load-misses
12711574 -3.5% 12262917 perf-stat.ps.node-store-misses
14988174 -3.4% 14480039 perf-stat.ps.node-stores
0.01 ? 11% +87.2% 0.01 ? 18% perf-sched.sch_delay.avg.ms.__sched_text_start.__sched_text_start.smpboot_thread_fn.kthread.ret_from_fork
0.00 ?114% +353.6% 0.02 ? 47% perf-sched.sch_delay.max.ms.__sched_text_start.__sched_text_start.exit_to_user_mode_prepare.syscall_exit_to_user_mode.entry_SYSCALL_64_after_hwframe
0.02 ? 17% +338.9% 0.08 ? 62% perf-sched.sch_delay.max.ms.__sched_text_start.__sched_text_start.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop
0.02 ? 49% +3.1e+05% 76.43 ?185% perf-sched.sch_delay.max.ms.__sched_text_start.__sched_text_start.schedule_timeout.kcompactd.kthread
0.68 ?212% +983.9% 7.35 ? 37% perf-sched.sch_delay.max.ms.__sched_text_start.__sched_text_start.smpboot_thread_fn.kthread.ret_from_fork
3.16 ? 75% +8842.7% 282.23 ?218% perf-sched.sch_delay.max.ms.__sched_text_start.__sched_text_start.worker_thread.kthread.ret_from_fork
197.84 ? 7% -23.7% 150.97 ? 18% perf-sched.total_wait_and_delay.average.ms
13780 ? 6% +75.6% 24203 ? 27% perf-sched.total_wait_and_delay.count.ms
197.81 ? 7% -23.7% 150.88 ? 18% perf-sched.total_wait_time.average.ms
2.31 ?100% +286.1% 8.90 ? 38% perf-sched.wait_and_delay.avg.ms.__sched_text_start.__sched_text_start.devkmsg_read.vfs_read.ksys_read
2.32 ?100% +284.4% 8.91 ? 38% perf-sched.wait_and_delay.avg.ms.__sched_text_start.__sched_text_start.do_syslog.part.0
215.21 ? 4% -62.0% 81.72 ? 21% perf-sched.wait_and_delay.avg.ms.__sched_text_start.__sched_text_start.do_task_dead.do_exit.do_group_exit
0.91 ? 4% +145.7% 2.25 ? 5% perf-sched.wait_and_delay.avg.ms.__sched_text_start.__sched_text_start.do_wait.kernel_wait4.__do_sys_wait4
344.42 -92.3% 26.39 ? 33% perf-sched.wait_and_delay.avg.ms.__sched_text_start.__sched_text_start.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop
478.60 -36.5% 304.11 ? 31% perf-sched.wait_and_delay.avg.ms.__sched_text_start.__sched_text_start.schedule_timeout.kcompactd.kthread
655.59 ? 3% +9.7% 719.00 ? 2% perf-sched.wait_and_delay.avg.ms.__sched_text_start.__sched_text_start.smpboot_thread_fn.kthread.ret_from_fork
167.67 ?124% +426.7% 883.17 ? 12% perf-sched.wait_and_delay.count.__sched_text_start.__sched_text_start.devkmsg_read.vfs_read.ksys_read
11.17 ? 3% +11.9% 12.50 ? 4% perf-sched.wait_and_delay.count.__sched_text_start.__sched_text_start.do_nanosleep.hrtimer_nanosleep.__x64_sys_nanosleep
167.50 ?124% +427.1% 882.83 ? 12% perf-sched.wait_and_delay.count.__sched_text_start.__sched_text_start.do_syslog.part.0
303.50 +169.0% 816.50 ? 14% perf-sched.wait_and_delay.count.__sched_text_start.__sched_text_start.do_task_dead.do_exit.do_group_exit
302.17 +128.4% 690.17 perf-sched.wait_and_delay.count.__sched_text_start.__sched_text_start.do_wait.kernel_wait4.__do_sys_wait4
323.33 ? 28% -87.0% 42.00 ?141% perf-sched.wait_and_delay.count.__sched_text_start.__sched_text_start.preempt_schedule_common._cond_resched.truncate_inode_pages_range
185.67 ?115% +385.9% 902.17 ? 12% perf-sched.wait_and_delay.count.__sched_text_start.__sched_text_start.schedule_hrtimeout_range_clock.ep_poll.do_epoll_wait
26.17 +2201.9% 602.33 perf-sched.wait_and_delay.count.__sched_text_start.__sched_text_start.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop
79.33 +71.2% 135.83 ? 32% perf-sched.wait_and_delay.count.__sched_text_start.__sched_text_start.schedule_timeout.kcompactd.kthread
974.50 ? 19% +390.9% 4783 ? 70% perf-sched.wait_and_delay.count.__sched_text_start.__sched_text_start.worker_thread.kthread.ret_from_fork
5.23 ?104% +63694.8% 3339 ? 66% perf-sched.wait_and_delay.max.ms.__sched_text_start.__sched_text_start.devkmsg_read.vfs_read.ksys_read
5.23 ?104% +63704.2% 3339 ? 66% perf-sched.wait_and_delay.max.ms.__sched_text_start.__sched_text_start.do_syslog.part.0
21.62 ? 3% +4530.4% 1001 perf-sched.wait_and_delay.max.ms.__sched_text_start.__sched_text_start.do_wait.kernel_wait4.__do_sys_wait4
1019 +227.9% 3342 ? 66% perf-sched.wait_and_delay.max.ms.__sched_text_start.__sched_text_start.pipe_read.new_sync_read.vfs_read
333.34 ? 70% +963.4% 3544 ? 61% perf-sched.wait_and_delay.max.ms.__sched_text_start.__sched_text_start.schedule_hrtimeout_range_clock.ep_poll.do_epoll_wait
1000 +365.0% 4654 ? 71% perf-sched.wait_and_delay.max.ms.__sched_text_start.__sched_text_start.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop
5.03 +4104.3% 211.64 ? 41% perf-sched.wait_and_delay.max.ms.__sched_text_start.__sched_text_start.schedule_timeout.rcu_gp_kthread.kthread
3.46 ? 44% +157.3% 8.90 ? 38% perf-sched.wait_time.avg.ms.__sched_text_start.__sched_text_start.devkmsg_read.vfs_read.ksys_read
3.47 ? 44% +156.2% 8.90 ? 38% perf-sched.wait_time.avg.ms.__sched_text_start.__sched_text_start.do_syslog.part.0
215.20 ? 4% -62.0% 81.71 ? 21% perf-sched.wait_time.avg.ms.__sched_text_start.__sched_text_start.do_task_dead.do_exit.do_group_exit
0.91 ? 4% +146.2% 2.24 ? 5% perf-sched.wait_time.avg.ms.__sched_text_start.__sched_text_start.do_wait.kernel_wait4.__do_sys_wait4
0.01 ?149% +1185.7% 0.09 ? 53% perf-sched.wait_time.avg.ms.__sched_text_start.__sched_text_start.exit_to_user_mode_prepare.syscall_exit_to_user_mode.entry_SYSCALL_64_after_hwframe
0.51 ? 20% +111.1% 1.08 ? 21% perf-sched.wait_time.avg.ms.__sched_text_start.__sched_text_start.preempt_schedule_common._cond_resched.truncate_inode_pages_range
0.01 ?146% +9.6e+05% 86.17 ? 70% perf-sched.wait_time.avg.ms.__sched_text_start.__sched_text_start.preempt_schedule_common._cond_resched.ww_mutex_lock
1.73 ? 16% +95.0% 3.37 ? 29% perf-sched.wait_time.avg.ms.__sched_text_start.__sched_text_start.rcu_gp_kthread.kthread.ret_from_fork
344.42 -92.3% 26.38 ? 33% perf-sched.wait_time.avg.ms.__sched_text_start.__sched_text_start.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop
478.59 -36.6% 303.45 ? 31% perf-sched.wait_time.avg.ms.__sched_text_start.__sched_text_start.schedule_timeout.kcompactd.kthread
655.58 ? 3% +9.7% 718.99 ? 2% perf-sched.wait_time.avg.ms.__sched_text_start.__sched_text_start.smpboot_thread_fn.kthread.ret_from_fork
7.49 ? 52% +44482.5% 3339 ? 66% perf-sched.wait_time.max.ms.__sched_text_start.__sched_text_start.devkmsg_read.vfs_read.ksys_read
7.49 ? 52% +44481.2% 3339 ? 66% perf-sched.wait_time.max.ms.__sched_text_start.__sched_text_start.do_syslog.part.0
21.62 ? 3% +4531.3% 1001 perf-sched.wait_time.max.ms.__sched_text_start.__sched_text_start.do_wait.kernel_wait4.__do_sys_wait4
0.02 ?145% +12809.8% 1.98 ?134% perf-sched.wait_time.max.ms.__sched_text_start.__sched_text_start.exit_to_user_mode_prepare.syscall_exit_to_user_mode.entry_SYSCALL_64_after_hwframe
1019 +227.9% 3342 ? 66% perf-sched.wait_time.max.ms.__sched_text_start.__sched_text_start.pipe_read.new_sync_read.vfs_read
0.02 ?142% +2.9e+06% 436.52 ? 67% perf-sched.wait_time.max.ms.__sched_text_start.__sched_text_start.preempt_schedule_common._cond_resched.ww_mutex_lock
413.83 ? 43% +756.5% 3544 ? 61% perf-sched.wait_time.max.ms.__sched_text_start.__sched_text_start.schedule_hrtimeout_range_clock.ep_poll.do_epoll_wait
1000 +365.1% 4654 ? 71% perf-sched.wait_time.max.ms.__sched_text_start.__sched_text_start.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop
5.02 +4118.6% 211.60 ? 41% perf-sched.wait_time.max.ms.__sched_text_start.__sched_text_start.schedule_timeout.rcu_gp_kthread.kthread
30.68 ? 3% -2.6 28.08 ? 2% perf-profile.calltrace.cycles-pp.lock_page_lruvec_irqsave.__pagevec_lru_add.lru_cache_add.add_to_page_cache_lru.page_cache_ra_unbounded
30.67 ? 3% -2.6 28.07 ? 2% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.lock_page_lruvec_irqsave.__pagevec_lru_add.lru_cache_add.add_to_page_cache_lru
30.63 ? 3% -2.6 28.03 ? 2% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.lock_page_lruvec_irqsave.__pagevec_lru_add.lru_cache_add
31.63 ? 3% -2.6 29.07 ? 2% perf-profile.calltrace.cycles-pp.__pagevec_lru_add.lru_cache_add.add_to_page_cache_lru.page_cache_ra_unbounded.generic_file_buffered_read_get_pages
31.68 ? 3% -2.6 29.11 ? 2% perf-profile.calltrace.cycles-pp.lru_cache_add.add_to_page_cache_lru.page_cache_ra_unbounded.generic_file_buffered_read_get_pages.generic_file_buffered_read
49.97 ? 2% -2.6 47.42 perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.page_cache_ra_unbounded.generic_file_buffered_read_get_pages.generic_file_buffered_read.xfs_file_buffered_aio_read
5.62 ? 5% -0.5 5.10 ? 3% perf-profile.calltrace.cycles-pp.read_pages.page_cache_ra_unbounded.generic_file_buffered_read_get_pages.generic_file_buffered_read.xfs_file_buffered_aio_read
5.57 ? 5% -0.5 5.05 ? 3% perf-profile.calltrace.cycles-pp.iomap_readahead_actor.iomap_apply.iomap_readahead.read_pages.page_cache_ra_unbounded
5.61 ? 5% -0.5 5.09 ? 3% perf-profile.calltrace.cycles-pp.iomap_readahead.read_pages.page_cache_ra_unbounded.generic_file_buffered_read_get_pages.generic_file_buffered_read
5.60 ? 5% -0.5 5.09 ? 3% perf-profile.calltrace.cycles-pp.iomap_apply.iomap_readahead.read_pages.page_cache_ra_unbounded.generic_file_buffered_read_get_pages
5.34 ? 5% -0.5 4.83 ? 3% perf-profile.calltrace.cycles-pp.iomap_readpage_actor.iomap_readahead_actor.iomap_apply.iomap_readahead.read_pages
0.70 ? 14% +0.3 0.95 ? 9% perf-profile.calltrace.cycles-pp.try_charge.mem_cgroup_charge.__add_to_page_cache_locked.add_to_page_cache_lru.page_cache_ra_unbounded
0.38 ? 71% +0.3 0.70 ? 9% perf-profile.calltrace.cycles-pp.page_counter_try_charge.try_charge.mem_cgroup_charge.__add_to_page_cache_locked.add_to_page_cache_lru
0.00 +3.5 3.52 ? 9% perf-profile.calltrace.cycles-pp.memcg_check_events.mem_cgroup_charge.__add_to_page_cache_locked.add_to_page_cache_lru.page_cache_ra_unbounded
38.12 ? 2% +3.8 41.90 ? 2% perf-profile.calltrace.cycles-pp.add_to_page_cache_lru.page_cache_ra_unbounded.generic_file_buffered_read_get_pages.generic_file_buffered_read.xfs_file_buffered_aio_read
5.32 ? 19% +6.3 11.60 ? 6% perf-profile.calltrace.cycles-pp.mem_cgroup_charge.__add_to_page_cache_locked.add_to_page_cache_lru.page_cache_ra_unbounded.generic_file_buffered_read_get_pages
6.41 ? 17% +6.3 12.76 ? 5% perf-profile.calltrace.cycles-pp.__add_to_page_cache_locked.add_to_page_cache_lru.page_cache_ra_unbounded.generic_file_buffered_read_get_pages.generic_file_buffered_read
76.16 ? 2% -5.1 71.05 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
38.16 ? 5% -3.5 34.66 ? 3% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
50.11 ? 2% -2.6 47.51 perf-profile.children.cycles-pp.__alloc_pages_nodemask
30.76 ? 3% -2.6 28.17 ? 2% perf-profile.children.cycles-pp.lock_page_lruvec_irqsave
31.70 ? 3% -2.6 29.14 ? 2% perf-profile.children.cycles-pp.__pagevec_lru_add
31.72 ? 3% -2.6 29.16 ? 2% perf-profile.children.cycles-pp.lru_cache_add
4.13 ? 16% -1.4 2.69 ? 19% perf-profile.children.cycles-pp.get_page_from_freelist
3.98 ? 17% -1.4 2.55 ? 20% perf-profile.children.cycles-pp.rmqueue
3.73 ? 18% -1.4 2.32 ? 22% perf-profile.children.cycles-pp.rmqueue_bulk
3.60 ? 18% -1.4 2.19 ? 22% perf-profile.children.cycles-pp._raw_spin_lock
5.62 ? 5% -0.5 5.09 ? 2% perf-profile.children.cycles-pp.read_pages
5.57 ? 5% -0.5 5.05 ? 3% perf-profile.children.cycles-pp.iomap_readahead_actor
5.61 ? 5% -0.5 5.09 ? 3% perf-profile.children.cycles-pp.iomap_readahead
5.60 ? 5% -0.5 5.08 ? 3% perf-profile.children.cycles-pp.iomap_apply
5.36 ? 5% -0.5 4.85 ? 3% perf-profile.children.cycles-pp.iomap_readpage_actor
2.93 ? 5% -0.3 2.65 ? 3% perf-profile.children.cycles-pp.memset_erms
2.23 ? 5% -0.2 2.01 ? 3% perf-profile.children.cycles-pp.iomap_set_range_uptodate
0.73 ? 2% -0.1 0.67 perf-profile.children.cycles-pp.__list_del_entry_valid
0.06 ? 11% +0.0 0.08 ? 7% perf-profile.children.cycles-pp.uncharge_page
0.31 ? 7% +0.0 0.35 ? 4% perf-profile.children.cycles-pp.__count_memcg_events
0.18 ? 10% +0.0 0.23 ? 4% perf-profile.children.cycles-pp.mem_cgroup_charge_statistics
0.41 ? 7% +0.0 0.46 ? 3% perf-profile.children.cycles-pp.uncharge_batch
0.16 ? 14% +0.1 0.24 ? 7% perf-profile.children.cycles-pp.propagate_protected_usage
0.51 ? 7% +0.1 0.60 ? 4% perf-profile.children.cycles-pp.__mod_memcg_state
0.97 ? 8% +0.1 1.10 ? 3% perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
0.68 ? 14% +0.2 0.93 ? 6% perf-profile.children.cycles-pp.page_counter_try_charge
0.91 ? 14% +0.3 1.25 ? 6% perf-profile.children.cycles-pp.try_charge
0.05 ? 8% +3.5 3.55 ? 9% perf-profile.children.cycles-pp.memcg_check_events
38.12 ? 2% +3.8 41.91 ? 2% perf-profile.children.cycles-pp.add_to_page_cache_lru
5.34 ? 19% +6.3 11.62 ? 6% perf-profile.children.cycles-pp.mem_cgroup_charge
6.42 ? 17% +6.4 12.77 ? 5% perf-profile.children.cycles-pp.__add_to_page_cache_locked
76.16 ? 2% -5.1 71.05 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
2.91 ? 5% -0.3 2.63 ? 3% perf-profile.self.cycles-pp.memset_erms
2.20 ? 5% -0.2 1.98 ? 3% perf-profile.self.cycles-pp.iomap_set_range_uptodate
0.20 ? 8% -0.0 0.17 ? 4% perf-profile.self.cycles-pp.rmqueue_bulk
0.13 ? 4% -0.0 0.11 ? 6% perf-profile.self.cycles-pp.__remove_mapping
0.06 ? 11% +0.0 0.08 ? 7% perf-profile.self.cycles-pp.uncharge_page
0.31 ? 7% +0.0 0.35 ? 4% perf-profile.self.cycles-pp.__count_memcg_events
0.16 ? 14% +0.1 0.24 ? 9% perf-profile.self.cycles-pp.propagate_protected_usage
0.23 ? 13% +0.1 0.32 ? 6% perf-profile.self.cycles-pp.try_charge
0.51 ? 7% +0.1 0.60 ? 3% perf-profile.self.cycles-pp.__mod_memcg_state
0.58 ? 13% +0.2 0.75 ? 6% perf-profile.self.cycles-pp.page_counter_try_charge
2.25 ? 25% +1.8 4.05 ? 6% perf-profile.self.cycles-pp.mem_cgroup_charge
0.01 ?223% +3.5 3.54 ? 9% perf-profile.self.cycles-pp.memcg_check_events



vm-scalability.median

134000 +------------------------------------------------------------------+
| +. |
132000 |-+ .+ .+ + .+ + + + |
| + : +.++.+ : + + : + : + :+ : |
130000 |-+: : + : + : + : : : : + : |
|.+ ++ O O :.+ +.+ O +.+.+O + O + O |
128000 |-+ O + O |
| O O O O O O O O |
126000 |-+ O O O O O O |
| O O O O O |
124000 |-+ O O O O O O |
| O O |
122000 |-+ |
| O O |
120000 +------------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Oliver Sang


Attachments:
(No filename) (25.88 kB)
config-5.11.0-00002-g4f09feb8bf08 (175.11 kB)
job-script (7.95 kB)
job.yaml (5.17 kB)
reproduce (18.25 kB)

2021-03-03 00:37:27

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mm: Force update of mem cgroup soft limit tree on usage excess

On Fri 26-02-21 16:56:28, Tim Chen wrote:
>
>
> On 2/26/21 12:52 AM, Michal Hocko wrote:
>
> >>
> >> Michal,
> >>
> >> Let's take an extreme case where memcg 1 always generates the
> >> first event and memcg 2 generates the rest of the 128*8-1 events,
> >> and the pattern repeats.
> >
> > I do not follow. Events are per-memcg, aren't they?
> > __this_cpu_read(memcg->vmstats_percpu->targets[target]);
> > [...]
> > __this_cpu_write(memcg->vmstats_percpu->targets[target], next);
> >
>
> You are right. My previous reasoning is incorrect as the sampling is done per memcg.
> I'll do some additional debugging on why memcg is not on the tree.

OK, thanks for the confirmation. I think we want to do two things:
remove the soft limit specific threshold and stay with a single one, and
recognize THPs.
--
Michal Hocko
SUSE Labs

2021-03-05 00:53:33

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] mm: Fix dropped memcg from mem cgroup soft limit tree



On 2/18/21 11:13 AM, Michal Hocko wrote:

>
> Fixes: 4e41695356fb ("memory controller: soft limit reclaim on contention")
> Acked-by: Michal Hocko <[email protected]>
>
> Thanks!
>> ---
>> mm/memcontrol.c | 6 +++++-
>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index ed5cc78a8dbf..a51bf90732cb 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3505,8 +3505,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>> loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
>> break;
>> } while (!nr_reclaimed);
>> - if (next_mz)
>> + if (next_mz) {
>> + spin_lock_irq(&mctz->lock);
>> + __mem_cgroup_insert_exceeded(next_mz, mctz, excess);
>> + spin_unlock_irq(&mctz->lock);
>> css_put(&next_mz->memcg->css);
>> + }
>> return nr_reclaimed;
>> }
>>
>> --
>> 2.20.1
>

Mel,

Reviewing this patch a bit more, I realize that there is a chance that the removed
next_mz could be inserted back into the tree by a memcg_check_events
that happens in between. So we need to make sure that next_mz
is indeed off the tree and update the excess value before adding it
back. I've updated the patch to the one below.

Thanks.

Tim
---

From 412764d1fad219b04c77bcb1cc8161067c8424f2 Mon Sep 17 00:00:00 2001
From: Tim Chen <[email protected]>
Date: Tue, 2 Feb 2021 15:53:21 -0800
Subject: [PATCH v3] mm: Fix dropped memcg from mem cgroup soft limit tree
To: Andrew Morton <[email protected]>, Johannes Weiner <[email protected]>, Michal Hocko <[email protected]>,Vladimir Davydov <[email protected]>
Cc: Dave Hansen <[email protected]>, Ying Huang <[email protected]>, [email protected], [email protected], [email protected]

During soft limit memory reclaim, we will temporarily remove the target
mem cgroup from the cgroup soft limit tree. We then perform memory
reclaim, update the memory usage excess count and re-insert the mem
cgroup back into the mem cgroup soft limit tree according to the new
memory usage excess count.

However, when memory reclaim failed for a maximum number of attempts
and we bail out of the reclaim loop, we forgot to put the target mem
cgroup chosen for next reclaim back to the soft limit tree. This prevented
pages in the mem cgroup from being reclaimed in the future even though
the mem cgroup exceeded its soft limit. Fix the logic and put the mem
cgroup back on the tree when page reclaim failed for the mem cgroup.

Fixes: 4e41695356fb ("memory controller: soft limit reclaim on contention")
---
mm/memcontrol.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ed5cc78a8dbf..bc9cc73ff66b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3505,8 +3505,18 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
break;
} while (!nr_reclaimed);
- if (next_mz)
+ if (next_mz) {
+ /*
+ * next_mz was removed in __mem_cgroup_largest_soft_limit_node.
+ * Put it back in tree with latest excess value.
+ */
+ spin_lock_irq(&mctz->lock);
+ __mem_cgroup_remove_exceeded(next_mz, mctz);
+ excess = soft_limit_excess(next_mz->memcg);
+ __mem_cgroup_insert_exceeded(next_mz, mctz, excess);
+ spin_unlock_irq(&mctz->lock);
css_put(&next_mz->memcg->css);
+ }
return nr_reclaimed;
}

--
2.20.1

2021-03-05 09:13:26

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] mm: Fix dropped memcg from mem cgroup soft limit tree

On Thu 04-03-21 09:35:08, Tim Chen wrote:
>
>
> On 2/18/21 11:13 AM, Michal Hocko wrote:
>
> >
> > Fixes: 4e41695356fb ("memory controller: soft limit reclaim on contention")
> > Acked-by: Michal Hocko <[email protected]>
> >
> > Thanks!
> >> ---
> >> mm/memcontrol.c | 6 +++++-
> >> 1 file changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index ed5cc78a8dbf..a51bf90732cb 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -3505,8 +3505,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> >> loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> >> break;
> >> } while (!nr_reclaimed);
> >> - if (next_mz)
> >> + if (next_mz) {
> >> + spin_lock_irq(&mctz->lock);
> >> + __mem_cgroup_insert_exceeded(next_mz, mctz, excess);
> >> + spin_unlock_irq(&mctz->lock);
> >> css_put(&next_mz->memcg->css);
> >> + }
> >> return nr_reclaimed;
> >> }
> >>
> >> --
> >> 2.20.1
> >
>
> Mel,
>
> Reviewing this patch a bit more, I realize that there is a chance that the removed
> next_mz could be inserted back to the tree from a memcg_check_events
> that happens in between. So we need to make sure that the next_mz
> is indeed off the tree and update the excess value before adding it
> back. Update the patch to the patch below.

This scenario is certainly possible but it shouldn't really matter much
as __mem_cgroup_insert_exceeded bails out when the node is on the tree
already.
--
Michal Hocko
SUSE Labs

2021-03-05 19:10:08

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] mm: Fix dropped memcg from mem cgroup soft limit tree



On 3/5/21 1:11 AM, Michal Hocko wrote:
> On Thu 04-03-21 09:35:08, Tim Chen wrote:
>>
>>
>> On 2/18/21 11:13 AM, Michal Hocko wrote:
>>
>>>
>>> Fixes: 4e41695356fb ("memory controller: soft limit reclaim on contention")
>>> Acked-by: Michal Hocko <[email protected]>
>>>
>>> Thanks!
>>>> ---
>>>> mm/memcontrol.c | 6 +++++-
>>>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>> index ed5cc78a8dbf..a51bf90732cb 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -3505,8 +3505,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>>>> loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
>>>> break;
>>>> } while (!nr_reclaimed);
>>>> - if (next_mz)
>>>> + if (next_mz) {
>>>> + spin_lock_irq(&mctz->lock);
>>>> + __mem_cgroup_insert_exceeded(next_mz, mctz, excess);
>>>> + spin_unlock_irq(&mctz->lock);
>>>> css_put(&next_mz->memcg->css);
>>>> + }
>>>> return nr_reclaimed;
>>>> }
>>>>
>>>> --
>>>> 2.20.1
>>>
>>
>> Mel,
>>
>> Reviewing this patch a bit more, I realize that there is a chance that the removed
>> next_mz could be inserted back to the tree from a memcg_check_events
>> that happens in between. So we need to make sure that the next_mz
>> is indeed off the tree and update the excess value before adding it
>> back. Update the patch to the patch below.
>
> This scenario is certainly possible but it shouldn't really matter much
> as __mem_cgroup_insert_exceeded bails out when the node is on the tree
> already.
>

Makes sense. We should still update the excess value with

+ excess = soft_limit_excess(next_mz->memcg);
+ __mem_cgroup_insert_exceeded(next_mz, mctz, excess);

before doing the insertion. The excess value was recorded from the previous
mz in the loop and needs to be updated to that of next_mz.
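
i.e. keeping your simplification (no explicit remove), something like
this untested snippet:

	if (next_mz) {
		spin_lock_irq(&mctz->lock);
		excess = soft_limit_excess(next_mz->memcg);
		__mem_cgroup_insert_exceeded(next_mz, mctz, excess);
		spin_unlock_irq(&mctz->lock);
		css_put(&next_mz->memcg->css);
	}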

Tim

2021-03-08 08:48:23

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] mm: Fix dropped memcg from mem cgroup soft limit tree

On Fri 05-03-21 11:07:59, Tim Chen wrote:
>
>
> On 3/5/21 1:11 AM, Michal Hocko wrote:
> > On Thu 04-03-21 09:35:08, Tim Chen wrote:
> >>
> >>
> >> On 2/18/21 11:13 AM, Michal Hocko wrote:
> >>
> >>>
> >>> Fixes: 4e41695356fb ("memory controller: soft limit reclaim on contention")
> >>> Acked-by: Michal Hocko <[email protected]>
> >>>
> >>> Thanks!
> >>>> ---
> >>>> mm/memcontrol.c | 6 +++++-
> >>>> 1 file changed, 5 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>>> index ed5cc78a8dbf..a51bf90732cb 100644
> >>>> --- a/mm/memcontrol.c
> >>>> +++ b/mm/memcontrol.c
> >>>> @@ -3505,8 +3505,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> >>>> loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> >>>> break;
> >>>> } while (!nr_reclaimed);
> >>>> - if (next_mz)
> >>>> + if (next_mz) {
> >>>> + spin_lock_irq(&mctz->lock);
> >>>> + __mem_cgroup_insert_exceeded(next_mz, mctz, excess);
> >>>> + spin_unlock_irq(&mctz->lock);
> >>>> css_put(&next_mz->memcg->css);
> >>>> + }
> >>>> return nr_reclaimed;
> >>>> }
> >>>>
> >>>> --
> >>>> 2.20.1
> >>>
> >>
> >> Mel,
> >>
> >> Reviewing this patch a bit more, I realize that there is a chance that the removed
> >> next_mz could be inserted back to the tree from a memcg_check_events
> >> that happens in between. So we need to make sure that the next_mz
> >> is indeed off the tree and update the excess value before adding it
> >> back. Update the patch to the patch below.
> >
> > This scenario is certainly possible but it shouldn't really matter much
> > as __mem_cgroup_insert_exceeded bails out when the node is on the tree
> > already.
> >
>
> Makes sense. We should still update the excess value with
>
> + excess = soft_limit_excess(next_mz->memcg);
> + __mem_cgroup_insert_exceeded(next_mz, mctz, excess);
>
> before doing insertion. The excess value was recorded from previous
> mz in the loop and needs to be updated to that of next_mz.

Yes. Sorry, I had missed that part previously.
--
Michal Hocko
SUSE Labs