From: Zhaoyang Huang <[email protected]>
The current memcg protection via min, low and high requires an up-front
estimate of the protected entity, which can be hard on some systems.
Furthermore, usage can vary widely across scenarios (imagine still
protecting 50M when usage changes from 100M to 300M), which makes the
protection less meaningful. So introduce proportional protection over the
memcg's highest-ever usage (watermark) to overcome the above constraints.
Signed-off-by: Zhaoyang Huang <[email protected]>
---
include/linux/page_counter.h | 3 +++
mm/memcontrol.c | 17 +++++++++++++----
2 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 6795913..7762629 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -27,6 +27,9 @@ struct page_counter {
unsigned long watermark;
unsigned long failcnt;
+ /* proportional protection */
+ unsigned long min_prop;
+ unsigned long low_prop;
/*
* 'parent' is placed here to be far from 'usage' to reduce
* cache false sharing, as 'usage' is written mostly while
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 508bcea..937c6ce 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6616,6 +6616,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
{
unsigned long usage, parent_usage;
struct mem_cgroup *parent;
+ unsigned long memcg_emin, memcg_elow, parent_emin, parent_elow;
if (mem_cgroup_disabled())
return;
@@ -6650,14 +6651,22 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
parent_usage = page_counter_read(&parent->memory);
+ /* prefer proportional protection when set; 1024 represents 100% */
+ memcg_emin = READ_ONCE(memcg->memory.min_prop) ?
+ READ_ONCE(memcg->memory.min_prop) * READ_ONCE(memcg->memory.watermark) / 1024 : READ_ONCE(memcg->memory.min);
+ memcg_elow = READ_ONCE(memcg->memory.low_prop) ?
+ READ_ONCE(memcg->memory.low_prop) * READ_ONCE(memcg->memory.watermark) / 1024 : READ_ONCE(memcg->memory.low);
+ parent_emin = READ_ONCE(parent->memory.min_prop) ?
+ READ_ONCE(parent->memory.min_prop) * READ_ONCE(parent->memory.watermark) / 1024 : READ_ONCE(parent->memory.emin);
+ parent_elow = READ_ONCE(parent->memory.low_prop) ?
+ READ_ONCE(parent->memory.low_prop) * READ_ONCE(parent->memory.watermark) / 1024 : READ_ONCE(parent->memory.elow);
+
WRITE_ONCE(memcg->memory.emin, effective_protection(usage, parent_usage,
- READ_ONCE(memcg->memory.min),
- READ_ONCE(parent->memory.emin),
+ memcg_emin, parent_emin,
atomic_long_read(&parent->memory.children_min_usage)));
WRITE_ONCE(memcg->memory.elow, effective_protection(usage, parent_usage,
- READ_ONCE(memcg->memory.low),
- READ_ONCE(parent->memory.elow),
+ memcg_elow, parent_elow,
atomic_long_read(&parent->memory.children_low_usage)));
}
--
1.9.1
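
To make the arithmetic in the patch above concrete, here is a minimal
userspace sketch (not kernel code). The helper name prop_protection() and
the PROP_SCALE macro are hypothetical; only the "prop * watermark / 1024,
else fall back to the fixed value" rule is taken from the diff. The range
check is an assumption of this sketch; the patch itself does not validate
min_prop or low_prop.

```c
#include <assert.h>

/* 1024 represents 100%, as stated in the patch comment. */
#define PROP_SCALE 1024UL

/*
 * Hypothetical helper mirroring the ternary in the diff:
 * a non-zero proportion scales the watermark (highest-ever usage, in
 * pages); a zero proportion falls back to the fixed min/low value.
 */
static unsigned long prop_protection(unsigned long prop,
                                     unsigned long watermark,
                                     unsigned long fixed)
{
    /* Assumption of this sketch: more than 100% is not meaningful. */
    assert(prop <= PROP_SCALE);

    /* On 64-bit, prop * watermark does not overflow for realistic sizes. */
    return prop ? prop * watermark / PROP_SCALE : fixed;
}
```

Assuming 4 KiB pages, 100M is 25600 pages and 300M is 76800 pages. With
prop = 512 (50%), the protection follows the watermark from 12800 pages
(50M) to 38400 pages (150M), whereas a fixed setting of 12800 pages stays
at 50M in both cases, which is the scenario described in the commit
message.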
On Fri 25-03-22 11:08:00, Zhaoyang Huang wrote:
> On Fri, Mar 25, 2022 at 11:02 AM Zhaoyang Huang <[email protected]> wrote:
> >
> > On Thu, Mar 24, 2022 at 10:27 PM Chris Down <[email protected]> wrote:
> > >
> > > I'm confused by the aims of this patch. We already have proportional reclaim
> > > for memory.min and memory.low, and memory.high is already "proportional" by its
> > > nature to drive memory back down behind the configured threshold.
> > >
> > > Could you please be more clear about what you're trying to achieve and in what
> > > way the existing proportional reclaim mechanisms are insufficient for you?
>
> sorry for the bad formatting of previous reply, resend it in new format
>
> What I am trying to solve is that, the memcg's protection judgment[1]
> is based on a set of fixed value on current design, while the real
> scan and reclaim number[2] is based on the proportional min/low on the
> real memory usage which you mentioned above. Fixed value setting has
> some constraints as
> 1. It is an experienced value based on observation, which could be inaccurate.
> 2. working load is various from scenarios.
> 3. fixed value from [1] could be against the dynamic cgroup_size in [2].
Could you elaborate some more on those points? I guess providing an
example of how you would use the new interface would be helpful.
--
Michal Hocko
SUSE Labs
On Fri, Mar 25, 2022 at 12:23 AM Roman Gushchin
<[email protected]> wrote:
>
> It seems like what’s being proposed is an ability to express the protection in % of the current usage rather than an absolute number.
> It’s an equivalent for something like a memory (reclaim) priority: e.g. a cgroup with 80% protection is _always_ reclaimed less aggressively than one with a 20% protection.
>
> That said, I’m not a fan of this idea.
> It might make sense in some reasonable range of usages, but if your workload is simply leaking memory and growing indefinitely, protecting it seems like a bad idea. And the first part can be easily achieved using an userspace tool.
>
> Thanks!
>
> > On Mar 24, 2022, at 7:33 AM, Chris Down <[email protected]> wrote:
> >
> > I'm confused by the aims of this patch. We already have proportional reclaim for memory.min and memory.low, and memory.high is already "proportional" by its nature to drive memory back down behind the configured threshold.
> >
> > Could you please be more clear about what you're trying to achieve and in what way the existing proportional reclaim mechanisms are insufficient for you?
OK, I think the memory leak concern is fixable. Please refer to my reply
to Chris's comment for more explanation.
On Fri, Mar 25, 2022 at 11:02 AM Zhaoyang Huang <[email protected]> wrote:
>
> On Thu, Mar 24, 2022 at 10:27 PM Chris Down <[email protected]> wrote:
> >
> > I'm confused by the aims of this patch. We already have proportional reclaim
> > for memory.min and memory.low, and memory.high is already "proportional" by its
> > nature to drive memory back down behind the configured threshold.
> >
> > Could you please be more clear about what you're trying to achieve and in what
> > way the existing proportional reclaim mechanisms are insufficient for you?
Sorry for the bad formatting of my previous reply; resending it in a new
format.
What I am trying to solve is that the memcg's protection judgment [1]
is based on a set of fixed values in the current design, while the actual
scan and reclaim count [2] is based on min/low taken proportionally to the
real memory usage, as you mentioned above. Fixed values have some
constraints:
1. They are empirical values based on observation, which could be inaccurate.
2. The workload varies across scenarios.
3. The fixed value in [1] could conflict with the dynamic cgroup_size in [2].
shrink_node_memcgs
  [1] check if the memcg is protected based on fixed min/low value
  mem_cgroup_calculate_protection(target_memcg, memcg);
  if (mem_cgroup_below_min(memcg))
      ...
  else if (mem_cgroup_below_low(memcg))
      ...
  [2] calculate the scan size proportionally
  shrink_lruvec
    get_scan_count
      mem_cgroup_protection
      scan = lruvec_size - lruvec_size * protection /
             (cgroup_size + 1);
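
The scan formula in [2] can be tried out with a small userspace sketch.
The function name scan_target() is hypothetical, and the clamp of
cgroup_size to at least protection is an assumption added here so the
unsigned subtraction cannot wrap; only the formula itself is taken from
the trace above.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper: pages to scan from one lruvec, given the
 * effective protection and the cgroup's current size (all in pages).
 */
static unsigned long scan_target(unsigned long lruvec_size,
                                 unsigned long protection,
                                 unsigned long cgroup_size)
{
    /* Guard the multiplication below against overflow. */
    assert(protection == 0 || lruvec_size <= ULONG_MAX / protection);

    /* Assumption of this sketch: never let protection exceed the size. */
    if (cgroup_size < protection)
        cgroup_size = protection;

    /* Formula from [2]: the protected share of the lruvec is skipped. */
    return lruvec_size - lruvec_size * protection / (cgroup_size + 1);
}
```

For a 10000-page lruvec and a fixed protection of 12800 pages, the scan
target is 5001 pages when cgroup_size is 25600 and 8334 pages when it grows
to 76800. If the protection instead grew with usage to 38400 pages, the
target at 76800 would again be 5001. This only illustrates point 3 above;
it does not model the rest of reclaim.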