2014-01-24 11:18:07

by Roman Gushchin

Subject: Re: [RFC 0/4] memcg: Low-limit reclaim

Hi, Michal!

As you may remember, I proposed introducing low limits about a year ago.

We had a small discussion at that time: http://marc.info/?t=136195226600004 .

Since then we have been using low limits intensively in our production
environment (on thousands of machines), so I'm very interested in getting
this functionality merged upstream.

In my experience, low limits also require some changes to the memcg page
accounting policy. For instance, an application in a protected cgroup should
have a guarantee that its file cache is charged to its own cgroup and is
therefore protected by the low limit. If the file cache was created by another
application in a different cgroup, that is not necessarily the case. I've
solved this problem by implementing optional page reaccounting on page faults
and reads/writes.
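
To illustrate the idea, here is only a rough sketch of the hook on the
fault/read/write path; the helper names used here (mem_cgroup_from_task(),
page_charged_memcg(), mem_cgroup_try_recharge()) are illustrative, not
existing kernel APIs:

/*
 * Sketch: when a task touches a page that is charged to some other
 * cgroup, move the charge to the cgroup of the task actually using it,
 * so the page is protected by that cgroup's low limit.
 */
static void reaccount_page(struct page *page)
{
	struct mem_cgroup *memcg = mem_cgroup_from_task(current);

	if (page_charged_memcg(page) != memcg)	/* hypothetical accessor */
		mem_cgroup_try_recharge(page, memcg);	/* hypothetical helper */
}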

I can prepare my current version of the patchset if anyone is interested.

Regards,
Roman

On 11.12.2013 18:15, Michal Hocko wrote:
> Hi,
> previous discussions have shown that soft limits cannot be reformed
> (http://lwn.net/Articles/555249/). This series introduces an alternative
> approach to protecting memory allocated to processes executing within
> a memory cgroup controller. It is based on a new tunable that was
> discussed with Johannes and Tejun during the last kernel summit.
>
> This patchset introduces such a low limit, which is functionally similar to
> a minimum guarantee. Memcgs which are under their low limit are not considered
> eligible for reclaim (both global and hard-limit). The default value of
> the limit is 0, so all groups are eligible by default and an interested
> party has to set the limit explicitly.
>
> The primary use case is to protect an amount of memory allocated to a
> workload without it being reclaimed by an unrelated activity. In some
> cases this requirement can be fulfilled by mlock but it is not suitable
> for many loads and generally requires application awareness. Such
> application awareness can be complex. It effectively forbids the
> use of memory overcommit as the application must explicitly manage
> memory residency.
> With low limits, such workloads can be placed in a memcg with a low
> limit that protects the estimated working set.
>
> Another use case might be unreclaimable groups. Some loads might be so
> sensitive to reclaim that it is better to kill them and start again (or
> resume from a checkpoint) rather than thrash. This would be trivial with
> the low limit set to unlimited, and the OOM killer would handle the
> situation as required (e.g. kill and restart).
>
> The hierarchical behavior of the low limit is described in the first
> patch. It is followed by a direct reclaim fix which is necessary to
> handle the situation when no group is eligible because all groups are
> below their low limit. This is not a big deal for hard-limit reclaim,
> because we simply retry the reclaim a few times and then trigger the
> memcg OOM killer path. It would blow up in the global case, where we
> would loop without making any progress or triggering the OOM killer.
> I would consider a configuration leading to this state invalid, but we
> should handle it gracefully.
>
> The third patch finally allows setting the lowlimit.
>
> The last patch tries to expedite OOM if it is clear that no group is
> eligible for reclaim. It basically breaks out of the loops in direct
> reclaim and lets kswapd sleep because it wouldn't make any progress anyway.
>
> Thoughts?
>
> Short log says:
> Michal Hocko (4):
> memcg, mm: introduce lowlimit reclaim
> mm, memcg: allow OOM if no memcg is eligible during direct reclaim
> memcg: Allow setting low_limit
> mm, memcg: expedite OOM if no memcg is reclaimable
>
> And a diffstat
> include/linux/memcontrol.h | 14 +++++++++++
> include/linux/res_counter.h | 40 ++++++++++++++++++++++++++++++
> kernel/res_counter.c | 2 ++
> mm/memcontrol.c | 60 ++++++++++++++++++++++++++++++++++++++++++++-
> mm/vmscan.c | 59 +++++++++++++++++++++++++++++++++++++++++---
> 5 files changed, 170 insertions(+), 5 deletions(-)
>


2014-01-29 18:23:08

by Michal Hocko

Subject: Re: [RFC 0/4] memcg: Low-limit reclaim

On Fri 24-01-14 15:07:02, Roman Gushchin wrote:
> Hi, Michal!

Hi,

> As you may remember, I proposed introducing low limits about a year ago.
>
> We had a small discussion at that time: http://marc.info/?t=136195226600004 .

yes I remember that discussion and vaguely remember the proposed
approach. I really wanted to avoid introducing a new knob, but things
evolved differently than I planned since then, and it turned out that
the new knob is unavoidable. That's why I came up with this approach,
which is quite different from yours AFAIR.

> Since then we have been using low limits intensively in our production
> environment (on thousands of machines), so I'm very interested in getting
> this functionality merged upstream.

Have you tried to use this implementation? Would it work as well?
My very vague recollection of your patch is that it didn't cover both
global and target reclaim, and that it didn't fit into the reclaim code
very naturally because it used its own scaling method. I will have to
refresh my memory though.

> In my experience, low limits also require some changes to the memcg page
> accounting policy. For instance, an application in a protected cgroup should
> have a guarantee that its file cache is charged to its own cgroup and is
> therefore protected by the low limit. If the file cache was created by another
> application in a different cgroup, that is not necessarily the case. I've
> solved this problem by implementing optional page reaccounting on page faults
> and reads/writes.

Memory sharing is a separate issue and we should discuss that
separately.

> I can prepare my current version of the patchset if anyone is interested.

Sure, having something to compare with is always valuable.

> Regards,
> Roman
--
Michal Hocko
SUSE Labs

2014-02-12 12:40:44

by Roman Gushchin

Subject: Re: [RFC 0/4] memcg: Low-limit reclaim

Hi, Michal!

Sorry for a long reply.

At Wed, 29 Jan 2014 19:22:59 +0100,
Michal Hocko wrote:
> > As you may remember, I proposed introducing low limits about a year ago.
> >
> > We had a small discussion at that time: http://marc.info/?t=136195226600004 .
>
> yes I remember that discussion and vaguely remember the proposed
> approach. I really wanted to avoid introducing a new knob, but things
> evolved differently than I planned since then, and it turned out that
> the new knob is unavoidable. That's why I came up with this approach,
> which is quite different from yours AFAIR.
>
> > Since then we have been using low limits intensively in our production
> > environment (on thousands of machines), so I'm very interested in getting
> > this functionality merged upstream.
>
> Have you tried to use this implementation? Would it work as well?
> My very vague recollection of your patch is that it didn't cover both
> global and target reclaim, and that it didn't fit into the reclaim code
> very naturally because it used its own scaling method. I will have to
> refresh my memory though.

IMHO, the main problem with your implementation is the following:
the number of reclaimed pages is not limited at all if a cgroup is
over its low memory limit. So a significant number of pages can be
reclaimed even if the memory usage is only slightly (e.g. one page)
above the low limit.

In my case, this problem is solved by scaling the number of scanned pages.

I think an ideal solution is to limit the number of reclaimed pages by
the low limit excess value. This would allow discarding my scaling code
while preserving the strict semantics of the low limit under memory
pressure. The main problem here is how to balance scanning pressure
between cgroups and LRUs.

Maybe we should calculate the number of pages to scan in an LRU based on
the low limit excess value instead of the number of pages in the LRU...
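
To make that concrete, here is a minimal sketch of the clamping variant
(illustrative only; it reuses the RES_LOW_LIMIT counter added by the patch
below, and the function name is made up):

/*
 * Never scan more pages from a memcg's LRUs than the amount by which
 * the group currently exceeds its low limit.
 */
static unsigned long limit_scan_by_low_limit_excess(struct mem_cgroup *memcg,
						    unsigned long nr_to_scan)
{
	unsigned long long low = res_counter_read_u64(&memcg->res, RES_LOW_LIMIT);
	unsigned long long usage = res_counter_read_u64(&memcg->res, RES_USAGE);
	unsigned long excess_pages;

	if (!low)
		return nr_to_scan;	/* no low limit configured */
	if (usage <= low)
		return 0;		/* fully protected */

	/* bytes -> pages */
	excess_pages = (unsigned long)((usage - low) >> PAGE_SHIFT);

	return min(nr_to_scan, excess_pages);
}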

> > In my experience, low limits also require some changes to the memcg page
> > accounting policy. For instance, an application in a protected cgroup should
> > have a guarantee that its file cache is charged to its own cgroup and is
> > therefore protected by the low limit. If the file cache was created by another
> > application in a different cgroup, that is not necessarily the case. I've
> > solved this problem by implementing optional page reaccounting on page faults
> > and reads/writes.
>
> Memory sharing is a separate issue and we should discuss that
> separately.
>
> > I can prepare my current version of the patchset if anyone is interested.
>
> Sure, having something to compare with is always valuable.

----
Subject: [PATCH] memcg: low limits for memory cgroups

A low limit for a memory cgroup can be used to limit memory pressure on it.
If a cgroup's memory usage is under its low limit, it will not be
affected by global reclaim. As the usage approaches the low limit from
above, the reclaim speed is reduced exponentially.

Low limits don't affect soft reclaim.
Also, a cgroup with memory usage under its low limit may still be
reclaimed slowly at very low scanning priorities.
---
include/linux/memcontrol.h | 7 ++++++
include/linux/res_counter.h | 17 +++++++++++++
kernel/res_counter.c | 2 ++
mm/memcontrol.c | 60 +++++++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 9 +++++++
5 files changed, 95 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index abd0113..3905e95 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -231,6 +231,8 @@ void mem_cgroup_split_huge_fixup(struct page *head);
bool mem_cgroup_bad_page_check(struct page *page);
void mem_cgroup_print_bad_page(struct page *page);
#endif
+
+unsigned int mem_cgroup_low_limit_scale(struct mem_cgroup *memcg);
#else /* CONFIG_MEMCG */
struct mem_cgroup;

@@ -427,6 +429,11 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
struct page *newpage)
{
}
+
+static inline unsigned int mem_cgroup_low_limit_scale(struct mem_cgroup *memcg)
+{
+ return 0;
+}
#endif /* CONFIG_MEMCG */

#if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 201a697..7a16c2a 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -40,6 +40,10 @@ struct res_counter {
*/
unsigned long long soft_limit;
/*
+ * the guaranteed minimal limit of the resource
+ */
+ unsigned long long low_limit;
+ /*
* the number of unsuccessful attempts to consume the resource
*/
unsigned long long failcnt;
@@ -88,6 +92,7 @@ enum {
RES_LIMIT,
RES_FAILCNT,
RES_SOFT_LIMIT,
+ RES_LOW_LIMIT,
};

/*
@@ -224,4 +229,16 @@ res_counter_set_soft_limit(struct res_counter *cnt,
return 0;
}

+static inline int
+res_counter_set_low_limit(struct res_counter *cnt,
+ unsigned long long low_limit)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ cnt->low_limit = low_limit;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return 0;
+}
+
#endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 4aa8a30..c57daf9 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -135,6 +135,8 @@ res_counter_member(struct res_counter *counter, int member)
return &counter->failcnt;
case RES_SOFT_LIMIT:
return &counter->soft_limit;
+ case RES_LOW_LIMIT:
+ return &counter->low_limit;
};

BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53385cd..d24b768 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1883,6 +1883,46 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
NULL, "Memory cgroup out of memory");
}

+/*
+ * If a cgroup is under its low limit or close enough to it,
+ * decrease the page scanning speed.
+ *
+ * mem_cgroup_low_limit_scale() returns a number
+ * from range [0, DEF_PRIORITY - 2], which is used
+ * in the reclaim code as a scanning priority modifier.
+ *
+ * If the low limit is not set, it returns 0.
+ *
+ * usage - low_limit > usage / 8 => 0
+ * usage - low_limit > usage / 16 => 1
+ * usage - low_limit > usage / 32 => 2
+ * ...
+ * usage - low_limit > usage / (1 << DEF_PRIORITY) => DEF_PRIORITY - 3
+ * otherwise (including usage < low_limit) => DEF_PRIORITY - 2
+ *
+ */
+unsigned int mem_cgroup_low_limit_scale(struct mem_cgroup *memcg)
+{
+ unsigned long long low_limit;
+ unsigned long long usage;
+ unsigned int i;
+
+ low_limit = res_counter_read_u64(&memcg->res, RES_LOW_LIMIT);
+ if (!low_limit)
+ return 0;
+
+ usage = res_counter_read_u64(&memcg->res, RES_USAGE);
+
+ if (usage < low_limit)
+ return DEF_PRIORITY - 2;
+
+ for (i = 0; i < DEF_PRIORITY - 2; i++)
+ if (usage - low_limit > (usage >> (i + 3)))
+ break;
+
+ return i;
+}
+
static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
gfp_t gfp_mask,
unsigned long flags)
@@ -5318,6 +5358,20 @@ static int mem_cgroup_write(struct cgroup_subsys_state *css, struct cftype *cft,
else
ret = -EINVAL;
break;
+ case RES_LOW_LIMIT:
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ /*
+ * For memsw, low limits (like soft limits, see above) are
+ * hard to define semantically; for now, low limits are
+ * supported only for the controller without swap.
+ */
+ if (type == _MEM)
+ ret = res_counter_set_low_limit(&memcg->res, val);
+ else
+ ret = -EINVAL;
+ break;
default:
ret = -EINVAL; /* should be BUG() ? */
break;
@@ -6243,6 +6297,12 @@ static struct cftype mem_cgroup_files[] = {
.read_u64 = mem_cgroup_read_u64,
},
{
+ .name = "low_limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEM, RES_LOW_LIMIT),
+ .write_string = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read_u64,
+ },
+ {
.name = "soft_limit_in_bytes",
.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
.write_string = mem_cgroup_write,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b4..1d4eaac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -83,6 +83,9 @@ struct scan_control {
/* Scan (total_size >> priority) pages at once */
int priority;

+ /* If the memcg is under its low limit, do not scan it aggressively */
+ int low_limit_scale;
+
/*
* The memory cgroup that hit its limit and as a result is the
* primary target of this reclaim invocation.
@@ -2003,6 +2006,10 @@ out:
/* Look ma, no brain */
BUG();
}
+
+ if (sc->low_limit_scale)
+ scan >>= sc->low_limit_scale;
+
nr[lru] = scan;
}
}
@@ -2206,6 +2213,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)

lruvec = mem_cgroup_zone_lruvec(zone, memcg);

+ sc->low_limit_scale = mem_cgroup_low_limit_scale(memcg);
shrink_lruvec(lruvec, sc);

/*
@@ -2640,6 +2648,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
.may_swap = !noswap,
.order = 0,
.priority = 0,
+ .low_limit_scale = 0,
.target_mem_cgroup = memcg,
};
struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
--
1.8.5.3

2014-02-13 16:12:48

by Michal Hocko

Subject: Re: [RFC 0/4] memcg: Low-limit reclaim

On Wed 12-02-14 16:28:36, Roman Gushchin wrote:
> Hi, Michal!
>
> Sorry for a long reply.
>
> At Wed, 29 Jan 2014 19:22:59 +0100,
> Michal Hocko wrote:
> > > As you may remember, I proposed introducing low limits about a year ago.
> > >
> > > We had a small discussion at that time: http://marc.info/?t=136195226600004 .
> >
> > yes I remember that discussion and vaguely remember the proposed
> > approach. I really wanted to avoid introducing a new knob, but things
> > evolved differently than I planned since then, and it turned out that
> > the new knob is unavoidable. That's why I came up with this approach,
> > which is quite different from yours AFAIR.
> >
> > > Since then we have been using low limits intensively in our production
> > > environment (on thousands of machines), so I'm very interested in getting
> > > this functionality merged upstream.
> >
> > Have you tried to use this implementation? Would it work as well?
> > My very vague recollection of your patch is that it didn't cover both
> > global and target reclaim, and that it didn't fit into the reclaim code
> > very naturally because it used its own scaling method. I will have to
> > refresh my memory though.
>
> IMHO, the main problem with your implementation is the following:
> the number of reclaimed pages is not limited at all if a cgroup is
> over its low memory limit. So a significant number of pages can be
> reclaimed even if the memory usage is only slightly (e.g. one page)
> above the low limit.

Yes, but this is the same problem as with regular reclaim.
We do not have any guarantee that we will reclaim only the required
amount of memory. As the reclaim priority drops, we can overreclaim.
Global reclaim tries to avoid this problem by keeping the priority
as high as possible, and target reclaim is not a big deal because we
limit the number of reclaimed pages to the swap cluster size.

I do not see this as a practical problem for the low_limit though,
because it protects those that are below the limit, not those above it.
Small fluctuations around the limit should be tolerable.
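
(In other words, the eligibility test is a simple threshold check along
these lines; a hypothetical sketch, not the actual code from my series:)

static bool memcg_below_low_limit(struct mem_cgroup *memcg)
{
	return res_counter_read_u64(&memcg->res, RES_USAGE) <
	       res_counter_read_u64(&memcg->res, RES_LOW_LIMIT);
}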

> In my case, this problem is solved by scaling the number of scanned pages.
>
> I think an ideal solution is to limit the number of reclaimed pages by
> the low limit excess value. This would allow discarding my scaling code
> while preserving the strict semantics of the low limit under memory
> pressure. The main problem here is how to balance scanning pressure
> between cgroups and LRUs.
>
> Maybe we should calculate the number of pages to scan in an LRU based on
> the low limit excess value instead of the number of pages in the LRU...

I do not like it much and I expect other mm people to feel similarly. We
already have scan scaling based on the priority. Adding a new variable
into the picture would only make the whole thing more complicated
without a very good reason for it.

[...]
--
Michal Hocko
SUSE Labs