On Wed, Dec 11 2013, Michal Hocko wrote:
> Hi,
> previous discussions have shown that soft limits cannot be reformed
> (http://lwn.net/Articles/555249/). This series introduces an alternative
> approach to protecting memory allocated to processes executing within
> a memory cgroup controller. It is based on a new tunable that was
> discussed with Johannes and Tejun during the last kernel summit.
>
> This patchset introduces such a low limit that is functionally similar to a
> minimum guarantee. Memcgs which are under their lowlimit are not considered
> eligible for reclaim (both global and hardlimit). The default value of
> the limit is 0 so all groups are eligible by default and an interested
> party has to explicitly set the limit.
>
> The primary use case is to protect an amount of memory allocated to a
> workload without it being reclaimed by an unrelated activity. In some
> cases this requirement can be fulfilled by mlock but it is not suitable
> for many loads and generally requires application awareness. Such
> application awareness can be complex. It effectively forbids the
> use of memory overcommit as the application must explicitly manage
> memory residency.
> With low limits, such workloads can be placed in a memcg with a low
> limit that protects the estimated working set.
>
> Another use case might be unreclaimable groups. Some loads might be so
> sensitive to reclaim that it is better to kill them and start them again (or
> from a checkpoint) rather than let them thrash. This would be trivial with the
> low limit set to unlimited, and the OOM killer would handle the situation as
> required (e.g. kill and restart).
>
> The hierarchical behavior of the lowlimit is described in the first
> patch. It is followed by a direct reclaim fix which is necessary to
> handle the situation when no group is eligible because all groups are
> below their low limit. This is not a big deal for hardlimit reclaim because
> we simply retry the reclaim a few times and then trigger the memcg OOM killer
> path. It would blow up in the global case, where we would loop without
> making any progress or ever triggering the OOM killer. I would consider a
> configuration leading to this state invalid, but we should handle it gracefully.
>
> The third patch finally allows setting the lowlimit.
>
> The last patch expedites OOM if it is clear that no group is
> eligible for reclaim. It basically breaks out of the loops in direct
> reclaim and lets kswapd sleep because it wouldn't make any progress anyway.
>
> Thoughts?
>
> Short log says:
> Michal Hocko (4):
> memcg, mm: introduce lowlimit reclaim
> mm, memcg: allow OOM if no memcg is eligible during direct reclaim
> memcg: Allow setting low_limit
> mm, memcg: expedite OOM if no memcg is reclaimable
>
> And a diffstat
> include/linux/memcontrol.h | 14 +++++++++++
> include/linux/res_counter.h | 40 ++++++++++++++++++++++++++++++
> kernel/res_counter.c | 2 ++
> mm/memcontrol.c | 60 ++++++++++++++++++++++++++++++++++++++++++++-
> mm/vmscan.c | 59 +++++++++++++++++++++++++++++++++++++++++---
> 5 files changed, 170 insertions(+), 5 deletions(-)
The series looks useful. We (Google) have been using something similar.
In practice such a low_limit (or memory guarantee) doesn't nest very
well.
Example:
- parent_memcg: limit 500, low_limit 500, usage 500
    1 privately charged non-reclaimable page (e.g. mlock, slab)
  - child_memcg: limit 500, low_limit 500, usage 499
If a streaming file cache workload (e.g. sha1sum) starts gobbling up
page cache it will lead to an oom kill instead of reclaiming. One could
argue that this is working as intended because child_memcg was promised
500 but can only get 499. So child_memcg is oom killed rather than
being forced to operate below its promised low limit.
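For reference, the example above could be set up with something like the
sketch below. The memory.low_limit_in_bytes name is only assumed from the
patch titles and may not match the final interface; CGROUP_ROOT falls back
to a scratch directory, so without a real cgroupfs mount the commands merely
create ordinary files.

```shell
# Sketch of the nesting example (assumed knob name: memory.low_limit_in_bytes).
# With no cgroupfs mounted this just writes plain files under a temp dir.
CG=${CGROUP_ROOT:-$(mktemp -d)}
mkdir -p "$CG/parent_memcg/child_memcg"

# parent_memcg: limit 500M, low_limit 500M
echo $((500 << 20)) > "$CG/parent_memcg/memory.limit_in_bytes"
echo $((500 << 20)) > "$CG/parent_memcg/memory.low_limit_in_bytes"

# child_memcg: limit 500M, low_limit 500M (usage would reach 499M)
echo $((500 << 20)) > "$CG/parent_memcg/child_memcg/memory.limit_in_bytes"
echo $((500 << 20)) > "$CG/parent_memcg/child_memcg/memory.low_limit_in_bytes"
```

With both groups promised their full limit, a single non-reclaimable page
charged to parent_memcg is enough to make every page in the hierarchy
protected, which is what turns reclaim pressure into an oom kill.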
This has led to various internal workarounds like:
- don't charge any memory to interior tree nodes (e.g. parent_memcg);
  only charge memory to cgroup leaves. This gets tricky when dealing
  with reparented memory, which is inherited by the parent from a child
  during cgroup deletion.
- don't set low_limit on non-leaf groups (e.g. do not set a low limit on
  parent_memcg). This constrains the cgroup layout a bit. Some
  customers want to purchase $MEM and set up their workload with a few
  child cgroups. A system daemon hands out $MEM by setting low_limit
  for top-level containers (e.g. parent_memcg). Thereafter such
  customers are able to partition their workload with sub memcgs below
  child_memcg. Example:
      parent_memcg
           \
        child_memcg
         /       \
     server    backup
Thereafter customers often want some weak isolation between server and
backup. To avoid undesired oom kills the server/backup isolation is
provided with a softer memory guarantee (e.g. soft_limit). The soft
limit acts like the low_limit until priority becomes desperate.
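That layout and the weaker server/backup isolation might be configured
roughly as follows. $MEM is shown as a 500M placeholder; the
memory.low_limit_in_bytes name is assumed from the patch titles, while
memory.soft_limit_in_bytes is the existing memcg soft limit knob. The
scratch-directory fallback keeps the sketch runnable without cgroupfs.

```shell
# Hypothetical sketch: hard guarantee at the container level, soft
# limits for the server/backup split underneath it.
CG=${CGROUP_ROOT:-$(mktemp -d)}   # scratch dir when no cgroupfs is mounted
mkdir -p "$CG/parent_memcg/child_memcg/server" \
         "$CG/parent_memcg/child_memcg/backup"

# system daemon hands out $MEM (placeholder: 500M) to the container
echo $((500 << 20)) > "$CG/parent_memcg/memory.low_limit_in_bytes"

# weak isolation inside the container: soft limits only, so reclaim may
# still take pages from either group once priority becomes desperate
echo $((300 << 20)) > "$CG/parent_memcg/child_memcg/server/memory.soft_limit_in_bytes"
echo $((100 << 20)) > "$CG/parent_memcg/child_memcg/backup/memory.soft_limit_in_bytes"
```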
On Wed 29-01-14 11:08:46, Greg Thelen wrote:
[...]
> The series looks useful. We (Google) have been using something similar.
> In practice such a low_limit (or memory guarantee) doesn't nest very
> well.
>
> Example:
> - parent_memcg: limit 500, low_limit 500, usage 500
> 1 privately charged non-reclaimable page (e.g. mlock, slab)
> - child_memcg: limit 500, low_limit 500, usage 499
I am not sure this is a good example. Your setup basically says that not a
single page should be reclaimed. I can imagine this might be useful in
some cases and I would like to allow it, but it sounds too extreme (e.g.
a load which would start thrashing heavily once reclaim starts, so it
makes more sense to start it again rather than let it crawl - think of some
mathematical simulation which might diverge).
> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
> page cache it will lead to an oom kill instead of reclaiming.
Does it make any sense to protect all of that memory even though it is
easily reclaimable?
> One could
> argue that this is working as intended because child_memcg was promised
> 500 but can only get 499. So child_memcg is oom killed rather than
> being forced to operate below its promised low limit.
>
> This has led to various internal workarounds like:
> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
> only charge memory to cgroup leaves. This gets tricky when dealing
> with reparented memory, which is inherited by the parent from a child
> during cgroup deletion.
Do those need any protection at all?
> - don't set low_limit on non-leaf groups (e.g. do not set a low limit on
>   parent_memcg). This constrains the cgroup layout a bit. Some
>   customers want to purchase $MEM and set up their workload with a few
>   child cgroups. A system daemon hands out $MEM by setting low_limit
>   for top-level containers (e.g. parent_memcg). Thereafter such
>   customers are able to partition their workload with sub memcgs below
>   child_memcg. Example:
>       parent_memcg
>            \
>         child_memcg
>          /       \
>      server    backup
I think that the low_limit makes sense where you actually want to
protect something from reclaim. And backup sounds like a bad fit for
that.
> Thereafter customers often want some weak isolation between server and
> backup. To avoid undesired oom kills the server/backup isolation is
> provided with a softer memory guarantee (e.g. soft_limit). The soft
> limit acts like the low_limit until priority becomes desperate.
Johannes was already suggesting that the low_limit should allow for a
weaker semantic as well. I am not very much inclined to that, but I can
live with a knob which would say oom_on_lowlimit (on by default but
allowed to be set to 0). We would fall back to full reclaim if
no groups turn out to be reclaimable.
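From userspace such a knob might look like the sketch below. To be clear,
memory.oom_on_lowlimit is only the name floated above and nothing that
exists yet; memory.low_limit_in_bytes is likewise assumed from the patch
titles, and the scratch-directory fallback keeps the commands runnable
without a cgroupfs mount.

```shell
# Hypothetical knob from the discussion above; not an existing interface.
CG=${CGROUP_ROOT:-$(mktemp -d)}
mkdir -p "$CG/sensitive_memcg"
echo $((500 << 20)) > "$CG/sensitive_memcg/memory.low_limit_in_bytes"

# default would be 1: prefer OOM over reclaiming below the low limit
echo 1 > "$CG/sensitive_memcg/memory.oom_on_lowlimit"

# opting out: fall back to full reclaim when no group is reclaimable
echo 0 > "$CG/sensitive_memcg/memory.oom_on_lowlimit"
```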
--
Michal Hocko
SUSE Labs