2014-12-16 13:25:53

by Chintan Pandya

[permalink] [raw]
Subject: [PATCH] memcg: Provide knob for force OOM into the memcg

We may want to use memcg to limit the total memory
footprint of all the processes within the one group.
This may lead to a situation where any arbitrary
process cannot get migrated to that one memcg
because its limits will be breached. Or, process can
get migrated but even being most recently used
process, it can get killed by in-cgroup OOM. To
avoid such scenarios, provide a convenient knob
by which we can forcefully trigger OOM and make
a room for upcoming process.

To trigger force OOM,
$ echo 1 > /<memcg_path>/memory.force_oom

Signed-off-by: Chintan Pandya <[email protected]>
---
mm/memcontrol.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ef91e85..4c68aa7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3305,6 +3305,30 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
return 0;
}

+static int mem_cgroup_force_oom(struct cgroup *cont, unsigned int event)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ int ret;
+
+ if (mem_cgroup_is_root(memcg))
+ return -EINVAL;
+
+ css_get(&memcg->css);
+ ret = mem_cgroup_handle_oom(memcg, GFP_KERNEL, 0);
+ css_put(&memcg->css);
+
+ return ret;
+}
+
+static int mem_cgroup_force_oom_write(struct cgroup *cgrp,
+ struct cftype *cft, u64 val)
+{
+ if (val > 1 || val < 1)
+ return -EINVAL;
+
+ return mem_cgroup_force_oom(cgrp, 0);
+}
+
static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
@@ -4442,6 +4466,11 @@ static struct cftype mem_cgroup_files[] = {
.write = mem_cgroup_force_empty_write,
},
{
+ .name = "force_oom",
+ .trigger = mem_cgroup_force_oom,
+ .write_u64 = mem_cgroup_force_oom_write,
+ },
+ {
.name = "use_hierarchy",
.write_u64 = mem_cgroup_hierarchy_write,
.read_u64 = mem_cgroup_hierarchy_read,
--
Chintan Pandya

QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


2014-12-16 13:39:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] memcg: Provide knob for force OOM into the memcg

On Tue 16-12-14 18:55:35, Chintan Pandya wrote:
> We may want to use memcg to limit the total memory
> footprint of all the processes within the one group.
> This may lead to a situation where any arbitrary
> process cannot get migrated to that one memcg
> because its limits will be breached. Or, process can
> get migrated but even being most recently used
> process, it can get killed by in-cgroup OOM. To
> avoid such scenarios, provide a convenient knob
> by which we can forcefully trigger OOM and make
> a room for upcoming process.
>
> To trigger force OOM,
> $ echo 1 > /<memcg_path>/memory.force_oom

What would prevent another task deplete that memory shortly after you
triggered OOM and end up in the same situation? E.g. while the moving
task is migrating its charges to the new group...

Why cannot you simply disable OOM killer in that memcg and handle it
from userspace properly?

> Signed-off-by: Chintan Pandya <[email protected]>
> ---
> mm/memcontrol.c | 29 +++++++++++++++++++++++++++++
> 1 file changed, 29 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ef91e85..4c68aa7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3305,6 +3305,30 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> return 0;
> }
>
> +static int mem_cgroup_force_oom(struct cgroup *cont, unsigned int event)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> + int ret;
> +
> + if (mem_cgroup_is_root(memcg))
> + return -EINVAL;
> +
> + css_get(&memcg->css);
> + ret = mem_cgroup_handle_oom(memcg, GFP_KERNEL, 0);
> + css_put(&memcg->css);
> +
> + return ret;
> +}
> +
> +static int mem_cgroup_force_oom_write(struct cgroup *cgrp,
> + struct cftype *cft, u64 val)
> +{
> + if (val > 1 || val < 1)
> + return -EINVAL;
> +
> + return mem_cgroup_force_oom(cgrp, 0);
> +}
> +
> static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
> char *buf, size_t nbytes,
> loff_t off)
> @@ -4442,6 +4466,11 @@ static struct cftype mem_cgroup_files[] = {
> .write = mem_cgroup_force_empty_write,
> },
> {
> + .name = "force_oom",
> + .trigger = mem_cgroup_force_oom,
> + .write_u64 = mem_cgroup_force_oom_write,
> + },
> + {
> .name = "use_hierarchy",
> .write_u64 = mem_cgroup_hierarchy_write,
> .read_u64 = mem_cgroup_hierarchy_read,
> --
> Chintan Pandya
>
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> member of the Code Aurora Forum, hosted by The Linux Foundation
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Michal Hocko
SUSE Labs

2014-12-16 16:59:31

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] memcg: Provide knob for force OOM into the memcg

On Tue, Dec 16, 2014 at 06:55:35PM +0530, Chintan Pandya wrote:
> We may want to use memcg to limit the total memory
> footprint of all the processes within the one group.
> This may lead to a situation where any arbitrary
> process cannot get migrated to that one memcg
> because its limits will be breached. Or, process can
> get migrated but even being most recently used
> process, it can get killed by in-cgroup OOM. To
> avoid such scenarios, provide a convenient knob
> by which we can forcefully trigger OOM and make
> a room for upcoming process.

Why do you move tasks around during runtime? Rather than scanning
thousands or millions of page table entries to relocate a task and its
private memory to another configuration domain, wouldn't it be easier to
just keep the task in a dedicated cgroup and reconfigure that instead?

There doesn't seem to be a strong usecase for charge migration that
couldn't be solved by doing things slightly differently from userspace.
Certainly not something that justifies the complexity that it adds to
memcg model and it's synchronization requirements from VM hotpaths.
Hence, I'm inclined to not add charge moving to version 2 of memcg.

2014-12-16 22:34:03

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] memcg: Provide knob for force OOM into the memcg

On Tue, 16 Dec 2014, Michal Hocko wrote:

> > We may want to use memcg to limit the total memory
> > footprint of all the processes within the one group.
> > This may lead to a situation where any arbitrary
> > process cannot get migrated to that one memcg
> > because its limits will be breached. Or, process can
> > get migrated but even being most recently used
> > process, it can get killed by in-cgroup OOM. To
> > avoid such scenarios, provide a convenient knob
> > by which we can forcefully trigger OOM and make
> > a room for upcoming process.
> >
> > To trigger force OOM,
> > $ echo 1 > /<memcg_path>/memory.force_oom
>
> What would prevent another task deplete that memory shortly after you
> triggered OOM and end up in the same situation? E.g. while the moving
> task is migrating its charges to the new group...
>
> Why cannot you simply disable OOM killer in that memcg and handle it
> from userspace properly?
>

The patch is introducing a mechanism to induce a kernel oom kill for a
memcg hierarchy to make room for it in the new memcg, not disable the oom
killer so the migration fails due to the lower limits.

It doesn't have any basis since a SIGKILL coming from userspace should be
considered the same as a kernel oom kill from the memcg perspective, i.e.
the fatal_signal_pending() checks that allow charge bypass instead of a
strict reliance on TIF_MEMDIE being set.

It seems to be proposed as a shortcut so that the kernel will determine
the best process to kill. That information is available to userspace so
it should be able to just SIGKILL the desired process (either in the
destination memcg or in the source memcg to allow deletion), so this
functionality isn't needed in the kernel.

2014-12-17 11:47:55

by Chintan Pandya

[permalink] [raw]
Subject: Re: [PATCH] memcg: Provide knob for force OOM into the memcg

On 12/17/2014 04:03 AM, David Rientjes wrote:
> On Tue, 16 Dec 2014, Michal Hocko wrote:
>
>>> We may want to use memcg to limit the total memory
>>> footprint of all the processes within the one group.
>>> This may lead to a situation where any arbitrary
>>> process cannot get migrated to that one memcg
>>> because its limits will be breached. Or, process can
>>> get migrated but even being most recently used
>>> process, it can get killed by in-cgroup OOM. To
>>> avoid such scenarios, provide a convenient knob
>>> by which we can forcefully trigger OOM and make
>>> a room for upcoming process.
>>>
>>> To trigger force OOM,
>>> $ echo 1> /<memcg_path>/memory.force_oom
>>
>> What would prevent another task deplete that memory shortly after you
>> triggered OOM and end up in the same situation? E.g. while the moving
>> task is migrating its charges to the new group...

Idea was to trigger an OOM until we can migrate any particular process
onto desired cgroup.

>>
>> Why cannot you simply disable OOM killer in that memcg and handle it
>> from userspace properly?

Well, this can be done it seems. Let me explore around this. Thanks for
this suggestion.

> It seems to be proposed as a shortcut so that the kernel will determine
> the best process to kill. That information is available to userspace so
> it should be able to just SIGKILL the desired process (either in the
> destination memcg or in the source memcg to allow deletion), so this
> functionality isn't needed in the kernel.

Yes, this can be seen as a shortcut because we are off-loading some
task-selection to be killed by OOM on kernel rather than userspace
decides by itself.

--
Chintan Pandya

QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

2014-12-17 12:11:36

by Chintan Pandya

[permalink] [raw]
Subject: Re: [PATCH] memcg: Provide knob for force OOM into the memcg


> Why do you move tasks around during runtime? Rather than scanning
> thousands or millions of page table entries to relocate a task and its
> private memory to another configuration domain, wouldn't it be easier to
> just keep the task in a dedicated cgroup and reconfigure that instead?

Your suggestion is good. But in specific cases, we may have no choice
but to migrate.

Take a case of an Android system where a process/app will never gets
killed until there is really no scope of holding it any longer in RAM.
So, when that process was running as a foreground process, it has to
belong to a group which has no memory limit and cannot be killed. Now,
when the same process goes into background and sits idle, it can be
compressed and cached into some space in RAM. These cached processes are
ever growing list and can be capped with some limit. Naturally, these
processes belongs to different category and hence different cgroup which
just controls such cached processes.

>
> There doesn't seem to be a strong usecase for charge migration that
> couldn't be solved by doing things slightly differently from userspace.
> Certainly not something that justifies the complexity that it adds to
> memcg model and it's synchronization requirements from VM hotpaths.
> Hence, I'm inclined to not add charge moving to version 2 of memcg.

Do you say charge migration is discouraged at runtime ? Difficult to
live with this limitation.

--
Chintan Pandya

QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

2014-12-19 21:15:15

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] memcg: Provide knob for force OOM into the memcg

On Wed, Dec 17, 2014 at 05:41:29PM +0530, Chintan Pandya wrote:
>
> >Why do you move tasks around during runtime? Rather than scanning
> >thousands or millions of page table entries to relocate a task and its
> >private memory to another configuration domain, wouldn't it be easier to
> >just keep the task in a dedicated cgroup and reconfigure that instead?
>
> Your suggestion is good. But in specific cases, we may have no choice but to
> migrate.
>
> Take a case of an Android system where a process/app will never gets killed
> until there is really no scope of holding it any longer in RAM. So, when
> that process was running as a foreground process, it has to belong to a
> group which has no memory limit and cannot be killed. Now, when the same
> process goes into background and sits idle, it can be compressed and cached
> into some space in RAM. These cached processes are ever growing list and can
> be capped with some limit. Naturally, these processes belongs to different
> category and hence different cgroup which just controls such cached
> processes.

This is a valid usecase that should supported, but there has to be a
better way to do it then to move potentially hundreds of megabytes,
page by page, during the task switch. It's a massive amount of work.

But there are also several problematic design implications in moving
memory along with a task.

For one, moving a task involves two separate cgroups that can have an
arbitrary number of controllers attached to them. If this complex
operation fails, you have no idea which component is at fault - was it
a generic cgroup problem, or does the move violate a rule in one of
the controllers? And *if* it fails, what action do you take? Does
the app simply remain a foreground app? Likewise, if you move the
task but fail to migrate some pages, do you leave them behind in the
foreground group where they are exempt from reclaim? Conflating task
organization and resource control like this just isn't a good idea.

Secondly, memory tracked through cgroups does not belong to the task,
it belongs to the cgroup, and we are limited to unreliable heuristics
when determining a task-page relationship. The pages that can't be
attributed properly - like unmapped page cache - again will be left
behind, polluting your reclaim-exempt foreground group.

Another aspect is that the foreground domain is variable in size, but
you are assigning it a minimum amount of space by statically limiting
the background apps. If the foreground doesn't use that space, you
end up killing cached apps for no reason and waste even more memory.
Imagine you run a large app like Maps and then switch to a tiny note
taking app to look something up. Now you're kicking Maps out of the
cache for no reason.

The last point is a much more generic problem. Static limits are not
suited to efficiently partition a machine for realistic workloads.
This is why version 2 will move away from the idea of static hard
limits as the primary means of partitioning, and towards a model that
has the user configure upper and lower boundaries for the expected
workingset size of each group. Static limits will be relegated to
failsafe measures and hard requirements.

Setting a group's lower boundary will tell the kernel how much memory
the group requires at a minimum to function properly, and the kernel
will try to avoid reclaiming and OOM killing groups within their lower
boundary at the expense of groups that are in excess of theirs. Your
configuration can be rephrased using this: by putting all apps into
their own groups, and setting the lower boundary to infinity for the
foreground apps and to zero for the background apps, the kernel will
always reclaim and OOM kill the background apps first.

You get the same foreground app protection as before, but with several
advantages.

Firstly, it separates the task grouping of an app from memory policy,
which allows you to track your apps as self-contained bundles of tasks
and memory. You are no longer required to conflate unrelated apps for
the sake of memory policy, only to reshuffle and break them apart
again using inaccurate separation heuristics that will end up
polluting *both* domains.

Secondly, background apps will no longer get killed based on a static
quota, but based on actual memory pressure. You configure the policy,
and the kernel decides on demand where to get the required memory.

And lastly, you don't have to physically move thousands of pages on
every task switch anymore, AND pay the synchronization overhead that
stems from pages changing cgroups during runtime.

Your use case is valid, but charge migration doesn't seem to be the
right answer here. And I really doubt it's ever the right answer.