From: Nikanth Karthikesan
Organization: suse.de
To: David Rientjes
Cc: Evgeniy Polyakov, Andrew Morton, Alan Cox, linux-kernel@vger.kernel.org,
    Linus Torvalds, Chris Snook, Arve Hjønnevåg, Paul Menage,
    containers@lists.linux-foundation.org
Subject: Re: [RFC] [PATCH] Cgroup based OOM killer controller
Date: Fri, 23 Jan 2009 20:26:15 +0530
Message-Id: <200901232026.16778.knikanth@suse.de>
References: <200901211638.23101.knikanth@suse.de> <200901231515.37442.knikanth@suse.de>

On Friday 23 January 2009 16:03:49 David Rientjes wrote:
> On Fri, 23 Jan 2009, Nikanth Karthikesan wrote:
> > > Of course, because the oom killer must be aware that tasks in disjoint
> > > cpusets are more likely than not to result in no memory freeing for
> > > current's subsequent allocations.
> >
> > Yes, the problem is cpuset does not track the tasks which has allocated
> > from this node - who has either moved or changed it set of allowable
> > nodes. And because of that it does not limit oom killing to the tasks
> > with in those tasks and could kill some innocent tasks at times.
>
> Right, the logic to prefer tasks that share the same set of allowable
> nodes as the oom-triggering task is implemented in badness() and being in
> a completely disjoint cpuset does not specifically exclude a task from
> being chosen as the oom killer's victim. That's because, as you said, it
> could have allocated memory elsewhere before changing cpusets or its set
> of allowable mems.
>
> > As it is unable to take deterministic decision as memcg does, it plays
> > with badness value and only suggests but does not restricts within those
> > tasks that has to be killed.
> >
> > This bug is present even without this patch.
>
> It's not a bug, it can actually help in a couple instances:
>
>  - a much larger memory hogging task is identified in a disjoint cpuset
>    and the liklihood it has allocated memory elsewhere either previously
>    or atomically, or
>
>  - an administrator tunes the oom_adj value for such a task to describe
>    the above behavior even for smaller tasks and their liklihood to
>    allocate outside of their exclusive cpuset.
>

In other instances it can also kill innocent tasks unless the administrator
tunes oom_adj. Consider something like kvm, which would have a huge amount
of memory accounted to it that might actually have been allocated on a
different node altogether. Killing a single VM kills all of the processes
inside that guest OS ;) Don't you think this has to be fixed^Wimproved?

> > This patch adds one more easier way for the administrator to over-ride.
>
> Yeah, I know. But the problem with the approach is that it specifies an
> oom priority for both global unconstrained ooms and cpuset-constrained
> ooms.
>
> It's quite possible with your patch to identify an aggregate of tasks that
> should be killed first whenever the system is completely out of memory.
> That's great, and solves your problem.
> But that same system cannot
> correctly use cpusets that have the potential to ever oom because your
> patch completely overrides the victim priority and could needlessly kill
> tasks that will not lead to future memory freeing.
>
> That's my objection to the proposal: it doesn't behave appropriately for
> both global unconstrained ooms and cpuset-constrained ooms at the same
> time.
>

So you are against specifying the order when it is a cpuset-constrained
oom. Here is a revised version of the patch which adds
oom.cpuset_constraint; when set to 1, it disables the ordering imposed by
this controller for cpuset-constrained ooms. Will this work for you?

Thanks
Nikanth

From: Nikanth Karthikesan

Cgroup based OOM killer controller

Signed-off-by: Nikanth Karthikesan

---
This is a container group based approach to override the oom killer
selection without losing all the benefits of the current oom killer
heuristics and the oom_adj interface.

It adds a tunable oom.victim to the oom cgroup. The oom killer will kill
processes using the usual badness value, but only within the cgroup with
the maximum oom.victim value, before killing any process from a cgroup
with a lesser oom.victim number. Oom killing can be disabled by setting
oom.victim = 0.

Difference from oom_adj:
This controller allows users to specify a "strict order" for oom kills.
While oom_adj works only as a hint to the kernel, the OOM killer
controller gives users full control to decide the order in which processes
should be killed. It is very hard to specify an oom-kill order for several
applications using just oom_adj, because one needs to adjust the oom_adj
of various tasks based on the oom_score of several tasks, which keeps
changing.

CPUSET constrained OOM:
The tunable oom.cpuset_constraint, when enabled, disables the ordering
imposed by this controller for cpuset-constrained OOMs.
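For illustration, here is how the proposed interface might be exercised from userspace. This is a hypothetical sketch: it assumes the patch below is applied, the kernel is built with CONFIG_CGROUP_OOM, and the mount point /cgroups/oom is an arbitrary administrator choice.

```shell
# Mount the oom controller (mount point is arbitrary).
mkdir -p /cgroups/oom
mount -t cgroup -o oom none /cgroups/oom

# Mark a group of batch jobs as the preferred oom victims.
mkdir /cgroups/oom/batch
echo 10 > /cgroups/oom/batch/oom.victim
echo $$ > /cgroups/oom/batch/tasks        # move the current shell in

# Exempt another group from oom killing altogether.
mkdir /cgroups/oom/critical
echo 0 > /cgroups/oom/critical/oom.victim

# Root-only knob: ignore oom.victim ordering for cpuset-constrained ooms.
echo 1 > /cgroups/oom/oom.cpuset_constraint
```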
diff --git a/Documentation/cgroups/oom.txt b/Documentation/cgroups/oom.txt
new file mode 100644
index 0000000..772fb41
--- /dev/null
+++ b/Documentation/cgroups/oom.txt
@@ -0,0 +1,34 @@
+OOM Killer controller
+--- ------ ----------
+
+The OOM killer kills a process based on a set of heuristics such that only
+a minimum amount of work done is lost, a large amount of memory is
+recovered, and a minimum number of processes are killed.
+
+The user can adjust the score used to select the processes to be killed
+using /proc/<pid>/oom_adj. Giving it a high score will increase the
+likelihood of the process being killed by the oom-killer. Valid values are
+in the range -16 to +15, plus the special value -17, which disables
+oom-killing altogether for that process.
+
+But it is very difficult to suggest an order among tasks to be killed
+during an Out Of Memory situation. The OOM Killer controller aids in
+doing that.
+
+USAGE
+-----
+
+Mount the oom controller by passing 'oom' when mounting cgroups. Echo
+a value into the oom.victim file to change the order. The oom killer will
+kill all the processes in a cgroup with a higher oom.victim value before
+killing a process in a cgroup with a lower oom.victim value. Among tasks
+with the same oom.victim value, the usual badness heuristics are applied.
+/proc/<pid>/oom_adj still helps adjust the oom killer score. Setting
+oom.victim = 0 disables oom killing for the tasks in that cgroup.
+
+Note: If this is used without proper consideration, innocent processes may
+get killed unnecessarily.
+
+CPUSET constrained OOM:
+Setting oom.cpuset_constraint=1 disables the ordering during a cpuset
+constrained oom. Setting oom.cpuset_constraint=0 does not distinguish
+between a cpuset constrained oom and a system wide oom.
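The selection rule documented above can be modeled with a short userspace sketch. This is illustrative Python only, not kernel code; the task names and scores are invented:

```python
# Model of the proposed rule: the victim comes from the cgroup with the
# highest oom.victim value; ties are broken by the usual badness score.
# A cgroup with oom.victim == 0 is exempt from oom killing entirely.

def select_bad_process(tasks):
    """tasks: iterable of (name, victim, badness) tuples."""
    chosen = None
    for name, victim, badness in tasks:
        if victim == 0:
            continue  # oom killing disabled for this cgroup
        # Compare (victim, badness) lexicographically, victim first.
        if chosen is None or (victim, badness) > chosen[1:]:
            chosen = (name, victim, badness)
    return chosen[0] if chosen else None

tasks = [
    ("httpd",    1, 300),  # root cgroup, large badness
    ("batchjob", 5,  40),  # cgroup marked as preferred victim
    ("kvm",      0, 900),  # oom.victim = 0: never selected
]
print(select_bad_process(tasks))  # prints "batchjob"
```

Note that "batchjob" wins despite its low badness, which is exactly the strict-order behavior oom_adj alone cannot express.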
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..6944f99 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -59,4 +59,8 @@ SUBSYS(freezer)
 SUBSYS(net_cls)
 #endif
 
+#ifdef CONFIG_CGROUP_OOM
+SUBSYS(oom)
+#endif
+
 /* */
diff --git a/include/linux/oomcontrol.h b/include/linux/oomcontrol.h
new file mode 100644
index 0000000..d4c4c72
--- /dev/null
+++ b/include/linux/oomcontrol.h
@@ -0,0 +1,19 @@
+#ifndef _LINUX_OOMCONTROL_H
+#define _LINUX_OOMCONTROL_H
+
+#ifdef CONFIG_CGROUP_OOM
+struct oom_cgroup {
+	struct cgroup_subsys_state css;
+	/*
+	 * the order to be victimized for this group
+	 */
+	u64 victim;
+
+	/*
+	 * disable during cpuset constrained oom
+	 */
+	bool *cpuset_constraint;
+};
+
+#endif
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index 2af8382..99ed0de 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -354,6 +354,15 @@
 
 	  Say N if unsure.
 
+config CGROUP_OOM
+	bool "Oom cgroup subsystem"
+	depends on CGROUPS
+	help
+	  This provides a cgroup subsystem which aids controlling
+	  the order in which tasks should be killed during
+	  out of memory situations.
+
+
 config CGROUP_NS
 	bool "Namespace cgroup subsystem"
 	depends on CGROUPS
diff --git a/mm/Makefile b/mm/Makefile
index 72255be..a5d7222 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,3 +33,4 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_OOM) += oomcontrol.o
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 40ba050..555ecb6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -26,6 +26,7 @@
 #include
 #include
 #include
+#include <linux/oomcontrol.h>
 #include
 
 int sysctl_panic_on_oom;
@@ -200,11 +201,15 @@ static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  * (not docbooked, we don't want this one cluttering up the manual)
  */
 static struct task_struct *select_bad_process(unsigned long *ppoints,
-						struct mem_cgroup *mem)
+		struct mem_cgroup *mem, int cpuset_constrained)
 {
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
 	struct timespec uptime;
+#ifdef CONFIG_CGROUP_OOM
+	u64 chosenvictim = 1, taskvictim;
+	bool honour_cpuset_constraint;
+#endif
 	*ppoints = 0;
 
 	do_posix_clock_monotonic_gettime(&uptime);
@@ -257,10 +262,30 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			continue;
 
 		points = badness(p, uptime.tv_sec);
+#ifdef CONFIG_CGROUP_OOM
+		taskvictim = (container_of(p->cgroups->subsys[oom_subsys_id],
+				struct oom_cgroup, css))->victim;
+		honour_cpuset_constraint = *(container_of(p->cgroups->subsys[oom_subsys_id],
+				struct oom_cgroup, css))->cpuset_constraint;
+
+		if (taskvictim > chosenvictim ||
+		    (((taskvictim == chosenvictim) ||
+		      (cpuset_constrained && honour_cpuset_constraint)) &&
+		     points > *ppoints) ||
+		    (taskvictim && !chosen)) {
+
+			chosen = p;
+			*ppoints = points;
+			chosenvictim = taskvictim;
+
+		}
+
+#else
 		if (points > *ppoints || !chosen) {
 			chosen = p;
 			*ppoints = points;
 		}
+#endif
 	} while_each_thread(g, p);
 
 	return chosen;
@@ -431,7 +456,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem);
+	p = select_bad_process(&points, mem, 0); /* not cpuset constrained */
 
 	if (PTR_ERR(p) == -1UL)
 		goto out;
@@ -513,7 +538,7 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, int order, int cpuset_constrained)
 {
 	if (sysctl_oom_kill_allocating_task) {
 		oom_kill_process(current, gfp_mask, order, 0, NULL,
@@ -528,7 +553,7 @@ retry:
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL);
+	p = select_bad_process(&points, NULL, cpuset_constrained);
 
 	if (PTR_ERR(p) == -1UL)
 		return;
@@ -569,7 +594,8 @@ void pagefault_out_of_memory(void)
 		panic("out of memory from page fault. panic_on_oom is selected.\n");
 
 	read_lock(&tasklist_lock);
-	__out_of_memory(0, 0); /* unknown gfp_mask and order */
+	/* unknown gfp_mask and order and not cpuset constrained */
+	__out_of_memory(0, 0, 0);
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -623,7 +649,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 			panic("out of memory. panic_on_oom is selected\n");
 		/* Fall-through */
 	case CONSTRAINT_CPUSET:
-		__out_of_memory(gfp_mask, order);
+		__out_of_memory(gfp_mask, order, 1);
 		break;
 	}
diff --git a/mm/oomcontrol.c b/mm/oomcontrol.c
new file mode 100644
index 0000000..24ee0da
--- /dev/null
+++ b/mm/oomcontrol.c
@@ -0,0 +1,138 @@
+/*
+ * mm/oomcontrol.c - oom handler cgroup.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/oomcontrol.h>
+
+/*
+ * Helper to retrieve the oom controller state from a cgroup
+ */
+static struct oom_cgroup *oom_css_from_cgroup(struct cgroup *cgrp)
+{
+	return container_of(cgroup_subsys_state(cgrp, oom_subsys_id),
+			    struct oom_cgroup, css);
+}
+
+static struct cgroup_subsys_state *oom_create(struct cgroup_subsys *ss,
+						struct cgroup *cont)
+{
+	struct oom_cgroup *oom_css = kzalloc(sizeof(*oom_css), GFP_KERNEL);
+	struct oom_cgroup *parent;
+
+	if (!oom_css)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * if root, last/only group to be victimized;
+	 * else inherit the parent's value
+	 */
+	if (cont->parent == NULL) {
+		oom_css->victim = 1;
+		oom_css->cpuset_constraint =
+			kzalloc(sizeof(*oom_css->cpuset_constraint), GFP_KERNEL);
+		*oom_css->cpuset_constraint = false;
+	} else {
+		parent = oom_css_from_cgroup(cont->parent);
+		oom_css->victim = parent->victim;
+		oom_css->cpuset_constraint = parent->cpuset_constraint;
+	}
+
+	return &oom_css->css;
+}
+
+static void oom_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	struct oom_cgroup *oom_css = oom_css_from_cgroup(cont);
+
+	if (cont->parent == NULL)
+		kfree(oom_css->cpuset_constraint);
+	kfree(cont->subsys[oom_subsys_id]);
+}
+
+static int oom_victim_write(struct cgroup *cgrp, struct cftype *cft,
+				u64 val)
+{
+	cgroup_lock();
+	oom_css_from_cgroup(cgrp)->victim = val;
+	cgroup_unlock();
+
+	return 0;
+}
+
+static u64 oom_victim_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	u64 victim = oom_css_from_cgroup(cgrp)->victim;
+
+	return victim;
+}
+
+static int oom_cpuset_write(struct cgroup *cont, struct cftype *cft,
+				const char *buffer)
+{
+	if (buffer[0] == '1' && buffer[1] == 0)
+		*(oom_css_from_cgroup(cont))->cpuset_constraint = true;
+	else if (buffer[0] == '0' && buffer[1] == 0)
+		*(oom_css_from_cgroup(cont))->cpuset_constraint = false;
+	else
+		return -EINVAL;
+	return 0;
+}
+
+static u64 oom_cpuset_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	if (*(oom_css_from_cgroup(cgrp))->cpuset_constraint)
+		return 1;
+	else
+		return 0;
+}
+
+static struct cftype oom_cgroup_files[] = {
+	{
+		.name = "victim",
+		.read_u64 = oom_victim_read,
+		.write_u64 = oom_victim_write,
+	},
+};
+
+static struct cftype oom_cgroup_root_files[] = {
+	{
+		.name = "victim",
+		.read_u64 = oom_victim_read,
+		.write_u64 = oom_victim_write,
+	},
+	{
+		.name = "cpuset_constraint",
+		.read_u64 = oom_cpuset_read,
+		.write_string = oom_cpuset_write,
+	},
+};
+
+static int oom_populate(struct cgroup_subsys *ss,
+				struct cgroup *cont)
+{
+	int ret;
+
+	if (cont->parent == NULL) {
+		ret = cgroup_add_files(cont, ss, oom_cgroup_root_files,
+				ARRAY_SIZE(oom_cgroup_root_files));
+	} else {
+		ret = cgroup_add_files(cont, ss, oom_cgroup_files,
+				ARRAY_SIZE(oom_cgroup_files));
+	}
+	return ret;
+}
+
+struct cgroup_subsys oom_subsys = {
+	.name = "oom",
+	.subsys_id = oom_subsys_id,
+	.create = oom_create,
+	.destroy = oom_destroy,
+	.populate = oom_populate,
+};
-- 
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/