Date: Tue, 17 Nov 2009 18:13:48 -0800 (PST)
From: David Rientjes
To: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Andrew Morton, Christoph Lameter
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.2
In-Reply-To: <20091118095824.076c211f.kamezawa.hiroyu@jp.fujitsu.com>

On Wed, 18 Nov 2009, KAMEZAWA Hiroyuki wrote:

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index ab04537..4c5c58b 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -27,6 +27,7 @@
> >  #include <linux/notifier.h>
> >  #include <linux/memcontrol.h>
> >  #include <linux/security.h>
> > +#include <linux/mempolicy.h>
> >
> >  int sysctl_panic_on_oom;
> >  int sysctl_oom_kill_allocating_task;
> > @@ -35,18 +36,30 @@ static DEFINE_SPINLOCK(zone_scan_lock);
> >  /* #define DEBUG */
> >
> >  /*
> > - * Is all threads of the target process nodes overlap ours?
> > + * Do the nodes allowed by any of tsk's threads overlap ours?
> >   */
> > -static int has_intersects_mems_allowed(struct task_struct *tsk)
> > +static int has_intersects_mems_allowed(struct task_struct *tsk,
> > +					nodemask_t *nodemask)
> >  {
> > -	struct task_struct *t;
> > +	struct task_struct *start = tsk;
> > +	NODEMASK_ALLOC(nodemask_t, mpol_nodemask, GFP_KERNEL);
> >
> > -	t = tsk;
> > +	if (!nodemask)
> > +		mpol_nodemask = NULL;
> >  	do {
> > -		if (cpuset_mems_allowed_intersects(current, t))
> > +		if (mpol_nodemask) {
> > +			mpol_get(tsk->mempolicy);
> > +			if (init_nodemask_of_task_mempolicy(tsk, mpol_nodemask) &&
> > +			    nodes_intersects(*nodemask, *mpol_nodemask)) {
> > +				mpol_put(tsk->mempolicy);
> > +				return 1;
> > +			}
> > +			mpol_put(tsk->mempolicy);
> > +		}
>
> Hmm, are these mpol_get()/mpol_put() calls necessary while tasklist_lock
> is held?

They are; we don't hold tasklist_lock while dropping the reference count
in do_exit().
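
To illustrate why the pinning reference is needed, here's a minimal
userspace sketch of the pattern (the types and the simplified lifetime
are assumptions for illustration, not the actual mempolicy code):

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in for struct mempolicy and its reference count. */
struct policy {
	_Atomic int refcnt;
	unsigned long nodes;		/* stand-in for v.nodes */
};

static struct policy *pol_get(struct policy *p)
{
	if (p)
		atomic_fetch_add(&p->refcnt, 1);
	return p;
}

static void pol_put(struct policy *p)
{
	/* The last put frees the policy; an exiting task may do this at
	 * any time, with no lock ordering it against a tasklist scan. */
	if (p && atomic_fetch_sub(&p->refcnt, 1) == 1)
		free(p);
}

int main(void)
{
	struct policy *p = malloc(sizeof(*p));

	atomic_init(&p->refcnt, 1);	/* the task's own reference */
	p->nodes = 0x3;

	pol_get(p);			/* scanner pins the policy ... */
	pol_put(p);			/* ... so the "exiting" put ... */
	printf("nodes: %lx\n", p->nodes); /* ... leaves it readable */
	pol_put(p);			/* scanner's put frees it */
	return 0;
}

tasklist_lock keeps the task_struct itself alive during the scan, but
nothing orders it against the final mpol_put() in the exit path, hence
the get/put pair around the dereference.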
> And... I wonder:
>
> 	if (!init_nodemask_of_task_mempolicy(tsk, mpol_nodemask))
> 		return 1; /* this task uses default policy */
>
> > +		if (cpuset_mems_allowed_intersects(current, tsk))
> >  			return 1;
> > -		t = next_thread(t);
> > -	} while (t != tsk);
> > +		tsk = next_thread(tsk);
> > +	} while (tsk != start);
> >
>
> Sigh... we have to scan all threads, again.
> Do you have an idea for improving this?
>
> For example:
> 	mm->mask_of_nodes_which_a_page_was_allocated_on
> or
> 	mm->mask_of_nodes_made_by_some_magical_technique
> or some such?
> (Maybe per-node rss is overkill.)

The same criticism could be made of CONSTRAINT_CPUSET. In neither case,
mempolicy or cpusets, do we actually know whether memory was ever
allocated on a particular node in tsk->mempolicy->v.nodes or
tsk->mems_allowed, respectively. We assume, however, that if a node is
included in a mempolicy nodemask or cpuset mems, then it is an allowed
node for all attached tasks to allocate from (and those tasks aren't
solely allocating on a subset), so that killing tasks based on their
potential for allocating on oom nodes is actually helpful.
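
To make that concrete, here is a minimal userspace sketch of the
intersection heuristic (the toy nodemask_t and the function name are
invented for illustration; the kernel uses nodes_intersects() on real
nodemasks):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t nodemask_t;	/* one bit per NUMA node (toy version) */

/*
 * A task is a useful kill candidate only if the nodes it is *allowed*
 * to allocate from intersect the nodes that are actually oom; whether
 * it ever really allocated there is unknown in both the mempolicy and
 * the cpuset case.
 */
static bool may_free_oom_memory(nodemask_t task_allowed, nodemask_t oom_nodes)
{
	return (task_allowed & oom_nodes) != 0;
}

int main(void)
{
	nodemask_t oom_nodes = 0x3;		/* nodes 0-1 are oom */
	nodemask_t bound_elsewhere = 0xc;	/* MPOL_BIND to nodes 2-3 */
	nodemask_t overlapping = 0x6;		/* allowed on nodes 1-2 */

	printf("bound elsewhere: %d\n",
	       may_free_oom_memory(bound_elsewhere, oom_nodes));	/* 0 */
	printf("overlapping:     %d\n",
	       may_free_oom_memory(overlapping, oom_nodes));		/* 1 */
	return 0;
}

This is also why badness() only divides points by 8 for a
non-intersecting task instead of skipping it outright: the intersection
is a hint about where the task *can* allocate, not proof of where it
did.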
> >
> >  	return 0;
> >  }
> > @@ -55,6 +68,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
> >   * badness - calculate a numeric value for how bad this task has been
> >   * @p: task struct of which task we should calculate
> >   * @uptime: current uptime in seconds
> > + * @constraint: type of oom constraint
> > + * @nodemask: nodemask passed to page allocator
> >   *
> >   * The formula used is relatively simple and documented inline in the
> >   * function. The main rationale is that we want to select a good task
> > @@ -70,7 +85,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
> >   * of least surprise ... (be careful when you change it)
> >   */
> >
> > -unsigned long badness(struct task_struct *p, unsigned long uptime)
> > +unsigned long badness(struct task_struct *p, unsigned long uptime,
> > +		enum oom_constraint constraint, nodemask_t *nodemask)
> >  {
> >  	unsigned long points, cpu_time, run_time;
> >  	struct mm_struct *mm;
> > @@ -171,7 +187,9 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> >  	 * because p may have allocated or otherwise mapped memory on
> >  	 * this node before. However it will be less likely.
> >  	 */
> > -	if (!has_intersects_mems_allowed(p))
> > +	if (!has_intersects_mems_allowed(p,
> > +			constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
> > +								 NULL))
> >  		points /= 8;
> >
> >  	/*
> > @@ -244,7 +262,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> >   * (not docbooked, we don't want this one cluttering up the manual)
> >   */
> >  static struct task_struct *select_bad_process(unsigned long *ppoints,
> > -						struct mem_cgroup *mem)
> > +		struct mem_cgroup *mem, enum oom_constraint constraint,
> > +		nodemask_t *nodemask)
> >  {
> >  	struct task_struct *p;
> >  	struct task_struct *chosen = NULL;
> > @@ -300,7 +319,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> >  		if (p->signal->oom_adj == OOM_DISABLE)
> >  			continue;
> >
> > -		points = badness(p, uptime.tv_sec);
> > +		points = badness(p, uptime.tv_sec, constraint, nodemask);
> >  		if (points > *ppoints || !chosen) {
> >  			chosen = p;
> >  			*ppoints = points;
> > @@ -472,7 +491,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> >
> >  	read_lock(&tasklist_lock);
> >  retry:
> > -	p = select_bad_process(&points, mem);
> > +	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
> >  	if (PTR_ERR(p) == -1UL)
> >  		goto out;
> >
> > @@ -554,7 +573,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
> >  /*
> >   * Must be called with tasklist_lock held for read.
> >   */
> > -static void __out_of_memory(gfp_t gfp_mask, int order)
> > +static void __out_of_memory(gfp_t gfp_mask, int order,
> > +		enum oom_constraint constraint, nodemask_t *nodemask)
> >  {
> >  	struct task_struct *p;
> >  	unsigned long points;
> > @@ -568,7 +588,7 @@ retry:
> >  	 * Rambo mode: Shoot down a process and hope it solves whatever
> >  	 * issues we may have.
> >  	 */
> > -	p = select_bad_process(&points, NULL);
> > +	p = select_bad_process(&points, NULL, constraint, nodemask);
> >
> >  	if (PTR_ERR(p) == -1UL)
> >  		return;
> > @@ -609,7 +629,8 @@ void pagefault_out_of_memory(void)
> >  		panic("out of memory from page fault. panic_on_oom is selected.\n");
> >
> >  	read_lock(&tasklist_lock);
> > -	__out_of_memory(0, 0); /* unknown gfp_mask and order */
> > +	/* unknown gfp_mask and order */
> > +	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
> >  	read_unlock(&tasklist_lock);
> >
> >  	/*
> > @@ -656,11 +677,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >  	read_lock(&tasklist_lock);
> >
> >  	switch (constraint) {
> > -	case CONSTRAINT_MEMORY_POLICY:
> > -		oom_kill_process(current, gfp_mask, order, 0, NULL,
> > -				"No available memory (MPOL_BIND)");
> > -		break;
> > -
> >  	case CONSTRAINT_NONE:
> >  		if (sysctl_panic_on_oom) {
> >  			dump_header(gfp_mask, order, NULL);
> > @@ -668,7 +684,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >  		}
> >  		/* Fall-through */
> >  	case CONSTRAINT_CPUSET:
> > -		__out_of_memory(gfp_mask, order);
> > +	case CONSTRAINT_MEMORY_POLICY:
> > +		__out_of_memory(gfp_mask, order, constraint, nodemask);
> >  		break;
> >  	}
>
> Maybe good. But hmm... does this work well with per-vma mempolicies?
>
> I wonder whether
> 	mm->mask_of_nodes_made_by_some_magical_technique
> will be necessary for completeness.

I think that would probably be rejected because of its implications for
the allocation fastpath. The change here doesn't make the oom killing
any less ideal; current may never have allocated any memory on its
mempolicy nodes prior to the oom, so killing it may be entirely useless.
It's better to use our heuristics to determine the ideal task to kill in
that case, restricting our subset of eligible tasks by the same criteria
that we use for cpusets.
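
For comparison, here is a sketch of roughly what maintaining such a mask
would add to every allocation (the struct and field are hypothetical;
nothing like this exists in the allocator):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-mm mask of nodes any page was ever allocated on. */
struct mm_sketch {
	_Atomic uint64_t nodes_allocated_on;
};

/*
 * Every page allocation would need something like this: a test plus an
 * occasional atomic RMW on a cache line shared by every thread of the
 * process, on the hottest path in the allocator.
 */
static inline void record_alloc_node(struct mm_sketch *mm, int nid)
{
	uint64_t bit = 1ull << nid;

	if (!(atomic_load_explicit(&mm->nodes_allocated_on,
				   memory_order_relaxed) & bit))
		atomic_fetch_or_explicit(&mm->nodes_allocated_on, bit,
					 memory_order_relaxed);
}

int main(void)
{
	struct mm_sketch mm;

	atomic_init(&mm.nodes_allocated_on, 0);
	record_alloc_node(&mm, 1);
	record_alloc_node(&mm, 3);
	printf("mask: %llx\n",
	       (unsigned long long)atomic_load(&mm.nodes_allocated_on));
	return 0;
}

The intersection heuristic, by contrast, costs nothing at allocation
time and only does work once we are already in the oom killer.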