Date: Wed, 18 Nov 2009 09:58:24 +0900
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: David Rientjes
Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, Andrew Morton, Christoph Lameter
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.2
Message-Id: <20091118095824.076c211f.kamezawa.hiroyu@jp.fujitsu.com>

Hi,

On Tue, 17 Nov 2009 16:11:58 -0800 (PST)
David Rientjes wrote:

> On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > Fix node-oriented allocation handling in oom-kill.c.
> > I myself consider this a bugfix, not an enhancement.
> >
> > These days, things have changed:
> >  - alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
> >  - mempolicy no longer maintains its own private zonelists.
> >    (And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)
> >
> > So the current oom-killer's check function is wrong.
> >
> > This patch does:
> >  - Check the nodemask: if a nodemask is passed in and it doesn't cover all of
> >    node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
> >  - Scan all zones in the zonelist under the nodemask; if the scan hits the
> >    cpuset's wall, the failure came from the cpuset.
> > And:
> >  - Modify the callers of out_of_memory() not to invoke the OOM killer for
> >    __GFP_THISNODE. This doesn't change the current behavior: callers using
> >    __GFP_THISNODE should handle the page allocation failure by themselves.
> >  - Handle the __GFP_NOFAIL + __GFP_THISNODE path. This is something like a
> >    FIXME, but that gfp mask combination is not used now.
> >
>
> Now that we're passing the nodemask into the oom killer, we should be able
> to do more intelligent CONSTRAINT_MEMORY_POLICY selection.  current is not
> always the ideal task to kill, so it's better to scan the tasklist and
> determine the best task depending on our heuristics, similar to how we
> penalize candidates if they do not share the same cpuset.
>
> Something like the following (untested) patch.  Comments?

Hm, yes. I think this direction is good. I have my own version, but yours
looks nicer. (I'm busy with other troubles these days, sorry.)
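By the way, for readers of the archive: the check described in the changelog
above amounts to roughly the following. This is a sketch reconstructed from
the description, not the actual patch body (which is not quoted in this mail):

static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
                                gfp_t gfp_mask, nodemask_t *nodemask)
{
#ifdef CONFIG_NUMA
        struct zone *zone;
        struct zoneref *z;
        enum zone_type high_zoneidx = gfp_zone(gfp_mask);

        /* __GFP_THISNODE failures are left to the caller, not the OOM killer. */
        if (gfp_mask & __GFP_THISNODE)
                return CONSTRAINT_NONE;

        /*
         * A nodemask that does not cover every node with memory can only
         * have come from a mempolicy.
         */
        if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
                return CONSTRAINT_MEMORY_POLICY;

        /* Otherwise, check whether the cpuset's wall caused the failure. */
        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                        high_zoneidx, nodemask)
                if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
                        return CONSTRAINT_CPUSET;
#endif
        return CONSTRAINT_NONE;
}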
> ---
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -201,7 +201,9 @@ extern void mpol_fix_fork_child_flag(struct task_struct *p);
>  extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  				unsigned long addr, gfp_t gfp_flags,
>  				struct mempolicy **mpol, nodemask_t **nodemask);
> -extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
> +extern bool init_nodemask_of_task_mempolicy(struct task_struct *tsk,
> +				nodemask_t *mask);
> +extern bool init_nodemask_of_current_mempolicy(nodemask_t *mask);
>  extern unsigned slab_node(struct mempolicy *policy);
>  
>  extern enum zone_type policy_zone;
> @@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  	return node_zonelist(0, gfp_flags);
>  }
>  
> -static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
> +static inline bool init_nodemask_of_task_mempolicy(struct task_struct *tsk,
> +				nodemask_t *mask)
> +{
> +	return false;
> +}
> +
> +static inline bool init_nodemask_of_current_mempolicy(nodemask_t *mask)
> +{
> +	return false;
> +}
>  
>  static inline int do_migrate_pages(struct mm_struct *mm,
>  			const nodemask_t *from_nodes,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ed3f392..3ab3021 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1373,7 +1373,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
>  	 * global hstate attribute
>  	 */
>  	if (!(obey_mempolicy &&
> -			init_nodemask_of_mempolicy(nodes_allowed))) {
> +			init_nodemask_of_current_mempolicy(nodes_allowed))) {
>  		NODEMASK_FREE(nodes_allowed);
>  		nodes_allowed = &node_states[N_HIGH_MEMORY];
>  	}
> @@ -1860,7 +1860,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
>  	NODEMASK_ALLOC(nodemask_t, nodes_allowed,
>  		       GFP_KERNEL | __GFP_NORETRY);
>  	if (!(obey_mempolicy &&
> -	      init_nodemask_of_mempolicy(nodes_allowed))) {
> +	      init_nodemask_of_current_mempolicy(nodes_allowed))) {
>  		NODEMASK_FREE(nodes_allowed);
>  		nodes_allowed = &node_states[N_HIGH_MEMORY];
>  	}
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f11fdad..23c84bb 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1568,24 +1568,18 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
>  	}
>  	return zl;
>  }
> +#endif
>  
>  /*
> - * init_nodemask_of_mempolicy
> - *
> - * If the current task's mempolicy is "default" [NULL], return 'false'
> - * to indicate default policy.  Otherwise, extract the policy nodemask
> - * for 'bind' or 'interleave' policy into the argument nodemask, or
> - * initialize the argument nodemask to contain the single node for
> - * 'preferred' or 'local' policy and return 'true' to indicate presence
> - * of non-default mempolicy.
> - *
> - * We don't bother with reference counting the mempolicy [mpol_get/put]
> - * because the current task is examining it's own mempolicy and a task's
> - * mempolicy is only ever changed by the task itself.
> + * If tsk's mempolicy is "default" [NULL], return 'false' to indicate default
> + * policy.  Otherwise, extract the policy nodemask for 'bind' or 'interleave'
> + * policy into the argument nodemask, or initialize the argument nodemask to
> + * contain the single node for 'preferred' or 'local' policy and return 'true'
> + * to indicate presence of non-default mempolicy.
>   *
>   * N.B., it is the caller's responsibility to free a returned nodemask.
>   */
> -bool init_nodemask_of_mempolicy(nodemask_t *mask)
> +bool init_nodemask_of_task_mempolicy(struct task_struct *tsk, nodemask_t *mask)
>  {
>  	struct mempolicy *mempolicy;
>  	int nid;
> @@ -1615,7 +1609,16 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
>  
>  	return true;
>  }
> -#endif
> +
> +/*
> + * We don't bother with reference counting the mempolicy [mpol_get/put]
> + * because the current task is examining it's own mempolicy and a task's
> + * mempolicy is only ever changed by the task itself.
> + */
> +bool init_nodemask_of_current_mempolicy(nodemask_t *mask)
> +{
> +	return init_nodemask_of_task_mempolicy(current, mask);
> +}
>  
>  /* Allocate a page in interleaved policy.
>     Own path because it needs to do special accounting. */
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ab04537..4c5c58b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -27,6 +27,7 @@
>  #include <linux/notifier.h>
>  #include <linux/memcontrol.h>
>  #include <linux/security.h>
> +#include <linux/mempolicy.h>
>  
>  int sysctl_panic_on_oom;
>  int sysctl_oom_kill_allocating_task;
> @@ -35,18 +36,30 @@ static DEFINE_SPINLOCK(zone_scan_lock);
>  /* #define DEBUG */
>  
>  /*
> - * Is all threads of the target process nodes overlap ours?
> + * Do the nodes allowed by any of tsk's threads overlap ours?
>   */
> -static int has_intersects_mems_allowed(struct task_struct *tsk)
> +static int has_intersects_mems_allowed(struct task_struct *tsk,
> +					nodemask_t *nodemask)
>  {
> -	struct task_struct *t;
> +	struct task_struct *start = tsk;
> +	NODEMASK_ALLOC(nodemask_t, mpol_nodemask, GFP_KERNEL);
>  
> -	t = tsk;
> +	if (!nodemask)
> +		mpol_nodemask = NULL;
>  	do {
> -		if (cpuset_mems_allowed_intersects(current, t))
> +		if (mpol_nodemask) {
> +			mpol_get(tsk->mempolicy);
> +			if (init_nodemask_of_task_mempolicy(tsk, mpol_nodemask) &&
> +			    nodes_intersects(*nodemask, *mpol_nodemask)) {
> +				mpol_put(tsk->mempolicy);
> +				return 1;
> +			}
> +			mpol_put(tsk->mempolicy);
> +		}

Hmm, are these mpol_get()/mpol_put() calls necessary while tasklist_lock is
held?

And... I wonder whether this should also do:

	if (!init_nodemask_of_task_mempolicy(tsk, mpol_nodemask))
		return 1;	/* this task uses default policy */

> +		if (cpuset_mems_allowed_intersects(current, tsk))
>  			return 1;
> -		t = next_thread(t);
> -	} while (t != tsk);
> +		tsk = next_thread(tsk);
> +	} while (tsk != start);
>  

Sigh... we have to scan all threads, again. Do you have an idea for improving
this? For example, mm->mask_of_nodes_which_a_page_was_allocated_on, or
mm->mask_of_nodes_made_by_some_magical_technique, or some such? (Maybe
per-node rss is overkill.) A rough sketch of this idea follows the patch
below.

>  	return 0;
>  }
> @@ -55,6 +68,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
>   * badness - calculate a numeric value for how bad this task has been
>   * @p: task struct of which task we should calculate
>   * @uptime: current uptime in seconds
> + * @constraint: type of oom constraint
> + * @nodemask: nodemask passed to page allocator
>   *
>   * The formula used is relatively simple and documented inline in the
>   * function. The main rationale is that we want to select a good task
> @@ -70,7 +85,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
>   * of least surprise ... (be careful when you change it)
>   */
>  
> -unsigned long badness(struct task_struct *p, unsigned long uptime)
> +unsigned long badness(struct task_struct *p, unsigned long uptime,
> +			enum oom_constraint constraint, nodemask_t *nodemask)
>  {
>  	unsigned long points, cpu_time, run_time;
>  	struct mm_struct *mm;
> @@ -171,7 +187,9 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	 * because p may have allocated or otherwise mapped memory on
>  	 * this node before. However it will be less likely.
>  	 */
> -	if (!has_intersects_mems_allowed(p))
> +	if (!has_intersects_mems_allowed(p,
> +			constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
> +								 NULL))
>  		points /= 8;
>  
>  	/*
> @@ -244,7 +262,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
>   * (not docbooked, we don't want this one cluttering up the manual)
>   */
>  static struct task_struct *select_bad_process(unsigned long *ppoints,
> -		struct mem_cgroup *mem)
> +		struct mem_cgroup *mem, enum oom_constraint constraint,
> +		nodemask_t *nodemask)
>  {
>  	struct task_struct *p;
>  	struct task_struct *chosen = NULL;
> @@ -300,7 +319,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  		if (p->signal->oom_adj == OOM_DISABLE)
>  			continue;
>  
> -		points = badness(p, uptime.tv_sec);
> +		points = badness(p, uptime.tv_sec, constraint, nodemask);
>  		if (points > *ppoints || !chosen) {
>  			chosen = p;
>  			*ppoints = points;
> @@ -472,7 +491,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>  
>  	read_lock(&tasklist_lock);
>  retry:
> -	p = select_bad_process(&points, mem);
> +	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
>  	if (PTR_ERR(p) == -1UL)
>  		goto out;
>  
> @@ -554,7 +573,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  /*
>   * Must be called with tasklist_lock held for read.
>   */
> -static void __out_of_memory(gfp_t gfp_mask, int order)
> +static void __out_of_memory(gfp_t gfp_mask, int order,
> +			enum oom_constraint constraint, nodemask_t *nodemask)
>  {
>  	struct task_struct *p;
>  	unsigned long points;
> @@ -568,7 +588,7 @@ retry:
>  	 * Rambo mode: Shoot down a process and hope it solves whatever
>  	 * issues we may have.
>  	 */
> -	p = select_bad_process(&points, NULL);
> +	p = select_bad_process(&points, NULL, constraint, nodemask);
>  
>  	if (PTR_ERR(p) == -1UL)
>  		return;
> @@ -609,7 +629,8 @@ void pagefault_out_of_memory(void)
>  		panic("out of memory from page fault. panic_on_oom is selected.\n");
>  
>  	read_lock(&tasklist_lock);
> -	__out_of_memory(0, 0); /* unknown gfp_mask and order */
> +	/* unknown gfp_mask and order */
> +	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
>  	read_unlock(&tasklist_lock);
>  
>  	/*
> @@ -656,11 +677,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	read_lock(&tasklist_lock);
>  
>  	switch (constraint) {
> -	case CONSTRAINT_MEMORY_POLICY:
> -		oom_kill_process(current, gfp_mask, order, 0, NULL,
> -				"No available memory (MPOL_BIND)");
> -		break;
> -
>  	case CONSTRAINT_NONE:
>  		if (sysctl_panic_on_oom) {
>  			dump_header(gfp_mask, order, NULL);
> @@ -668,7 +684,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		}
>  		/* Fall-through */
>  	case CONSTRAINT_CPUSET:
> -		__out_of_memory(gfp_mask, order);
> +	case CONSTRAINT_MEMORY_POLICY:
> +		__out_of_memory(gfp_mask, order, constraint, nodemask);
>  		break;
>  	}

Maybe good. But hmm... does this work well with per-VMA mempolicy? I wonder
whether something like mm->mask_of_nodes_made_by_some_magical_technique will
be necessary for completeness.
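To make that per-mm mask idea concrete, a rough and purely hypothetical
sketch (the field name and hook point are invented for illustration; nothing
like this exists in the patch above):

/*
 * Hypothetical: keep in each mm a mask of the nodes it has allocated
 * pages on, so the OOM killer can test overlap with one mask operation
 * instead of walking every thread's mempolicy and cpuset.
 *
 * In struct mm_struct (invented field):
 *	nodemask_t alloc_nodes;		(cleared in mm_init())
 */

/* Hook somewhere in the allocation/rss accounting path (illustrative): */
static inline void mm_track_alloc_node(struct mm_struct *mm, int nid)
{
	node_set(nid, mm->alloc_nodes);
}

/* has_intersects_mems_allowed() could then reduce to something like: */
static inline int mm_nodes_intersect(struct mm_struct *mm,
				     const nodemask_t *nodemask)
{
	return nodes_intersects(mm->alloc_nodes, *nodemask);
}

This trades a little extra accounting in the allocation path for an O(1)
check at OOM time, and it would naturally cover per-VMA policies too.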
Thanks,
-Kame