The per-task oom_adj value is a characteristic of its mm more than of the
task itself, since it's not possible to oom kill any thread that shares
the mm. If a task were to be killed while attached to an mm that could
not be freed because another thread sharing it was set to OOM_DISABLE, it
would have been needlessly terminated since there is no potential for
future memory freeing.
This patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct mm_struct. This requires taking task_lock()
on a task when checking its oom_adj value to protect against exec, but
it's already necessary to take the lock when dereferencing the mm to find
the total VM size for the badness heuristic.
This fixes a livelock if the oom killer chooses a task and another thread
sharing the same memory has an oom_adj value of OOM_DISABLE. It occurs
because oom_kill_task() repeatedly returns 1, refusing to kill the chosen
task, while select_bad_process() keeps choosing that same task on every
subsequent retry.
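For reference, the pre-patch retry path looks roughly like this (a
simplified paraphrase of the retry loop in mm/oom_kill.c, not the exact
code):

retry:
	/* may well pick the same task it picked last time around */
	p = select_bad_process(&points, NULL);
	/* (error/panic handling for !p omitted) */

	/*
	 * oom_kill_process() ends up in oom_kill_task(), which returns 1
	 * when any thread sharing p->mm is OOM_DISABLE, so we go around
	 * again, select the very same task, and never make progress.
	 */
	if (p && oom_kill_process(p, gfp_mask, order, points, NULL,
				  "Out of memory"))
		goto retry;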
The task_lock() taken in select_bad_process() to check for OOM_DISABLE
and in oom_kill_task() to check for threads sharing the same memory will
be removed in the next patch in this series, where it is no longer
necessary.
Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
these threads are immune from oom killing already. They simply report an
oom_adj value of OOM_DISABLE.
Cc: Nick Piggin <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/filesystems/proc.txt | 15 ++++++++++-----
fs/proc/base.c | 19 ++++++++++++++++---
include/linux/mm_types.h | 2 ++
include/linux/sched.h | 1 -
mm/oom_kill.c | 34 ++++++++++++++++++++++------------
5 files changed, 50 insertions(+), 21 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1003,11 +1003,13 @@ CHAPTER 3: PER-PROCESS PARAMETERS
3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
------------------------------------------------------
-This file can be used to adjust the score used to select which processes
-should be killed in an out-of-memory situation. Giving it a high score will
-increase the likelihood of this process being killed by the oom-killer. Valid
-values are in the range -16 to +15, plus the special value -17, which disables
-oom-killing altogether for this process.
+This file can be used to adjust the score used to select which processes should
+be killed in an out-of-memory situation. The oom_adj value is a characteristic
+of the task's mm, so all threads that share an mm with pid will have the same
+oom_adj value. A high value will increase the likelihood of this process being
+killed by the oom-killer. Valid values are in the range -16 to +15 as
+explained below and a special value of -17, which disables oom-killing
+altogether for threads sharing pid's mm.
The process to be killed in an out-of-memory situation is selected among all others
based on its badness score. This value equals the original memory size of the process
@@ -1021,6 +1023,9 @@ the parent's score if they do not share the same memory. Thus forking servers
are the prime candidates to be killed. Having only one 'hungry' child will make
parent less preferable than the child.
+/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
+oom-killing already.
+
/proc/<pid>/oom_score shows process' current badness score.
The following heuristics are then applied:
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1006,7 +1006,12 @@ static ssize_t oom_adjust_read(struct file *file, char __user *buf,
if (!task)
return -ESRCH;
- oom_adjust = task->oomkilladj;
+ task_lock(task);
+ if (task->mm)
+ oom_adjust = task->mm->oom_adj;
+ else
+ oom_adjust = OOM_DISABLE;
+ task_unlock(task);
put_task_struct(task);
len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
@@ -1035,11 +1040,19 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
task = get_proc_task(file->f_path.dentry->d_inode);
if (!task)
return -ESRCH;
- if (oom_adjust < task->oomkilladj && !capable(CAP_SYS_RESOURCE)) {
+ task_lock(task);
+ if (!task->mm) {
+ task_unlock(task);
+ put_task_struct(task);
+ return -EINVAL;
+ }
+ if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
+ task_unlock(task);
put_task_struct(task);
return -EACCES;
}
- task->oomkilladj = oom_adjust;
+ task->mm->oom_adj = oom_adjust;
+ task_unlock(task);
put_task_struct(task);
if (end - buffer == 0)
return -EIO;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -232,6 +232,8 @@ struct mm_struct {
unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
+ s8 oom_adj; /* OOM kill score adjustment (bit shift) */
+
cpumask_t cpu_vm_mask;
/* Architecture-specific MM context */
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1149,7 +1149,6 @@ struct task_struct {
* a short time
*/
unsigned char fpu_counter;
- s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
#ifdef CONFIG_BLK_DEV_IO_TRACE
unsigned int btrace_seq;
#endif
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -58,6 +58,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
unsigned long points, cpu_time, run_time;
struct mm_struct *mm;
struct task_struct *child;
+ int oom_adj;
task_lock(p);
mm = p->mm;
@@ -65,6 +66,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
task_unlock(p);
return 0;
}
+ oom_adj = mm->oom_adj;
/*
* The memory size of the process is the basis for the badness.
@@ -148,15 +150,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
points /= 8;
/*
- * Adjust the score by oomkilladj.
+ * Adjust the score by oom_adj.
*/
- if (p->oomkilladj) {
- if (p->oomkilladj > 0) {
+ if (oom_adj) {
+ if (oom_adj > 0) {
if (!points)
points = 1;
- points <<= p->oomkilladj;
+ points <<= oom_adj;
} else
- points >>= -(p->oomkilladj);
+ points >>= -(oom_adj);
}
#ifdef DEBUG
@@ -251,8 +253,12 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
*ppoints = ULONG_MAX;
}
- if (p->oomkilladj == OOM_DISABLE)
+ task_lock(p);
+ if (p->mm && p->mm->oom_adj == OOM_DISABLE) {
+ task_unlock(p);
continue;
+ }
+ task_unlock(p);
points = badness(p, uptime.tv_sec);
if (points > *ppoints || !chosen) {
@@ -304,8 +310,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
}
printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d %3d %s\n",
p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
- get_mm_rss(mm), (int)task_cpu(p), p->oomkilladj,
- p->comm);
+ get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
task_unlock(p);
} while_each_thread(g, p);
}
@@ -367,8 +372,12 @@ static int oom_kill_task(struct task_struct *p)
* Don't kill the process if any threads are set to OOM_DISABLE
*/
do_each_thread(g, q) {
- if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
+ task_lock(q);
+ if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
+ task_unlock(q);
return 1;
+ }
+ task_unlock(q);
} while_each_thread(g, q);
__oom_kill_task(p, 1);
@@ -393,10 +402,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
struct task_struct *c;
if (printk_ratelimit()) {
- printk(KERN_WARNING "%s invoked oom-killer: "
- "gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
- current->comm, gfp_mask, order, current->oomkilladj);
task_lock(current);
+ printk(KERN_WARNING "%s invoked oom-killer: "
+ "gfp_mask=0x%x, order=%d, oom_adj=%d\n",
+ current->comm, gfp_mask, order,
+ current->mm ? current->mm->oom_adj : OOM_DISABLE);
cpuset_print_task_mems_allowed(current);
task_unlock(current);
dump_stack();
This moves the check for OOM_DISABLE into the badness heuristic so it is
only necessary to hold task_lock() once. If the mm's oom_adj is
OOM_DISABLE, the score is 0, which is also correctly exported via
/proc/pid/oom_score. This requires that tasks with badness scores of 0 be
prohibited from being oom killed, which makes sense since killing them
would not allow for any future memory freeing anyway.
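Concretely, select_bad_process() starts with *ppoints at 0 and, with the
change below, only a strictly greater score can be chosen, so a badness()
of 0 can never win (rough sketch, not a verbatim copy):

	*ppoints = 0;
	do_each_thread(g, p) {
		...
		points = badness(p, uptime.tv_sec);
		if (points > *ppoints) {	/* 0 > 0 is false, so p is skipped */
			chosen = p;
			*ppoints = points;
		}
	} while_each_thread(g, p);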
Since the oom_adj value is a characteristic of an mm and not a task, it
is no longer necessary to check the oom_adj value for threads sharing the
same memory (except when simply issuing SIGKILLs for threads in other
thread groups).
Cc: Nick Piggin <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
mm/oom_kill.c | 42 ++++++++++--------------------------------
1 files changed, 10 insertions(+), 32 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -67,6 +67,10 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
return 0;
}
oom_adj = mm->oom_adj;
+ if (oom_adj == OOM_DISABLE) {
+ task_unlock(p);
+ return 0;
+ }
/*
* The memory size of the process is the basis for the badness.
@@ -253,15 +257,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
*ppoints = ULONG_MAX;
}
- task_lock(p);
- if (p->mm && p->mm->oom_adj == OOM_DISABLE) {
- task_unlock(p);
- continue;
- }
- task_unlock(p);
-
points = badness(p, uptime.tv_sec);
- if (points > *ppoints || !chosen) {
+ if (points > *ppoints) {
chosen = p;
*ppoints = points;
}
@@ -354,32 +351,13 @@ static int oom_kill_task(struct task_struct *p)
struct mm_struct *mm;
struct task_struct *g, *q;
+ task_lock(p);
mm = p->mm;
-
- /* WARNING: mm may not be dereferenced since we did not obtain its
- * value from get_task_mm(p). This is OK since all we need to do is
- * compare mm to q->mm below.
- *
- * Furthermore, even if mm contains a non-NULL value, p->mm may
- * change to NULL at any time since we do not hold task_lock(p).
- * However, this is of no concern to us.
- */
-
- if (mm == NULL)
+ if (!mm || mm->oom_adj == OOM_DISABLE) {
+ task_unlock(p);
return 1;
-
- /*
- * Don't kill the process if any threads are set to OOM_DISABLE
- */
- do_each_thread(g, q) {
- task_lock(q);
- if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
- task_unlock(q);
- return 1;
- }
- task_unlock(q);
- } while_each_thread(g, q);
-
+ }
+ task_unlock(p);
__oom_kill_task(p, 1);
/*
The oom killer must be invoked regardless of the order if the allocation
is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to
free some memory.
Cc: Nick Piggin <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
mm/page_alloc.c | 12 +++++++-----
1 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1547,7 +1547,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
goto out;
/* The OOM killer will not help higher order allocs */
- if (order > PAGE_ALLOC_COSTLY_ORDER)
+ if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
goto out;
/* Exhausted what can be done so it's blamo time */
@@ -1765,11 +1765,13 @@ rebalance:
goto got_pg;
/*
- * The OOM killer does not trigger for high-order allocations
- * but if no progress is being made, there are no other
- * options and retrying is unlikely to help
+ * The OOM killer does not trigger for high-order
+ * ~__GFP_NOFAIL allocations so if no progress is being
+ * made, there are no other options and retrying is
+ * unlikely to help.
*/
- if (order > PAGE_ALLOC_COSTLY_ORDER)
+ if (order > PAGE_ALLOC_COSTLY_ORDER &&
+ !(gfp_mask & __GFP_NOFAIL))
goto nopage;
goto restart;
On Mon, 1 Jun 2009 18:31:14 -0700 (PDT) David Rientjes <[email protected]> wrote:
> The oom killer must be invoked regardless of the order if the allocation
> is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to
> free some memory.
>
> Cc: Nick Piggin <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> mm/page_alloc.c | 12 +++++++-----
> 1 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1547,7 +1547,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> goto out;
>
> /* The OOM killer will not help higher order allocs */
> - if (order > PAGE_ALLOC_COSTLY_ORDER)
> + if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
> goto out;
>
> /* Exhausted what can be done so it's blamo time */
> @@ -1765,11 +1765,13 @@ rebalance:
> goto got_pg;
>
> /*
> - * The OOM killer does not trigger for high-order allocations
> - * but if no progress is being made, there are no other
> - * options and retrying is unlikely to help
> + * The OOM killer does not trigger for high-order
> + * ~__GFP_NOFAIL allocations so if no progress is being
> + * made, there are no other options and retrying is
> + * unlikely to help.
> */
> - if (order > PAGE_ALLOC_COSTLY_ORDER)
> + if (order > PAGE_ALLOC_COSTLY_ORDER &&
> + !(gfp_mask & __GFP_NOFAIL))
> goto nopage;
>
> goto restart;
I really think/hope/expect that this is unneeded.
Do we know of any callsites which do greater-than-order-0 allocations
with GFP_NOFAIL? If so, we should fix them.
Then just ban order>0 && GFP_NOFAIL allocations.
On Mon, Jun 01, 2009 at 10:56:02PM -0700, Andrew Morton wrote:
> On Mon, 1 Jun 2009 18:31:14 -0700 (PDT) David Rientjes <[email protected]> wrote:
> Do we know of any callsites which do greater-than-order-0 allocations
> with GFP_NOFAIL? If so, we should fix them.
Yeah, we have GFP_NOFAIL going to the slab allocator, so all bets are
off in regard to the order.
Hmm, checkpatch should warn about it too, shouldn't it? We recently
acquired one right on the IO submission path, unfortunately.
7ba1ba12eeef0aa7113beb16410ef8b7c748e18b
Hi Nick,
On Mon, Jun 01, 2009 at 10:56:02PM -0700, Andrew Morton wrote:
>> Do we know of any callsites which do greater-than-order-0 allocations
>> with GFP_NOFAIL? If so, we should fix them.
On Tue, Jun 2, 2009 at 9:27 AM, Nick Piggin <[email protected]> wrote:
> Yeah, we have GFP_NOFAIL going to the slab allocator, so all bets are
> off in regard to the order.
SLUB supports variable order allocations so we can certainly do
GFP_NOFAIL special casing there for < PAGE_SIZE allocations to enforce
zero order allocation for those. I suspect we could do something
similar in SLQB too?
Pekka
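A minimal illustration of the kind of special casing being suggested here
(hypothetical helper, not actual SLUB or SLQB code):

	/*
	 * Hypothetical sketch only: clamp the slab page order to 0 for
	 * __GFP_NOFAIL when the object fits in a single page, so the page
	 * allocator's no-fail guarantee is preserved.
	 */
	static int slab_nofail_order(size_t object_size, gfp_t flags,
				     int preferred_order)
	{
		if ((flags & __GFP_NOFAIL) && object_size <= PAGE_SIZE)
			return 0;
		return preferred_order;
	}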
On Mon, 1 Jun 2009, Andrew Morton wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1547,7 +1547,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > goto out;
> >
> > /* The OOM killer will not help higher order allocs */
> > - if (order > PAGE_ALLOC_COSTLY_ORDER)
> > + if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
> > goto out;
> >
> > /* Exhausted what can be done so it's blamo time */
> > @@ -1765,11 +1765,13 @@ rebalance:
> > goto got_pg;
> >
> > /*
> > - * The OOM killer does not trigger for high-order allocations
> > - * but if no progress is being made, there are no other
> > - * options and retrying is unlikely to help
> > + * The OOM killer does not trigger for high-order
> > + * ~__GFP_NOFAIL allocations so if no progress is being
> > + * made, there are no other options and retrying is
> > + * unlikely to help.
> > */
> > - if (order > PAGE_ALLOC_COSTLY_ORDER)
> > + if (order > PAGE_ALLOC_COSTLY_ORDER &&
> > + !(gfp_mask & __GFP_NOFAIL))
> > goto nopage;
> >
> > goto restart;
>
> I really think/hope/expect that this is unneeded.
>
> Do we know of any callsites which do greater-than-order-0 allocations
> with GFP_NOFAIL? If so, we should fix them.
>
> Then just ban order>0 && GFP_NOFAIL allocations.
>
That seems like a different topic: banning higher-order __GFP_NOFAIL
allocations or just deprecating __GFP_NOFAIL altogether and slowly
switching users over is a worthwhile effort, but is unrelated.
This patch is necessary because we explicitly deny the oom killer from
being used when the order is greater than PAGE_ALLOC_COSTLY_ORDER because
of an assumption that it won't help. That assumption isn't always true,
especially for large memory-hogging tasks that have mlocked large chunks
of contiguous memory, for example. The only thing we do know is that
direct reclaim has not made any progress so we're unlikely to get a
substantial amount of memory freeing in the immediate future. Such an
instance will simply loop forever without killing that rogue task for a
__GFP_NOFAIL allocation.
So while it's better in the long-term to deprecate the flag as much as
possible and perhaps someday remove it from the page allocator entirely,
we're faced with the current behavior of either looping endlessly or
freeing memory so the kernel allocation may succeed when direct reclaim
has failed, which also makes this a rare instance where the oom killer
will never needlessly kill a task.
On Tue, 2009-06-02 at 00:26 -0700, David Rientjes wrote:
> > I really think/hope/expect that this is unneeded.
> >
> > Do we know of any callsites which do greater-than-order-0 allocations
> > with GFP_NOFAIL? If so, we should fix them.
> >
> > Then just ban order>0 && GFP_NOFAIL allocations.
> >
>
> That seems like a different topic: banning higher-order __GFP_NOFAIL
> allocations or just deprecating __GFP_NOFAIL altogether and slowly
> switching users over is a worthwhile effort, but is unrelated.
>
> This patch is necessary because we explicitly deny the oom killer from
> being used when the order is greater than PAGE_ALLOC_COSTLY_ORDER because
> of an assumption that it won't help. That assumption isn't always true,
> especially for large memory-hogging tasks that have mlocked large chunks
> of contiguous memory, for example. The only thing we do know is that
> direct reclaim has not made any progress so we're unlikely to get a
> substantial amount of memory freeing in the immediate future. Such an
> instance will simply loop forever without killing that rogue task for a
> __GFP_NOFAIL allocation.
>
> So while it's better in the long-term to deprecate the flag as much as
> possible and perhaps someday remove it from the page allocator entirely,
> we're faced with the current behavior of either looping endlessly or
> freeing memory so the kernel allocation may succeed when direct reclaim
> has failed, which also makes this a rare instance where the oom killer
> will never needlessly kill a task.
I would really prefer if we do as Andrew suggests. Both will fix this
problem, so I don't see it as a different topic at all.
Eradicating __GFP_NOFAIL is a fine goal, but very hard work (people have
been wanting to do that for many years). But simply limiting it to
0-order allocation should be much(?) easier.
On Tue, Jun 02, 2009 at 09:34:55AM +0200, Peter Zijlstra wrote:
> On Tue, 2009-06-02 at 00:26 -0700, David Rientjes wrote:
> > > I really think/hope/expect that this is unneeded.
> > >
> > > Do we know of any callsites which do greater-than-order-0 allocations
> > > with GFP_NOFAIL? If so, we should fix them.
> > >
> > > Then just ban order>0 && GFP_NOFAIL allocations.
> > >
> >
> > That seems like a different topic: banning higher-order __GFP_NOFAIL
> > allocations or just deprecating __GFP_NOFAIL altogether and slowly
> > switching users over is a worthwhile effort, but is unrelated.
> >
> > This patch is necessary because we explicitly deny the oom killer from
> > being used when the order is greater than PAGE_ALLOC_COSTLY_ORDER because
> > of an assumption that it won't help. That assumption isn't always true,
> > especially for large memory-hogging tasks that have mlocked large chunks
> > of contiguous memory, for example. The only thing we do know is that
> > direct reclaim has not made any progress so we're unlikely to get a
> > substantial amount of memory freeing in the immediate future. Such an
> > instance will simply loop forever without killing that rogue task for a
> > __GFP_NOFAIL allocation.
> >
> > So while it's better in the long-term to deprecate the flag as much as
> > possible and perhaps someday remove it from the page allocator entirely,
> > we're faced with the current behavior of either looping endlessly or
> > freeing memory so the kernel allocation may succeed when direct reclaim
> > has failed, which also makes this a rare instance where the oom killer
> > will never needlessly kill a task.
>
> I would really prefer if we do as Andrew suggests. Both will fix this
> problem, so I don't see it as a different topic at all.
Well, his patch, as it stands, is a good one. Because we do have
potential higher order GFP_NOFAIL.
I don't particularly want to add complexity (not a single branch)
to SLQB to handle this (and how does the caller *really* know
anyway? Do they know the exact object size, the hardware alignment
constraints, the page size, etc. well enough to know that all of the
many slab allocators will be able to satisfy it with an order-0
allocation?)
> Eradicating __GFP_NOFAIL is a fine goal, but very hard work (people have
> been wanting to do that for many years). But simply limiting it to
> 0-order allocation should be much(?) easier.
Some of them may be hard work, but I don't think anybody has
been working too hard at it ;)
On Tue, 2 Jun 2009, Nick Piggin wrote:
> > I would really prefer if we do as Andrew suggests. Both will fix this
> > problem, so I don't see it as a different topic at all.
>
> Well, his patch, as it stands, is a good one. Because we do have
> potential higher order GFP_NOFAIL.
>
There's currently an inconsistency in the definition of __GFP_NOFAIL and
its implementation. The clearly defined purpose of the flag is:
* __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
* cannot handle allocation failures.
Yet __GFP_NOFAIL allocations may fail if no progress is made via direct
reclaim and order > PAGE_ALLOC_COSTLY_ORDER. That's the behavior in the
git HEAD and Mel's allocator rework in mmotm.
I've been addressing this implicitly by requiring __GFP_NOFAIL to always
abide by the definition: we simply can never return NULL because the
caller can't handle it (and, by definition, shouldn't even be responsible
for considering it).
With my patch, we kill a memory hogging task that will free some memory so
the allocation will succeed (or multiple tasks if insufficient contiguous
memory is available). Kernel allocations use __GFP_NOFAIL, so the fault
of this memory freeing is entirely on the caller, not the page allocator.
My preference for handling this is to merge my patch (obviously :), and
then hopefully deprecate __GFP_NOFAIL as much as possible although I don't
suspect it could be eradicated forever.
On Tue, 2 Jun 2009, David Rientjes wrote:
> With my patch, we kill a memory hogging task that will free some memory so
> the allocation will succeed (or multiple tasks if insufficient contiguous
> memory is available). Kernel allocations use __GFP_NOFAIL, so the fault
> of this memory freeing is entirely on the caller, not the page allocator.
>
> My preference for handling this is to merge my patch (obviously :), and
> then hopefully deprecate __GFP_NOFAIL as much as possible although I don't
> suspect it could be eradicated forever.
>
I really hope this patch isn't getting dropped because it fixes the
possibility that a __GFP_NOFAIL allocation will fail when its definition
is to the contrary. Depending on the size of the allocation, that can
cause a panic in at least the reiserfs, ntfs, cxgb3, and gfs2 cases.
As I mentioned before, it's a noble goal to deprecate __GFP_NOFAIL as much
as possible and (at the least) prevent it from trying high-order
allocation attempts. The current implementation of the flag is
problematic, however, and this patch addresses it by attempting to free
some memory when direct reclaim fails.
This change has already been acked by Rik van Riel in
http://marc.info/?l=linux-kernel&m=124199910508930 and Mel Gorman in
http://marc.info/?l=linux-kernel&m=124203850416159.
On Wed, 3 Jun 2009 15:10:38 -0700 (PDT)
David Rientjes <[email protected]> wrote:
> On Tue, 2 Jun 2009, David Rientjes wrote:
>
> > With my patch, we kill a memory hogging task that will free some memory so
> > the allocation will succeed (or multiple tasks if insufficient contiguous
> > memory is available). Kernel allocations use __GFP_NOFAIL, so the fault
> > of this memory freeing is entirely on the caller, not the page allocator.
> >
> > My preference for handling this is to merge my patch (obviously :), and
> > then hopefully deprecate __GFP_NOFAIL as much as possible although I don't
> > suspect it could be eradicated forever.
> >
>
> I really hope this patch isn't getting dropped because it fixes the
> possibility that a __GFP_NOFAIL allocation will fail when its definition
> is to the contrary. Depending on the size of the allocation, that can
> cause a panic in at least the reiserfs, ntfs, cxgb3, and gfs2 cases.
>
> As I mentioned before, it's a noble goal to deprecate __GFP_NOFAIL as much
> as possible and (at the least) prevent it from trying high-order
> allocation attempts. The current implementation of the flag is
> problematic, however, and this patch addresses it by attempting to free
> some memory when direct reclaim fails.
>
Sigh, all right, but we suck.
Divy, could we please at least remove __GFP_NOFAIL from
drivers/net/cxgb? It's really quite inappropriate for a driver to
assume that core VM can do magic. Drivers should test the return value
and handle the -ENOMEM in the old-fashioned way, please.
Ditto-in-spades cfq-iosched.c. We discussed that recently but I forgot the
upshot. The code and its comment are still in flagrant disagreement?
Andrew Morton wrote:
> On Wed, 3 Jun 2009 15:10:38 -0700 (PDT)
> David Rientjes <[email protected]> wrote:
>
>
>> On Tue, 2 Jun 2009, David Rientjes wrote:
>>
>>
>>> With my patch, we kill a memory hogging task that will free some memory so
>>> the allocation will succeed (or multiple tasks if insufficient contiguous
>>> memory is available). Kernel allocations use __GFP_NOFAIL, so the fault
>>> of this memory freeing is entirely on the caller, not the page allocator.
>>>
>>> My preference for handling this is to merge my patch (obviously :), and
>>> then hopefully deprecate __GFP_NOFAIL as much as possible although I don't
>>> suspect it could be eradicated forever.
>>>
>>>
>> I really hope this patch isn't getting dropped because it fixes the
>> possibility that a __GFP_NOFAIL allocation will fail when its definition
>> is to the contrary. Depending on the size of the allocation, that can
>> cause a panic in at least the reiserfs, ntfs, cxgb3, and gfs2 cases.
>>
>> As I mentioned before, it's a noble goal to deprecate __GFP_NOFAIL as much
>> as possible and (at the least) prevent it from trying high-order
>> allocation attempts. The current implementation of the flag is
>> problematic, however, and this patch addresses it by attempting to free
>> some memory when direct reclaim fails.
>>
>>
>
> Sigh, all right, but we suck.
>
> Divy, could we please at least remove __GFP_NOFAIL from
> drivers/net/cxgb? It's really quite inappropriate for a driver to
> assume that core VM can do magic. Drivers should test the return value
> and handle the -ENOMEM in the old-fashioned way, please.
>
We started working on it, and have eliminated them from
cxgb3_main.c::init_tp_parity().
We're still looking at eliminating their use in
cxgb3_offload.c::t3_process_tid_release_list().
I'll post the patches as soon as they are ready.
cheers,
Divy