2010-02-10 16:32:14

by David Rientjes

[permalink] [raw]
Subject: [patch 0/7 -mm] oom killer rewrite

This patchset is a rewrite of the out of memory killer to address several
issues that have been raised recently. The most notable change is a
complete rewrite of the badness heuristic that determines which task is
killed; the goal was to make it as simple and predictable as possible
while still addressing issues that plague the VM.

This patchset is based on mmotm-2010-02-05-15-06 because of the following
dependencies:

[patch 4/7] oom: badness heuristic rewrite:
mm-count-swap-usage.patch

[patch 5/7] oom: replace sysctls with quick mode:
sysctl-clean-up-vm-related-variable-delcarations.patch

To apply to mainline, download 2.6.33-rc7 and apply

mm-clean-up-mm_counter.patch
mm-avoid-false-sharing-of-mm_counter.patch
mm-avoid-false_sharing-of-mm_counter-checkpatch-fixes.patch
mm-count-swap-usage.patch
mm-count-swap-usage-checkpatch-fixes.patch
mm-introduce-dump_page-and-print-symbolic-flag-names.patch
sysctl-clean-up-vm-related-variable-declarations.patch
sysctl-clean-up-vm-related-variable-declarations-fix.patch

from http://userweb.kernel.org/~akpm/mmotm/broken-out.tar.gz first.
---
Documentation/filesystems/proc.txt | 78 ++++---
Documentation/sysctl/vm.txt | 51 ++---
fs/proc/base.c | 13 +-
include/linux/mempolicy.h | 13 +-
include/linux/oom.h | 18 +-
kernel/sysctl.c | 15 +-
mm/mempolicy.c | 39 +++
mm/oom_kill.c | 455 +++++++++++++++++++-----------------
mm/page_alloc.c | 3 +
9 files changed, 383 insertions(+), 302 deletions(-)


2010-02-10 16:32:19

by David Rientjes

[permalink] [raw]
Subject: [patch 2/7 -mm] oom: sacrifice child with highest badness score for parent

When a task is chosen for oom kill, the oom killer first attempts to
sacrifice a child not sharing its parent's memory instead.
Unfortunately, this often kills in a seemingly random fashion based on
the ordering of the selected task's child list. Additionally, it is not
guaranteed at all to free a large amount of memory that we need to
prevent additional oom killing in the very near future.

Instead, we now only attempt to sacrifice the worst child not sharing its
parent's memory, if one exists. The worst child is indicated with the
highest badness() score. This serves two advantages: we kill a
memory-hogging task more often, and we allow the configurable
/proc/pid/oom_adj value to be considered as a factor in which child to
kill.

Reviewers may observe that the previous implementation would iterate
through the children and attempt to kill each until one was successful
and then the parent if none were found while the new code simply kills
the most memory-hogging task or the parent. Note that the only time
oom_kill_task() fails, however, is when a child does not have an mm or
has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both
cases, so the final oom_kill_task() will always succeed.

Signed-off-by: David Rientjes <[email protected]>
---
mm/oom_kill.c | 23 +++++++++++++++++------
1 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -432,7 +432,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
unsigned long points, struct mem_cgroup *mem,
const char *message)
{
+ struct task_struct *victim = p;
struct task_struct *c;
+ unsigned long victim_points = 0;
+ struct timespec uptime;

if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem);
@@ -446,17 +449,25 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
return 0;
}

- printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
- message, task_pid_nr(p), p->comm, points);
+ pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
+ message, task_pid_nr(p), p->comm, points);

- /* Try to kill a child first */
+ /* Try to sacrifice the worst child first */
+ do_posix_clock_monotonic_gettime(&uptime);
list_for_each_entry(c, &p->children, sibling) {
+ unsigned long cpoints;
+
if (c->mm == p->mm)
continue;
- if (!oom_kill_task(c))
- return 0;
+
+ /* badness() returns 0 if the thread is unkillable */
+ cpoints = badness(c, uptime.tv_sec);
+ if (cpoints > victim_points) {
+ victim = c;
+ victim_points = cpoints;
+ }
}
- return oom_kill_task(p);
+ return oom_kill_task(victim);
}

#ifdef CONFIG_CGROUP_MEM_RES_CTLR

2010-02-10 16:32:24

by David Rientjes

[permalink] [raw]
Subject: [patch 4/7 -mm] oom: badness heuristic rewrite

This a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions. The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.

The baseline for the heuristic is a proportion of memory that each task
is currently using in memory plus swap compared to the amount of
"allowable" memory. "Allowable," in this sense, means the system-wide
resources for unconstrained oom conditions, the set of mempolicy nodes,
the mems attached to current's cpuset, or a memory controller's limit.
The proportion is given on a scale of 0 (never kill) to 1000 (always
kill), roughly meaning that if a task has a badness() score of 500 that
the task consumes approximately 50% of allowable memory resident in RAM
or in swap space.

The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets
may operate in isolation; they shall not need to know the true size of
the machine on which they are running if they are bound to a specific set
of nodes or mems, respectively.

Forkbomb detection is done in a completely different way: a threshold is
configurable from userspace to determine how many first-generation execve
children (those with their own address spaces) a task may have before it
is considered a forkbomb. This can be tuned by altering the value in
/proc/sys/vm/oom_forkbomb_thres, which defaults to 1000.

When a task has more than 1000 first-generation children with different
address spaces than itself, a penalty of

(average rss of children) * (# of 1st generation execve children)
-----------------------------------------------------------------
oom_forkbomb_thres

is assessed. So, for example, using the default oom_forkbomb_thres of
1000, the penalty is twice the average rss of all its execve children if
there are 2000 such tasks. A task is considered to count toward the
threshold if its total runtime is less than one second; for 1000 of such
tasks to exist, the parent process must be forking at an extremely high
rate either erroneously or maliciously.

Even though a particular task may be designated a forkbomb and selected
as the victim, the oom killer will still kill the 1st generation execve
child with the highest badness() score in its place. The avoids killing
important servers or system daemons.

Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs. In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, a change
must also be made to the user interface used to tune it. Instead of a
scale from -16 to +15 to indicate a bitshift on the point value given to
a task, which was very difficult to tune accurately or with any degree of
precision, /proc/pid/oom_adj now ranges from -1000 to +1000. That is, it
can be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others are always considered. The value is
added directly into the badness() score so a value of -500, for example,
means to discount 50% of its memory consumption in comparison to other
tasks either on the system, bound to the mempolicy, or in the cpuset.

OOM_ADJUST_MIN and OOM_ADJUST_MAX have been exported to userspace since
2006 via include/linux/oom.h. This alters their values from -16 to -1000
and from +15 to +1000, respectively. OOM_DISABLE is now the equivalent
of the lowest possible value, OOM_ADJUST_MIN. Adding its value, -1000,
to any badness score will always return 0.

Although redefining these values may be controversial, it is much easier
to understand when the units are fully understood as described above.
In the short-term, there may be userspace breakage for tasks that
hardcode -17 meaning OOM_DISABLE, for example, but the long-term will
make the semantics much easier to understand and oom killing much more
effective.

Signed-off-by: David Rientjes <[email protected]>
---
Documentation/filesystems/proc.txt | 78 ++++++----
Documentation/sysctl/vm.txt | 21 +++
fs/proc/base.c | 13 +-
include/linux/oom.h | 15 ++-
kernel/sysctl.c | 8 +
mm/oom_kill.c | 298 +++++++++++++++++++----------------
6 files changed, 256 insertions(+), 177 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1195,39 +1195,51 @@ CHAPTER 3: PER-PROCESS PARAMETERS
3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
------------------------------------------------------

-This file can be used to adjust the score used to select which processes
-should be killed in an out-of-memory situation. Giving it a high score will
-increase the likelihood of this process being killed by the oom-killer. Valid
-values are in the range -16 to +15, plus the special value -17, which disables
-oom-killing altogether for this process.
-
-The process to be killed in an out-of-memory situation is selected among all others
-based on its badness score. This value equals the original memory size of the process
-and is then updated according to its CPU time (utime + stime) and the
-run time (uptime - start time). The longer it runs the smaller is the score.
-Badness score is divided by the square root of the CPU time and then by
-the double square root of the run time.
-
-Swapped out tasks are killed first. Half of each child's memory size is added to
-the parent's score if they do not share the same memory. Thus forking servers
-are the prime candidates to be killed. Having only one 'hungry' child will make
-parent less preferable than the child.
-
-/proc/<pid>/oom_score shows process' current badness score.
-
-The following heuristics are then applied:
- * if the task was reniced, its score doubles
- * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
- or CAP_SYS_RAWIO) have their score divided by 4
- * if oom condition happened in one cpuset and checked process does not belong
- to it, its score is divided by 8
- * the resulting score is multiplied by two to the power of oom_adj, i.e.
- points <<= oom_adj when it is positive and
- points >>= -(oom_adj) otherwise
-
-The task with the highest badness score is then selected and its children
-are killed, process itself will be killed in an OOM situation when it does
-not have children or some of them disabled oom like described above.
+This file can be used to adjust the badness heuristic used to select which
+process gets killed in out of memory conditions.
+
+The badness heuristic assigns a value to each candidate task ranging from 0
+(never kill) to 1000 (always kill) to determine which process is targeted. The
+units are roughly a proportion along that range of allowed memory the process
+may allocate from based on an estimation of its current memory and swap use.
+For example, if a task is using all allowed memory, its badness score will be
+1000. If it is using half of its allowed memory, its score will be 500.
+
+There are a couple of additional factors included in the badness score: root
+processes are given 3% extra memory over other tasks, and tasks which forkbomb
+an excessive number of child processes are penalized by their average size.
+The number of child processes considered to be a forkbomb is configurable
+via /proc/sys/vm/oom_forkbomb_thres (see Documentation/sysctl/vm.txt).
+
+The amount of "allowed" memory depends on the context in which the oom killer
+was called. If it is due to the memory assigned to the allocating task's cpuset
+being exhausted, the allowed memory represents the set of mems assigned to that
+cpuset. If it is due to a mempolicy's node(s) being exhausted, the allowed
+memory represents the set of mempolicy nodes. If it is due to a memory
+limit (or swap limit) being reached, the allowed memory is that configured
+limit. Finally, if it is due to the entire system being out of memory, the
+allowed memory represents all allocatable resources.
+
+The value of /proc/<pid>/oom_adj is added to the badness score before it is used
+to determine which task to kill. Acceptable values range from -1000
+(OOM_MIN_ADJUST) to +1000 (OOM_MAX_ADJUST). This allows userspace to polarize
+the preference for oom killing either by always preferring a certain task or
+completely disabling it. OOM_DISABLE is equivalent to the lowest possible
+value, -1000, since such tasks will always report a badness score of 0.
+
+Consequently, it is very simple for userspace to define the amount of memory to
+consider for each task. Setting a /proc/<pid>/oom_adj value of +500, for
+example, is roughly equivalent to allowing the remainder of tasks sharing the
+same system, cpuset, mempolicy, or memory controller resources to use at least
+50% more memory. A value of -500, on the other hand, would be roughly
+equivalent to discounting 50% of the task's allowed memory from being considered
+as scoring against the task.
+
+Caveat: when a parent task is selected, the oom killer will sacrifice any first
+generation children with seperate address spaces instead, if possible. This
+avoids servers and important system daemons from being killed and loses the
+minimal amount of work.
+

3.2 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -44,6 +44,7 @@ Currently, these files are in /proc/sys/vm:
- nr_trim_pages (only if CONFIG_MMU=n)
- numa_zonelist_order
- oom_dump_tasks
+- oom_forkbomb_thres
- oom_kill_allocating_task
- overcommit_memory
- overcommit_ratio
@@ -490,6 +491,26 @@ The default value is 0.

==============================================================

+oom_forkbomb_thres
+
+This value defines how many children with a seperate address space a specific
+task may have before being considered as a possible forkbomb. Tasks with more
+children not sharing the same address space as the parent will be penalized by a
+quantity of memory equaling
+
+ (average rss of execve children) * (# of 1st generation execve children)
+ ------------------------------------------------------------------------
+ oom_forkbomb_thres
+
+in the oom killer's badness heuristic. Such tasks may be protected with a lower
+oom_adj value (see Documentation/filesystems/proc.txt) if necessary.
+
+A value of 0 will disable forkbomb detection.
+
+The default value is 1000.
+
+==============================================================
+
oom_kill_allocating_task

This enables or disables killing the OOM-triggering task in
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -81,6 +81,7 @@
#include <linux/elf.h>
#include <linux/pid_namespace.h>
#include <linux/fs_struct.h>
+#include <linux/swap.h>
#include "internal.h"

/* NOTE:
@@ -458,7 +459,6 @@ static const struct file_operations proc_lstats_operations = {
#endif

/* The badness from the OOM killer */
-unsigned long badness(struct task_struct *p, unsigned long uptime);
static int proc_oom_score(struct task_struct *task, char *buffer)
{
unsigned long points;
@@ -466,7 +466,13 @@ static int proc_oom_score(struct task_struct *task, char *buffer)

do_posix_clock_monotonic_gettime(&uptime);
read_lock(&tasklist_lock);
- points = badness(task->group_leader, uptime.tv_sec);
+ points = oom_badness(task->group_leader,
+ global_page_state(NR_INACTIVE_ANON) +
+ global_page_state(NR_ACTIVE_ANON) +
+ global_page_state(NR_INACTIVE_FILE) +
+ global_page_state(NR_ACTIVE_FILE) +
+ total_swap_pages,
+ uptime.tv_sec);
read_unlock(&tasklist_lock);
return sprintf(buffer, "%lu\n", points);
}
@@ -1137,8 +1143,7 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
err = strict_strtol(strstrip(buffer), 0, &oom_adjust);
if (err)
return -EINVAL;
- if ((oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX) &&
- oom_adjust != OOM_DISABLE)
+ if (oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX)
return -EINVAL;

task = get_proc_task(file->f_path.dentry->d_inode);
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -1,14 +1,18 @@
#ifndef __INCLUDE_LINUX_OOM_H
#define __INCLUDE_LINUX_OOM_H

-/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
-#define OOM_DISABLE (-17)
+/* /proc/<pid>/oom_adj set to -1000 disables OOM killing for pid */
+#define OOM_DISABLE (-1000)
/* inclusive */
-#define OOM_ADJUST_MIN (-16)
-#define OOM_ADJUST_MAX 15
+#define OOM_ADJUST_MIN OOM_DISABLE
+#define OOM_ADJUST_MAX 1000
+
+/* See Documentation/sysctl/vm.txt */
+#define DEFAULT_OOM_FORKBOMB_THRES 1000

#ifdef __KERNEL__

+#include <linux/sched.h>
#include <linux/types.h>
#include <linux/nodemask.h>

@@ -24,6 +28,8 @@ enum oom_constraint {
CONSTRAINT_MEMORY_POLICY,
};

+extern unsigned int oom_badness(struct task_struct *p,
+ unsigned long totalpages, unsigned long uptime);
extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

@@ -47,6 +53,7 @@ static inline void oom_killer_enable(void)
extern int sysctl_panic_on_oom;
extern int sysctl_oom_kill_allocating_task;
extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_forkbomb_thres;

#endif /* __KERNEL__*/
#endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -963,6 +963,14 @@ static struct ctl_table vm_table[] = {
.proc_handler = proc_dointvec,
},
{
+ .procname = "oom_forkbomb_thres",
+ .data = &sysctl_oom_forkbomb_thres,
+ .maxlen = sizeof(sysctl_oom_forkbomb_thres),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ },
+ {
.procname = "overcommit_ratio",
.data = &sysctl_overcommit_ratio,
.maxlen = sizeof(sysctl_overcommit_ratio),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -4,6 +4,8 @@
* Copyright (C) 1998,2000 Rik van Riel
* Thanks go out to Claus Fischer for some serious inspiration and
* for goading me into coding this file...
+ * Copyright (C) 2010 Google, Inc
+ * Rewritten by David Rientjes
*
* The routines in this file are used to kill a process when
* we're seriously out of memory. This gets called from __alloc_pages()
@@ -32,8 +34,8 @@
int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks;
+int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
static DEFINE_SPINLOCK(zone_scan_lock);
-/* #define DEBUG */

/*
* Do all threads of the target process overlap our allowed nodes?
@@ -68,138 +70,125 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
return false;
}

-/**
- * badness - calculate a numeric value for how bad this task has been
- * @p: task struct of which task we should calculate
- * @uptime: current uptime in seconds
+/*
+ * Tasks that fork a very large number of children with seperate address spaces
+ * may be the result of a bug, user error, or a malicious application. The oom
+ * killer assesses a penalty equaling
*
- * The formula used is relatively simple and documented inline in the
- * function. The main rationale is that we want to select a good task
- * to kill when we run out of memory.
+ * (average rss of children) * (# of 1st generation execve children)
+ * -----------------------------------------------------------------
+ * sysctl_oom_forkbomb_thres
*
- * Good in this context means that:
- * 1) we lose the minimum amount of work done
- * 2) we recover a large amount of memory
- * 3) we don't kill anything innocent of eating tons of memory
- * 4) we want to kill the minimum amount of processes (one)
- * 5) we try to kill the process the user expects us to kill, this
- * algorithm has been meticulously tuned to meet the principle
- * of least surprise ... (be careful when you change it)
+ * for such tasks to target the parent. oom_kill_process() will attempt to
+ * first kill a child, so there's no risk of killing an important system daemon
+ * via this method. The goal is to give the user a chance to recover from the
+ * error rather than deplete all memory.
*/
-
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+static unsigned long oom_forkbomb_penalty(struct task_struct *tsk)
{
- unsigned long points, cpu_time, run_time;
- struct mm_struct *mm;
struct task_struct *child;
- int oom_adj = p->signal->oom_adj;
- struct task_cputime task_time;
- unsigned long utime;
- unsigned long stime;
+ unsigned long child_rss = 0;
+ int forkcount = 0;

- if (oom_adj == OOM_DISABLE)
+ if (!sysctl_oom_forkbomb_thres)
return 0;
+ list_for_each_entry(child, &tsk->children, sibling) {
+ struct task_cputime task_time;
+ unsigned long runtime;

- task_lock(p);
- mm = p->mm;
- if (!mm) {
- task_unlock(p);
- return 0;
+ task_lock(child);
+ if (!child->mm || child->mm == tsk->mm) {
+ task_unlock(child);
+ continue;
+ }
+ thread_group_cputime(child, &task_time);
+ runtime = cputime_to_jiffies(task_time.utime) +
+ cputime_to_jiffies(task_time.stime);
+ /*
+ * Only threads that have run for less than a second are
+ * considered toward the forkbomb penalty, these threads rarely
+ * get to execute at all in such cases anyway.
+ */
+ if (runtime < HZ) {
+ child_rss += get_mm_rss(child->mm);
+ forkcount++;
+ }
+ task_unlock(child);
}

/*
- * The memory size of the process is the basis for the badness.
+ * Forkbombs get penalized by:
+ *
+ * (average rss of children) * (# of first-generation execve children) /
+ * sysctl_oom_forkbomb_thres
*/
- points = mm->total_vm;
+ return forkcount > sysctl_oom_forkbomb_thres ?
+ (child_rss / sysctl_oom_forkbomb_thres) : 0;
+}
+
+/**
+ * oom_badness - heuristic function to determine which candidate task to kill
+ * @p: task struct of which task we should calculate
+ * @totalpages: total present RAM allowed for page allocation
+ * @uptime: current uptime in seconds
+ *
+ * The heuristic for determining which task to kill is made to be as simple and
+ * predictable as possible. The goal is to return the highest value for the
+ * task consuming the most memory to avoid subsequent oom conditions.
+ */
+unsigned int oom_badness(struct task_struct *p, unsigned long totalpages,
+ unsigned long uptime)
+{
+ struct mm_struct *mm;
+ int points;

/*
- * After this unlock we can no longer dereference local variable `mm'
+ * Shortcut check for OOM_DISABLE so the entire heuristic doesn't need
+ * to be executed for something that can't be killed.
*/
- task_unlock(p);
+ if (p->signal->oom_adj == OOM_DISABLE)
+ return 0;

/*
- * swapoff can easily use up all memory, so kill those first.
+ * When the PF_OOM_ORIGIN bit is set, it indicates the task should have
+ * priority for oom killing.
*/
if (p->flags & PF_OOM_ORIGIN)
- return ULONG_MAX;
+ return 1000;

- /*
- * Processes which fork a lot of child processes are likely
- * a good choice. We add half the vmsize of the children if they
- * have an own mm. This prevents forking servers to flood the
- * machine with an endless amount of children. In case a single
- * child is eating the vast majority of memory, adding only half
- * to the parents will make the child our kill candidate of choice.
- */
- list_for_each_entry(child, &p->children, sibling) {
- task_lock(child);
- if (child->mm != mm && child->mm)
- points += child->mm->total_vm/2 + 1;
- task_unlock(child);
+ task_lock(p);
+ mm = p->mm;
+ if (!mm) {
+ task_unlock(p);
+ return 0;
}

/*
- * CPU time is in tens of seconds and run time is in thousands
- * of seconds. There is no particular reason for this other than
- * that it turned out to work very well in practice.
- */
- thread_group_cputime(p, &task_time);
- utime = cputime_to_jiffies(task_time.utime);
- stime = cputime_to_jiffies(task_time.stime);
- cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
-
-
- if (uptime >= p->start_time.tv_sec)
- run_time = (uptime - p->start_time.tv_sec) >> 10;
- else
- run_time = 0;
-
- if (cpu_time)
- points /= int_sqrt(cpu_time);
- if (run_time)
- points /= int_sqrt(int_sqrt(run_time));
-
- /*
- * Niced processes are most likely less important, so double
- * their badness points.
+ * The baseline for the badness score is the proportion of RAM that each
+ * task's rss and swap space use.
*/
- if (task_nice(p) > 0)
- points *= 2;
-
- /*
- * Superuser processes are usually more important, so we make it
- * less likely that we kill those.
- */
- if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
- has_capability_noaudit(p, CAP_SYS_RESOURCE))
- points /= 4;
+ points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
+ totalpages;
+ task_unlock(p);
+ points += oom_forkbomb_penalty(p);

/*
- * We don't want to kill a process with direct hardware access.
- * Not only could that mess up the hardware, but usually users
- * tend to only have this flag set on applications they think
- * of as important.
+ * Root processes get 3% bonus, just like the __vm_enough_memory() used
+ * by LSMs.
*/
- if (has_capability_noaudit(p, CAP_SYS_RAWIO))
- points /= 4;
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN))
+ points -= 30;

/*
- * Adjust the score by oom_adj.
+ * /proc/pid/oom_adj ranges from -1000 to +1000 such that the range
+ * may either completely disable oom killing or always prefer a certain
+ * task.
*/
- if (oom_adj) {
- if (oom_adj > 0) {
- if (!points)
- points = 1;
- points <<= oom_adj;
- } else
- points >>= -(oom_adj);
- }
+ points += p->signal->oom_adj;

-#ifdef DEBUG
- printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
- p->pid, p->comm, points);
-#endif
- return points;
+ if (points < 0)
+ return 0;
+ return (points <= 1000) ? points : 1000;
}

/*
@@ -207,11 +196,21 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
*/
#ifdef CONFIG_NUMA
static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask, nodemask_t *nodemask)
+ gfp_t gfp_mask, nodemask_t *nodemask,
+ unsigned long *totalpages)
{
struct zone *zone;
struct zoneref *z;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+ bool cpuset_limited = false;
+ int nid;
+
+ /* Default to all anonymous memory, page cache, and swap */
+ *totalpages = global_page_state(NR_INACTIVE_ANON) +
+ global_page_state(NR_ACTIVE_ANON) +
+ global_page_state(NR_INACTIVE_FILE) +
+ global_page_state(NR_ACTIVE_FILE) +
+ total_swap_pages;

/*
* Reach here only when __GFP_NOFAIL is used. So, we should avoid
@@ -222,25 +221,41 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
return CONSTRAINT_NONE;

/*
- * The nodemask here is a nodemask passed to alloc_pages(). Now,
- * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
- * feature. mempolicy is an only user of nodemask here.
- * check mempolicy's nodemask contains all N_HIGH_MEMORY
+ * This is not a __GFP_THISNODE allocation, so a truncated nodemask in
+ * the page allocator means a mempolicy is in effect. Cpuset policy
+ * is enforced in get_page_from_freelist().
*/
- if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
+ if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
+ *totalpages = total_swap_pages;
+ for_each_node_mask(nid, *nodemask)
+ *totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
+ node_page_state(nid, NR_ACTIVE_ANON) +
+ node_page_state(nid, NR_INACTIVE_FILE) +
+ node_page_state(nid, NR_ACTIVE_FILE);
return CONSTRAINT_MEMORY_POLICY;
+ }

/* Check this allocation failure is caused by cpuset's wall function */
for_each_zone_zonelist_nodemask(zone, z, zonelist,
high_zoneidx, nodemask)
if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
- return CONSTRAINT_CPUSET;
-
+ cpuset_limited = true;
+
+ if (cpuset_limited) {
+ *totalpages = total_swap_pages;
+ for_each_node_mask(nid, cpuset_current_mems_allowed)
+ *totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
+ node_page_state(nid, NR_ACTIVE_ANON) +
+ node_page_state(nid, NR_INACTIVE_FILE) +
+ node_page_state(nid, NR_ACTIVE_FILE);
+ return CONSTRAINT_CPUSET;
+ }
return CONSTRAINT_NONE;
}
#else
static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask, nodemask_t *nodemask)
+ gfp_t gfp_mask, nodemask_t *nodemask,
+ unsigned long *totalpages)
{
return CONSTRAINT_NONE;
}
@@ -252,9 +267,9 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
*
* (not docbooked, we don't want this one cluttering up the manual)
*/
-static struct task_struct *select_bad_process(unsigned long *ppoints,
- struct mem_cgroup *mem, enum oom_constraint constraint,
- const nodemask_t *mask)
+static struct task_struct *select_bad_process(unsigned int *ppoints,
+ unsigned long totalpages, struct mem_cgroup *mem,
+ enum oom_constraint constraint, const nodemask_t *mask)
{
struct task_struct *p;
struct task_struct *chosen = NULL;
@@ -263,7 +278,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,

do_posix_clock_monotonic_gettime(&uptime);
for_each_process(p) {
- unsigned long points;
+ unsigned int points;

/*
* skip kernel threads and tasks which have already released
@@ -308,13 +323,13 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
return ERR_PTR(-1UL);

chosen = p;
- *ppoints = ULONG_MAX;
+ *ppoints = 1000;
}

if (p->signal->oom_adj == OOM_DISABLE)
continue;

- points = badness(p, uptime.tv_sec);
+ points = oom_badness(p, totalpages, uptime.tv_sec);
if (points > *ppoints || !chosen) {
chosen = p;
*ppoints = points;
@@ -449,12 +464,12 @@ static int oom_kill_task(struct task_struct *p)
}

static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
- unsigned long points, struct mem_cgroup *mem,
- const char *message)
+ unsigned int points, unsigned long totalpages,
+ struct mem_cgroup *mem, const char *message)
{
struct task_struct *victim = p;
struct task_struct *c;
- unsigned long victim_points = 0;
+ unsigned int victim_points = 0;
struct timespec uptime;

if (printk_ratelimit())
@@ -469,19 +484,19 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
return 0;
}

- pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
+ pr_err("%s: Kill process %d (%s) with score %d or sacrifice child\n",
message, task_pid_nr(p), p->comm, points);

/* Try to sacrifice the worst child first */
do_posix_clock_monotonic_gettime(&uptime);
list_for_each_entry(c, &p->children, sibling) {
- unsigned long cpoints;
+ unsigned int cpoints;

if (c->mm == p->mm)
continue;

- /* badness() returns 0 if the thread is unkillable */
- cpoints = badness(c, uptime.tv_sec);
+ /* oom_badness() returns 0 if the thread is unkillable */
+ cpoints = oom_badness(c, totalpages, uptime.tv_sec);
if (cpoints > victim_points) {
victim = c;
victim_points = cpoints;
@@ -493,19 +508,22 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
{
- unsigned long points = 0;
+ unsigned int points = 0;
struct task_struct *p;
+ unsigned long limit;

+ limit = (res_counter_read_u64(&mem->res, RES_LIMIT) >> PAGE_SHIFT) +
+ (res_counter_read_u64(&mem->memsw, RES_LIMIT) >> PAGE_SHIFT);
read_lock(&tasklist_lock);
retry:
- p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
+ p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
if (PTR_ERR(p) == -1UL)
goto out;

if (!p)
p = current;

- if (oom_kill_process(p, gfp_mask, 0, points, mem,
+ if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
"Memory cgroup out of memory"))
goto retry;
out:
@@ -580,22 +598,22 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
/*
* Must be called with tasklist_lock held for read.
*/
-static void __out_of_memory(gfp_t gfp_mask, int order,
+static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
enum oom_constraint constraint, const nodemask_t *mask)
{
struct task_struct *p;
- unsigned long points;
+ unsigned int points;

if (sysctl_oom_kill_allocating_task)
- if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
- "Out of memory (oom_kill_allocating_task)"))
+ if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
+ NULL, "Out of memory (oom_kill_allocating_task)"))
return;
retry:
/*
* Rambo mode: Shoot down a process and hope it solves whatever
* issues we may have.
*/
- p = select_bad_process(&points, NULL, constraint, mask);
+ p = select_bad_process(&points, totalpages, NULL, constraint, mask);

if (PTR_ERR(p) == -1UL)
return;
@@ -607,7 +625,7 @@ retry:
panic("Out of memory and no killable processes...\n");
}

- if (oom_kill_process(p, gfp_mask, order, points, NULL,
+ if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
"Out of memory"))
goto retry;
}
@@ -618,6 +636,7 @@ retry:
*/
void pagefault_out_of_memory(void)
{
+ unsigned long totalpages;
unsigned long freed = 0;

blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
@@ -635,9 +654,14 @@ void pagefault_out_of_memory(void)
if (sysctl_panic_on_oom)
panic("out of memory from page fault. panic_on_oom is selected.\n");

+ totalpages = global_page_state(NR_INACTIVE_ANON) +
+ global_page_state(NR_ACTIVE_ANON) +
+ global_page_state(NR_INACTIVE_FILE) +
+ global_page_state(NR_ACTIVE_FILE) +
+ total_swap_pages;
read_lock(&tasklist_lock);
/* unknown gfp_mask and order */
- __out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
+ __out_of_memory(0, 0, totalpages, CONSTRAINT_NONE, NULL);
read_unlock(&tasklist_lock);

/*
@@ -665,6 +689,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
int order, nodemask_t *nodemask)
{
unsigned long freed = 0;
+ unsigned long totalpages = 0;
enum oom_constraint constraint;

blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
@@ -681,7 +706,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
- constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+ constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
+ &totalpages);
read_lock(&tasklist_lock);
if (unlikely(sysctl_panic_on_oom)) {
/*
@@ -694,7 +720,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
panic("Out of memory: panic_on_oom is enabled\n");
}
}
- __out_of_memory(gfp_mask, order, constraint, nodemask);
+ __out_of_memory(gfp_mask, order, totalpages, constraint, nodemask);
read_unlock(&tasklist_lock);

/*

2010-02-10 16:32:30

by David Rientjes

[permalink] [raw]
Subject: [patch 6/7 -mm] oom: avoid oom killer for lowmem allocations

If memory has been depleted in lowmem zones even with the protection
afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
killing current users will help. The memory is either reclaimable (or
migratable) already, in which case we should not invoke the oom killer at
all, or it is pinned by an application for I/O. Killing such an
application may leave the hardware in an unspecified state and there is
no guarantee that it will be able to make a timely exit.

Lowmem allocations are now failed in oom conditions so that the task can
perhaps recover or try again later. Killing current is an unnecessary
result for simply making a GFP_DMA or GFP_DMA32 page allocation and no
lowmem allocations use the now-deprecated __GFP_NOFAIL bit so retrying is
unnecessary.

Previously, the heuristic provided some protection for those tasks with
CAP_SYS_RAWIO, but this is no longer necessary since we will not be
killing tasks for the purposes of ISA allocations.

Signed-off-by: David Rientjes <[email protected]>
---
mm/page_alloc.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1914,6 +1914,9 @@ rebalance:
* running out of options and have to consider going OOM
*/
if (!did_some_progress) {
+ /* The oom killer won't necessarily free lowmem */
+ if (high_zoneidx < ZONE_NORMAL)
+ goto nopage;
if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
if (oom_killer_disabled)
goto nopage;

2010-02-10 16:32:38

by David Rientjes

[permalink] [raw]
Subject: [patch 7/7 -mm] oom: remove unnecessary code and cleanup

Remove the redundancy in __oom_kill_task() since:

- init can never be passed to this function: it will never be PF_EXITING
or selectable from select_bad_process(), and

- it will never be passed a task from oom_kill_task() without an ->mm
and we're unconcerned about detachment from exiting tasks, there's no
reason to protect them against SIGKILL or access to memory reserves.

Also moves the kernel log message to a higher level since the verbosity
is not always emitted here; we need not print an error message if an
exiting task is given a longer timeslice.

Signed-off-by: David Rientjes <[email protected]>
---
mm/oom_kill.c | 64 ++++++++++++++------------------------------------------
1 files changed, 16 insertions(+), 48 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -400,67 +400,35 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
dump_tasks(mem);
}

-#define K(x) ((x) << (PAGE_SHIFT-10))
-
/*
- * Send SIGKILL to the selected process irrespective of CAP_SYS_RAW_IO
- * flag though it's unlikely that we select a process with CAP_SYS_RAW_IO
- * set.
+ * Give the oom killed task high priority and access to memory reserves so that
+ * it may quickly exit and free its memory.
*/
-static void __oom_kill_task(struct task_struct *p, int verbose)
+static void __oom_kill_task(struct task_struct *p)
{
- if (is_global_init(p)) {
- WARN_ON(1);
- printk(KERN_WARNING "tried to kill init!\n");
- return;
- }
-
- task_lock(p);
- if (!p->mm) {
- WARN_ON(1);
- printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
- task_pid_nr(p), p->comm);
- task_unlock(p);
- return;
- }
-
- if (verbose)
- printk(KERN_ERR "Killed process %d (%s) "
- "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
- task_pid_nr(p), p->comm,
- K(p->mm->total_vm),
- K(get_mm_counter(p->mm, MM_ANONPAGES)),
- K(get_mm_counter(p->mm, MM_FILEPAGES)));
- task_unlock(p);
-
- /*
- * We give our sacrificial lamb high priority and access to
- * all the memory it needs. That way it should be able to
- * exit() and clear out its resources quickly...
- */
p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);
-
force_sig(SIGKILL, p);
}

+#define K(x) ((x) << (PAGE_SHIFT-10))
static int oom_kill_task(struct task_struct *p)
{
- /* WARNING: mm may not be dereferenced since we did not obtain its
- * value from get_task_mm(p). This is OK since all we need to do is
- * compare mm to q->mm below.
- *
- * Furthermore, even if mm contains a non-NULL value, p->mm may
- * change to NULL at any time since we do not hold task_lock(p).
- * However, this is of no concern to us.
- */
- if (!p->mm || p->signal->oom_adj == OOM_DISABLE)
+ task_lock(p);
+ if (!p->mm || p->signal->oom_adj == OOM_DISABLE) {
+ task_unlock(p);
return 1;
+ }
+ pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
+ task_pid_nr(p), p->comm, K(p->mm->total_vm),
+ K(get_mm_counter(p->mm, MM_ANONPAGES)),
+ K(get_mm_counter(p->mm, MM_FILEPAGES)));
+ task_unlock(p);

- __oom_kill_task(p, 1);
-
+ __oom_kill_task(p);
return 0;
}
+#undef K

static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
unsigned int points, unsigned long totalpages,
@@ -479,7 +447,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
* its children or threads, just set TIF_MEMDIE so it can die quickly
*/
if (p->flags & PF_EXITING) {
- __oom_kill_task(p, 0);
+ __oom_kill_task(p);
return 0;
}

2010-02-10 16:32:56

by David Rientjes

[permalink] [raw]
Subject: [patch 5/7 -mm] oom: replace sysctls with quick mode

Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were
implemented for very large systems to avoid excessively long tasklist
scans. The former suppresses helpful diagnostic messages that are
emitted for each thread group leader that are candidates for oom kill
including their pid, uid, vm size, rss, oom_adj value, and name; this
information is very helpful to users in understanding why a particular
task was chosen for kill over others. The latter simply kills current,
the task triggering the oom condition, instead of iterating through the
tasklist looking for the worst offender.

Both of these sysctls are combined into one for use on the aforementioned
large systems: oom_kill_quick. This disables the now-default
oom_dump_tasks and kills current whenever the oom killer is called.

The oom killer rewrite is the perfect opportunity to combine both sysctls
into one instead of carrying around the others for years to come for
nothing else than legacy purposes.

Signed-off-by: David Rientjes <[email protected]>
---
Documentation/sysctl/vm.txt | 44 +++++-------------------------------------
include/linux/oom.h | 3 +-
kernel/sysctl.c | 13 ++---------
mm/oom_kill.c | 9 +++----
4 files changed, 14 insertions(+), 55 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -43,9 +43,8 @@ Currently, these files are in /proc/sys/vm:
- nr_pdflush_threads
- nr_trim_pages (only if CONFIG_MMU=n)
- numa_zonelist_order
-- oom_dump_tasks
- oom_forkbomb_thres
-- oom_kill_allocating_task
+- oom_kill_quick
- overcommit_memory
- overcommit_ratio
- page-cluster
@@ -470,27 +469,6 @@ this is causing problems for your system/application.

==============================================================

-oom_dump_tasks
-
-Enables a system-wide task dump (excluding kernel threads) to be
-produced when the kernel performs an OOM-killing and includes such
-information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and
-name. This is helpful to determine why the OOM killer was invoked
-and to identify the rogue task that caused it.
-
-If this is set to zero, this information is suppressed. On very
-large systems with thousands of tasks it may not be feasible to dump
-the memory state information for each one. Such systems should not
-be forced to incur a performance penalty in OOM conditions when the
-information may not be desired.
-
-If this is set to non-zero, this information is shown whenever the
-OOM killer actually kills a memory-hogging task.
-
-The default value is 0.
-
-==============================================================
-
oom_forkbomb_thres

This value defines how many children with a seperate address space a specific
@@ -511,22 +489,12 @@ The default value is 1000.

==============================================================

-oom_kill_allocating_task
-
-This enables or disables killing the OOM-triggering task in
-out-of-memory situations.
-
-If this is set to zero, the OOM killer will scan through the entire
-tasklist and select a task based on heuristics to kill. This normally
-selects a rogue memory-hogging task that frees up a large amount of
-memory when killed.
-
-If this is set to non-zero, the OOM killer simply kills the task that
-triggered the out-of-memory condition. This avoids the expensive
-tasklist scan.
+oom_kill_quick

-If panic_on_oom is selected, it takes precedence over whatever value
-is used in oom_kill_allocating_task.
+When enabled, this will always kill the task that triggered the oom killer, i.e.
+the task that attempted to allocate memory that could not be found. It also
+suppresses the tasklist dump to the kernel log whenever the oom killer is
+called. Typically set on systems with an extremely large number of tasks.

The default value is 0.

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -51,8 +51,7 @@ static inline void oom_killer_enable(void)
}
/* for sysctl */
extern int sysctl_panic_on_oom;
-extern int sysctl_oom_kill_allocating_task;
-extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_kill_quick;
extern int sysctl_oom_forkbomb_thres;

#endif /* __KERNEL__*/
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -949,16 +949,9 @@ static struct ctl_table vm_table[] = {
.proc_handler = proc_dointvec,
},
{
- .procname = "oom_kill_allocating_task",
- .data = &sysctl_oom_kill_allocating_task,
- .maxlen = sizeof(sysctl_oom_kill_allocating_task),
- .mode = 0644,
- .proc_handler = proc_dointvec,
- },
- {
- .procname = "oom_dump_tasks",
- .data = &sysctl_oom_dump_tasks,
- .maxlen = sizeof(sysctl_oom_dump_tasks),
+ .procname = "oom_kill_quick",
+ .data = &sysctl_oom_kill_quick,
+ .maxlen = sizeof(sysctl_oom_kill_quick),
.mode = 0644,
.proc_handler = proc_dointvec,
},
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -32,9 +32,8 @@
#include <linux/security.h>

int sysctl_panic_on_oom;
-int sysctl_oom_kill_allocating_task;
-int sysctl_oom_dump_tasks;
int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
+int sysctl_oom_kill_quick;
static DEFINE_SPINLOCK(zone_scan_lock);

/*
@@ -397,7 +396,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
dump_stack();
mem_cgroup_print_oom_info(mem, p);
show_mem();
- if (sysctl_oom_dump_tasks)
+ if (!sysctl_oom_kill_quick)
dump_tasks(mem);
}

@@ -604,9 +603,9 @@ static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
struct task_struct *p;
unsigned int points;

- if (sysctl_oom_kill_allocating_task)
+ if (sysctl_oom_kill_quick)
if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
- NULL, "Out of memory (oom_kill_allocating_task)"))
+ NULL, "Out of memory (quick mode)"))
return;
retry:
/*

2010-02-10 16:33:15

by David Rientjes

[permalink] [raw]
Subject: [patch 3/7 -mm] oom: select task from tasklist for mempolicy ooms

The oom killer presently kills current whenever there is no more memory
free or reclaimable on its mempolicy's nodes. There is no guarantee that
current is a memory-hogging task or that killing it will free any
substantial amount of memory, however.

In such situations, it is better to scan the tasklist for nodes that are
allowed to allocate on current's set of nodes and kill the task with the
highest badness() score. This ensures that the most memory-hogging task,
or the one configured by the user with /proc/pid/oom_adj, is always
selected in such scenarios.

Signed-off-by: David Rientjes <[email protected]>
---
include/linux/mempolicy.h | 13 +++++++-
mm/mempolicy.c | 39 +++++++++++++++++++++++
mm/oom_kill.c | 77 +++++++++++++++++++++++++++-----------------
3 files changed, 98 insertions(+), 31 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -202,6 +202,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
unsigned long addr, gfp_t gfp_flags,
struct mempolicy **mpol, nodemask_t **nodemask);
extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
+extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+ const nodemask_t *mask);
extern unsigned slab_node(struct mempolicy *policy);

extern enum zone_type policy_zone;
@@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
return node_zonelist(0, gfp_flags);
}

-static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
+static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
+{
+ return false;
+}
+
+static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+ const nodemask_t *mask)
+{
+ return false;
+}

static inline int do_migrate_pages(struct mm_struct *mm,
const nodemask_t *from_nodes,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1638,6 +1638,45 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
}
#endif

+/*
+ * mempolicy_nodemask_intersects
+ *
+ * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
+ * policy. Otherwise, check for intersection between mask and the policy
+ * nodemask for 'bind' or 'interleave' policy, or mask to contain the single
+ * node for 'preferred' or 'local' policy.
+ */
+bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+ const nodemask_t *mask)
+{
+ struct mempolicy *mempolicy;
+ bool ret = true;
+
+ mempolicy = tsk->mempolicy;
+ mpol_get(mempolicy);
+ if (!mask || !mempolicy)
+ goto out;
+
+ switch (mempolicy->mode) {
+ case MPOL_PREFERRED:
+ if (mempolicy->flags & MPOL_F_LOCAL)
+ ret = node_isset(numa_node_id(), *mask);
+ else
+ ret = node_isset(mempolicy->v.preferred_node,
+ *mask);
+ break;
+ case MPOL_BIND:
+ case MPOL_INTERLEAVE:
+ ret = nodes_intersects(mempolicy->v.nodes, *mask);
+ break;
+ default:
+ BUG();
+ }
+out:
+ mpol_put(mempolicy);
+ return ret;
+}
+
/* Allocate a page in interleaved policy.
Own path because it needs to do special accounting. */
static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -26,6 +26,7 @@
#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/memcontrol.h>
+#include <linux/mempolicy.h>
#include <linux/security.h>

int sysctl_panic_on_oom;
@@ -36,19 +37,35 @@ static DEFINE_SPINLOCK(zone_scan_lock);

/*
* Do all threads of the target process overlap our allowed nodes?
+ * @tsk: task struct of which task to consider
+ * @mask: nodemask passed to page allocator for mempolicy ooms
*/
-static int has_intersects_mems_allowed(struct task_struct *tsk)
+static bool has_intersects_mems_allowed(struct task_struct *tsk,
+ const nodemask_t *mask)
{
- struct task_struct *t;
+ struct task_struct *start = tsk;

- t = tsk;
do {
- if (cpuset_mems_allowed_intersects(current, t))
- return 1;
- t = next_thread(t);
- } while (t != tsk);
-
- return 0;
+ if (mask) {
+ /*
+ * If this is a mempolicy constrained oom, tsk's
+ * cpuset is irrelevant. Only return true if its
+ * mempolicy intersects current, otherwise it may be
+ * needlessly killed.
+ */
+ if (mempolicy_nodemask_intersects(tsk, mask))
+ return true;
+ } else {
+ /*
+ * This is not a mempolicy constrained oom, so only
+ * check the mems of tsk's cpuset.
+ */
+ if (cpuset_mems_allowed_intersects(current, tsk))
+ return true;
+ }
+ tsk = next_thread(tsk);
+ } while (tsk != start);
+ return false;
}

/**
@@ -236,7 +253,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
* (not docbooked, we don't want this one cluttering up the manual)
*/
static struct task_struct *select_bad_process(unsigned long *ppoints,
- struct mem_cgroup *mem)
+ struct mem_cgroup *mem, enum oom_constraint constraint,
+ const nodemask_t *mask)
{
struct task_struct *p;
struct task_struct *chosen = NULL;
@@ -258,7 +276,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
continue;
if (mem && !task_in_mem_cgroup(p, mem))
continue;
- if (!has_intersects_mems_allowed(p))
+ if (!has_intersects_mems_allowed(p,
+ constraint == CONSTRAINT_MEMORY_POLICY ? mask :
+ NULL))
continue;

/*
@@ -478,7 +498,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)

read_lock(&tasklist_lock);
retry:
- p = select_bad_process(&points, mem);
+ p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
if (PTR_ERR(p) == -1UL)
goto out;

@@ -560,7 +580,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
/*
* Must be called with tasklist_lock held for read.
*/
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, int order,
+ enum oom_constraint constraint, const nodemask_t *mask)
{
struct task_struct *p;
unsigned long points;
@@ -574,7 +595,7 @@ retry:
* Rambo mode: Shoot down a process and hope it solves whatever
* issues we may have.
*/
- p = select_bad_process(&points, NULL);
+ p = select_bad_process(&points, NULL, constraint, mask);

if (PTR_ERR(p) == -1UL)
return;
@@ -615,7 +636,8 @@ void pagefault_out_of_memory(void)
panic("out of memory from page fault. panic_on_oom is selected.\n");

read_lock(&tasklist_lock);
- __out_of_memory(0, 0); /* unknown gfp_mask and order */
+ /* unknown gfp_mask and order */
+ __out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
read_unlock(&tasklist_lock);

/*
@@ -632,6 +654,7 @@ rest_and_return:
* @zonelist: zonelist pointer
* @gfp_mask: memory allocation flags
* @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
*
* If we run out of memory, we have the choice between either
* killing a random task (bad), letting the system crash (worse)
@@ -660,24 +683,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
*/
constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
read_lock(&tasklist_lock);
-
- switch (constraint) {
- case CONSTRAINT_MEMORY_POLICY:
- oom_kill_process(current, gfp_mask, order, 0, NULL,
- "No available memory (MPOL_BIND)");
- break;
-
- case CONSTRAINT_NONE:
- if (sysctl_panic_on_oom) {
+ if (unlikely(sysctl_panic_on_oom)) {
+ /*
+ * panic_on_oom only affects CONSTRAINT_NONE, the kernel
+ * should not panic for cpuset or mempolicy induced memory
+ * failures.
+ */
+ if (constraint == CONSTRAINT_NONE) {
dump_header(NULL, gfp_mask, order, NULL);
- panic("out of memory. panic_on_oom is selected\n");
+ panic("Out of memory: panic_on_oom is enabled\n");
}
- /* Fall-through */
- case CONSTRAINT_CPUSET:
- __out_of_memory(gfp_mask, order);
- break;
}
-
+ __out_of_memory(gfp_mask, order, constraint, nodemask);
read_unlock(&tasklist_lock);

/*

2010-02-10 16:33:35

by David Rientjes

[permalink] [raw]
Subject: [patch 1/7 -mm] oom: filter tasks not sharing the same cpuset

Tasks that do not share the same set of allowed nodes with the task that
triggered the oom should not be considered as candidates for oom kill.

Tasks in other cpusets with a disjoint set of mems would be unfairly
penalized otherwise because of oom conditions elsewhere; an extreme
example could unfairly kill all other applications on the system if a
single task in a user's cpuset sets itself to OOM_DISABLE and then uses
more memory than allowed.

Killing tasks outside of current's cpuset rarely would free memory for
current anyway.

Signed-off-by: David Rientjes <[email protected]>
---
mm/oom_kill.c | 12 +++---------
1 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,7 +35,7 @@ static DEFINE_SPINLOCK(zone_scan_lock);
/* #define DEBUG */

/*
- * Is all threads of the target process nodes overlap ours?
+ * Do all threads of the target process overlap our allowed nodes?
*/
static int has_intersects_mems_allowed(struct task_struct *tsk)
{
@@ -167,14 +167,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
points /= 4;

/*
- * If p's nodes don't overlap ours, it may still help to kill p
- * because p may have allocated or otherwise mapped memory on
- * this node before. However it will be less likely.
- */
- if (!has_intersects_mems_allowed(p))
- points /= 8;
-
- /*
* Adjust the score by oom_adj.
*/
if (oom_adj) {
@@ -266,6 +258,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
continue;
if (mem && !task_in_mem_cgroup(p, mem))
continue;
+ if (!has_intersects_mems_allowed(p))
+ continue;

/*
* This task already has access to memory reserves and is

2010-02-10 17:09:38

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 1/7 -mm] oom: filter tasks not sharing the same cpuset

On 02/10/2010 11:32 AM, David Rientjes wrote:
> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
>
> Tasks in other cpusets with a disjoint set of mems would be unfairly
> penalized otherwise because of oom conditions elsewhere; an extreme
> example could unfairly kill all other applications on the system if a
> single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> more memory than allowed.
>
> Killing tasks outside of current's cpuset rarely would free memory for
> current anyway.
>
> Signed-off-by: David Rientjes<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed.

2010-02-10 20:53:28

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 2/7 -mm] oom: sacrifice child with highest badness score for parent

On 02/10/2010 11:32 AM, David Rientjes wrote:
> When a task is chosen for oom kill, the oom killer first attempts to
> sacrifice a child not sharing its parent's memory instead.
> Unfortunately, this often kills in a seemingly random fashion based on
> the ordering of the selected task's child list. Additionally, it is not
> guaranteed at all to free a large amount of memory that we need to
> prevent additional oom killing in the very near future.
>
> Instead, we now only attempt to sacrifice the worst child not sharing its
> parent's memory, if one exists. The worst child is indicated with the
> highest badness() score. This serves two advantages: we kill a
> memory-hogging task more often, and we allow the configurable
> /proc/pid/oom_adj value to be considered as a factor in which child to
> kill.
>
> Reviewers may observe that the previous implementation would iterate
> through the children and attempt to kill each until one was successful
> and then the parent if none were found while the new code simply kills
> the most memory-hogging task or the parent. Note that the only time
> oom_kill_task() fails, however, is when a child does not have an mm or
> has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both
> cases, so the final oom_kill_task() will always succeed.
>
> Signed-off-by: David Rientjes<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed.

2010-02-10 22:48:16

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 3/7 -mm] oom: select task from tasklist for mempolicy ooms

On 02/10/2010 11:32 AM, David Rientjes wrote:
> The oom killer presently kills current whenever there is no more memory
> free or reclaimable on its mempolicy's nodes. There is no guarantee that
> current is a memory-hogging task or that killing it will free any
> substantial amount of memory, however.
>
> In such situations, it is better to scan the tasklist for nodes that are
> allowed to allocate on current's set of nodes and kill the task with the
> highest badness() score. This ensures that the most memory-hogging task,
> or the one configured by the user with /proc/pid/oom_adj, is always
> selected in such scenarios.
>
> Signed-off-by: David Rientjes<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed.

2010-02-11 04:11:21

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On 02/10/2010 11:32 AM, David Rientjes wrote:

> OOM_ADJUST_MIN and OOM_ADJUST_MAX have been exported to userspace since
> 2006 via include/linux/oom.h. This alters their values from -16 to -1000
> and from +15 to +1000, respectively.

That seems like a bad idea. Google may have the luxury of
being able to recompile all its in-house applications, but
this will not be true for many other users of /proc/<pid>/oom_adj

> +/*
> + * Tasks that fork a very large number of children with seperate address spaces
> + * may be the result of a bug, user error, or a malicious application. The oom
> + * killer assesses a penalty equaling

It could also be the result of the system getting many client
connections - think of overloaded mail, web or database servers.

--
All rights reversed.

2010-02-11 04:13:32

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 6/7 -mm] oom: avoid oom killer for lowmem allocations

On 02/10/2010 11:32 AM, David Rientjes wrote:

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1914,6 +1914,9 @@ rebalance:
> * running out of options and have to consider going OOM
> */
> if (!did_some_progress) {
> + /* The oom killer won't necessarily free lowmem */
> + if (high_zoneidx< ZONE_NORMAL)
> + goto nopage;
> if ((gfp_mask& __GFP_FS)&& !(gfp_mask& __GFP_NORETRY)) {
> if (oom_killer_disabled)
> goto nopage;

Are there architectures that only have one memory zone?

s390 or one of the other virtualized-only architectures perhaps?

--
All rights reversed.

2010-02-11 09:14:57

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Wed, 10 Feb 2010, Rik van Riel wrote:

> > OOM_ADJUST_MIN and OOM_ADJUST_MAX have been exported to userspace since
> > 2006 via include/linux/oom.h. This alters their values from -16 to -1000
> > and from +15 to +1000, respectively.
>
> That seems like a bad idea. Google may have the luxury of
> being able to recompile all its in-house applications, but
> this will not be true for many other users of /proc/<pid>/oom_adj
>

Changing any value that may have a tendency to be hardcoded elsewhere is
always controversial, but I think the nature of /proc/pid/oom_adj allows
us to do so for two specific reasons:

- hardcoded values tend not the fall within a range, they tend to either
always prefer a certain task for oom kill first or disable oom killing
entirely. The current implementation uses this as a bitshift on a
seemingly unpredictable and unscientific heuristic that is very
difficult to predict at runtime. This means that fewer and fewer
applications would hardcode a value of '8', for example, because its
semantics depends entirely on RAM capacity of the system to begin with
since badness() scores are only useful when used in comparison with
other tasks.

- the badness() heuristic is radically changed from what it is currently
so this gives applications that hardcoded /proc/pid/oom_adj values into
their software a reason to notice the change and adjust to the new
semantics of the badness score. Using /proc/pid/oom_adj as a bitshift
has no real application to any sane heuristic that represents scores in
units of meaning, so users should end up with a net benefit of the
change by being able to better tune the oom killing behavior with a
much more powerful and easier to understand heuristic that requires
them to recalculate exactly what oom_adj should be for any given
application in terms of real units and business goals.

As mentioned in the changelog, we've exported these minimum and maximum
values via a kernel header file since at least 2006. At what point do we
assume they are going to be used and not hardcoded into applications?
That was certainly the intention when making them user visible.

> > +/*
> > + * Tasks that fork a very large number of children with seperate address
> > spaces
> > + * may be the result of a bug, user error, or a malicious application. The
> > oom
> > + * killer assesses a penalty equaling
>
> It could also be the result of the system getting many client
> connections - think of overloaded mail, web or database servers.
>

True, that's a great example of why child tasks should be sacrificed for
the parent: if the oom killer is being called then we are truly overloaded
and there's no shame in killing excessive client connections to recover,
otherwise we might find the entire server becoming unresponsive. The user
can easily tune to /proc/sys/vm/oom_forkbomb_thres to define what
"excessive" is to assess the penalty, if any. I'll add that to the
comment if we require a second revision.

Thanks for your speedy review of this patchset so far, Rik!

2010-02-11 09:19:50

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 6/7 -mm] oom: avoid oom killer for lowmem allocations

On Wed, 10 Feb 2010, Rik van Riel wrote:

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1914,6 +1914,9 @@ rebalance:
> > * running out of options and have to consider going OOM
> > */
> > if (!did_some_progress) {
> > + /* The oom killer won't necessarily free lowmem */
> > + if (high_zoneidx< ZONE_NORMAL)
> > + goto nopage;
> > if ((gfp_mask& __GFP_FS)&& !(gfp_mask& __GFP_NORETRY)) {
> > if (oom_killer_disabled)
> > goto nopage;
>
> Are there architectures that only have one memory zone?
>

It actually ends up not to matter because of how gfp_zone() is implemented
(and you can do it with mem= on architectures with larger ZONE_DMA zones
such as ia64). ZONE_NORMAL is always guaranteed to be defined regardless
of architecture or configuration because it's the default zone for memory
allocation unless specified by a bit in GFP_ZONEMASK, it doesn't matter
whether it actually has memory or not. high_zoneidx in this case is just
gfp_zone(gfp_flags) which always defaults to ZONE_NORMAL when one of the
GFP_ZONEMASK bits is not set. Thus, the only way to for the conditional
in this patch to be true is when __GFP_DMA, or __GFP_DMA32 for x86_64, is
passed to the page allocator and CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is
enabled, respectively.

2010-02-11 14:09:26

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 6/7 -mm] oom: avoid oom killer for lowmem allocations

On 02/11/2010 04:19 AM, David Rientjes wrote:
> On Wed, 10 Feb 2010, Rik van Riel wrote:
>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -1914,6 +1914,9 @@ rebalance:
>>> * running out of options and have to consider going OOM
>>> */
>>> if (!did_some_progress) {
>>> + /* The oom killer won't necessarily free lowmem */
>>> + if (high_zoneidx< ZONE_NORMAL)
>>> + goto nopage;
>>> if ((gfp_mask& __GFP_FS)&& !(gfp_mask& __GFP_NORETRY)) {
>>> if (oom_killer_disabled)
>>> goto nopage;
>>
>> Are there architectures that only have one memory zone?
>>
>
> It actually ends up not to matter because of how gfp_zone() is implemented
> (and you can do it with mem= on architectures with larger ZONE_DMA zones
> such as ia64). ZONE_NORMAL is always guaranteed to be defined regardless
> of architecture or configuration because it's the default zone for memory
> allocation unless specified by a bit in GFP_ZONEMASK, it doesn't matter
> whether it actually has memory or not. high_zoneidx in this case is just
> gfp_zone(gfp_flags) which always defaults to ZONE_NORMAL when one of the
> GFP_ZONEMASK bits is not set. Thus, the only way to for the conditional
> in this patch to be true is when __GFP_DMA, or __GFP_DMA32 for x86_64, is
> passed to the page allocator and CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is
> enabled, respectively.

Fair enough.

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed.

2010-02-11 15:07:56

by Nick Bowler

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On 01:14 Thu 11 Feb , David Rientjes wrote:
> On Wed, 10 Feb 2010, Rik van Riel wrote:
>
> > > OOM_ADJUST_MIN and OOM_ADJUST_MAX have been exported to userspace since
> > > 2006 via include/linux/oom.h. This alters their values from -16 to -1000
> > > and from +15 to +1000, respectively.

<snip>

> As mentioned in the changelog, we've exported these minimum and maximum
> values via a kernel header file since at least 2006. At what point do we
> assume they are going to be used and not hardcoded into applications?
> That was certainly the intention when making them user visible.

The thing is, even when the macros are used, their values are hardcoded
into programs once the code is run through a compiler. That's why it's
called an ABI.

--
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

2010-02-11 21:02:11

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Thu, 11 Feb 2010, Nick Bowler wrote:

> > As mentioned in the changelog, we've exported these minimum and maximum
> > values via a kernel header file since at least 2006. At what point do we
> > assume they are going to be used and not hardcoded into applications?
> > That was certainly the intention when making them user visible.
>
> The thing is, even when the macros are used, their values are hardcoded
> into programs once the code is run through a compiler. That's why it's
> called an ABI.
>

Right, that's the second point that I listed: since the semantics of the
tunable have radically changed from the bitshift to an actual unit
(proportion of available memory), those applications need to change how
they use oom_adj anyway. The bitshift simply isn't extendable with any
sane heuristic that is predictable or works with any reasonable amount of
granularity, so this change seems inevitable in the long term.

We may be forced to abandon /proc/pid/oom_adj itself and introduce the
tunable with a different name: oom_score_adj, for example, to make it
clear that it's a different entity.

2010-02-11 21:44:24

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Thu, 11 Feb 2010 01:14:43 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Wed, 10 Feb 2010, Rik van Riel wrote:
>
> > > OOM_ADJUST_MIN and OOM_ADJUST_MAX have been exported to userspace since
> > > 2006 via include/linux/oom.h. This alters their values from -16 to -1000
> > > and from +15 to +1000, respectively.
> >
> > That seems like a bad idea. Google may have the luxury of
> > being able to recompile all its in-house applications, but
> > this will not be true for many other users of /proc/<pid>/oom_adj
> >
>
> Changing any value that may have a tendency to be hardcoded elsewhere is
> always controversial, but I think the nature of /proc/pid/oom_adj allows
> us to do so for two specific reasons:
>
> - hardcoded values tend not the fall within a range, they tend to either
> always prefer a certain task for oom kill first or disable oom killing
> entirely. The current implementation uses this as a bitshift on a
> seemingly unpredictable and unscientific heuristic that is very
> difficult to predict at runtime. This means that fewer and fewer
> applications would hardcode a value of '8', for example, because its
> semantics depends entirely on RAM capacity of the system to begin with
> since badness() scores are only useful when used in comparison with
> other tasks.

You'd be amazed what dumb things applications do. Get thee to
http://google.com/codesearch?hl=en&lr=&q=[^a-z]oom_adj[^a-z]&sbtn=Search
and start reading. All 641 matches ;)

Here's one which which writes -16:
http://google.com/codesearch/p?hl=en#eN5TNOm7KtI/trunk/wlan/vendor/asus/eeepc/init.rc&q=[^a-z]oom_adj[^a-z]&sa=N&cd=70&ct=rc

Let's not change the ABI please.

2010-02-11 21:51:47

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Thu, 11 Feb 2010, Andrew Morton wrote:

> > Changing any value that may have a tendency to be hardcoded elsewhere is
> > always controversial, but I think the nature of /proc/pid/oom_adj allows
> > us to do so for two specific reasons:
> >
> > - hardcoded values tend not the fall within a range, they tend to either
> > always prefer a certain task for oom kill first or disable oom killing
> > entirely. The current implementation uses this as a bitshift on a
> > seemingly unpredictable and unscientific heuristic that is very
> > difficult to predict at runtime. This means that fewer and fewer
> > applications would hardcode a value of '8', for example, because its
> > semantics depends entirely on RAM capacity of the system to begin with
> > since badness() scores are only useful when used in comparison with
> > other tasks.
>
> You'd be amazed what dumb things applications do. Get thee to
> http://google.com/codesearch?hl=en&lr=&q=[^a-z]oom_adj[^a-z]&sbtn=Search
> and start reading. All 641 matches ;)
>
> Here's one which which writes -16:
> http://google.com/codesearch/p?hl=en#eN5TNOm7KtI/trunk/wlan/vendor/asus/eeepc/init.rc&q=[^a-z]oom_adj[^a-z]&sa=N&cd=70&ct=rc
>
> Let's not change the ABI please.
>

Sigh, this is going to require the amount of system memory to be
partitioned into OOM_ADJUST_MAX, 15, chunks and that's going to be the
granularity at which we'll be able to either bias or discount memory usage
of individual tasks by: instead of being able to do this with 0.1%
granularity we'll now be limited to 100 / 15, or ~7%. That's ~9GB on my
128GB system just because this was originally a bitshift. The upside is
that it's now linear and not exponential.

2010-02-11 22:31:51

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Thu, 11 Feb 2010 13:51:36 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Thu, 11 Feb 2010, Andrew Morton wrote:
>
> > > Changing any value that may have a tendency to be hardcoded elsewhere is
> > > always controversial, but I think the nature of /proc/pid/oom_adj allows
> > > us to do so for two specific reasons:
> > >
> > > - hardcoded values tend not the fall within a range, they tend to either
> > > always prefer a certain task for oom kill first or disable oom killing
> > > entirely. The current implementation uses this as a bitshift on a
> > > seemingly unpredictable and unscientific heuristic that is very
> > > difficult to predict at runtime. This means that fewer and fewer
> > > applications would hardcode a value of '8', for example, because its
> > > semantics depends entirely on RAM capacity of the system to begin with
> > > since badness() scores are only useful when used in comparison with
> > > other tasks.
> >
> > You'd be amazed what dumb things applications do. Get thee to
> > http://google.com/codesearch?hl=en&lr=&q=[^a-z]oom_adj[^a-z]&sbtn=Search
> > and start reading. All 641 matches ;)
> >
> > Here's one which which writes -16:
> > http://google.com/codesearch/p?hl=en#eN5TNOm7KtI/trunk/wlan/vendor/asus/eeepc/init.rc&q=[^a-z]oom_adj[^a-z]&sa=N&cd=70&ct=rc
> >
> > Let's not change the ABI please.
> >
>
> Sigh, this is going to require the amount of system memory to be
> partitioned into OOM_ADJUST_MAX, 15, chunks and that's going to be the
> granularity at which we'll be able to either bias or discount memory usage
> of individual tasks by: instead of being able to do this with 0.1%
> granularity we'll now be limited to 100 / 15, or ~7%. That's ~9GB on my
> 128GB system just because this was originally a bitshift. The upside is
> that it's now linear and not exponential.

Can you add newly-named knobs (rather than modifying the existing
ones), deprecate the old ones and then massage writes to the old ones
so that they talk into the new framework?

2010-02-11 22:42:47

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Thu, 11 Feb 2010, Andrew Morton wrote:

> > Sigh, this is going to require the amount of system memory to be
> > partitioned into OOM_ADJUST_MAX, 15, chunks and that's going to be the
> > granularity at which we'll be able to either bias or discount memory usage
> > of individual tasks by: instead of being able to do this with 0.1%
> > granularity we'll now be limited to 100 / 15, or ~7%. That's ~9GB on my
> > 128GB system just because this was originally a bitshift. The upside is
> > that it's now linear and not exponential.
>
> Can you add newly-named knobs (rather than modifying the existing
> ones), deprecate the old ones and then massage writes to the old ones
> so that they talk into the new framework?
>

That's what I was thinking, add /proc/pid/oom_score_adj that is just added
into the badness score (and is then exported with /proc/pid/oom_score)
like this patch did with oom_adj and then scale it into oom_adj units for
that tunable. A write to either oom_adj or oom_score_adj would change the
other, the same thing I did for /proc/sys/vm/dirty_{bytes,ratio} and
/proc/sys/vm/dirty_background_{bytes,ratio} which I guess we have to
support forever since the predecessors are part of the ABI and there's no
way to deprecate them since they'll never be removed for that reason.

2010-02-11 23:12:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Thu, 11 Feb 2010 14:42:39 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Thu, 11 Feb 2010, Andrew Morton wrote:
>
> > > Sigh, this is going to require the amount of system memory to be
> > > partitioned into OOM_ADJUST_MAX, 15, chunks and that's going to be the
> > > granularity at which we'll be able to either bias or discount memory usage
> > > of individual tasks by: instead of being able to do this with 0.1%
> > > granularity we'll now be limited to 100 / 15, or ~7%. That's ~9GB on my
> > > 128GB system just because this was originally a bitshift. The upside is
> > > that it's now linear and not exponential.
> >
> > Can you add newly-named knobs (rather than modifying the existing
> > ones), deprecate the old ones and then massage writes to the old ones
> > so that they talk into the new framework?
> >
>
> That's what I was thinking, add /proc/pid/oom_score_adj that is just added
> into the badness score (and is then exported with /proc/pid/oom_score)
> like this patch did with oom_adj and then scale it into oom_adj units for
> that tunable. A write to either oom_adj or oom_score_adj would change the
> other,

How ugly is all this?

> the same thing I did for /proc/sys/vm/dirty_{bytes,ratio} and
> /proc/sys/vm/dirty_background_{bytes,ratio} which I guess we have to
> support forever since the predecessors are part of the ABI and there's no
> way to deprecate them since they'll never be removed for that reason.

Ah, OK, I was trying to remember where we did that ;)

There _are_ things we can do though. Detect a write to the old file and
emit a WARN_ON_ONCE("you suck"). Wait a year, turn it into
WARN_ON("you really suck"). Wait a year, then remove it.

2010-02-11 23:31:22

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Thu, 11 Feb 2010, Andrew Morton wrote:

> > > > Sigh, this is going to require the amount of system memory to be
> > > > partitioned into OOM_ADJUST_MAX, 15, chunks and that's going to be the
> > > > granularity at which we'll be able to either bias or discount memory usage
> > > > of individual tasks by: instead of being able to do this with 0.1%
> > > > granularity we'll now be limited to 100 / 15, or ~7%. That's ~9GB on my
> > > > 128GB system just because this was originally a bitshift. The upside is
> > > > that it's now linear and not exponential.
> > >
> > > Can you add newly-named knobs (rather than modifying the existing
> > > ones), deprecate the old ones and then massage writes to the old ones
> > > so that they talk into the new framework?
> > >
> >
> > That's what I was thinking, add /proc/pid/oom_score_adj that is just added
> > into the badness score (and is then exported with /proc/pid/oom_score)
> > like this patch did with oom_adj and then scale it into oom_adj units for
> > that tunable. A write to either oom_adj or oom_score_adj would change the
> > other,
>
> How ugly is all this?
>

The advantages outweigh the disadvantages, users need to be able to
specify how much memory vital tasks should be able to use compared to
others without getting penalized and that needs to be done as a fraction
of available memory. I wanted to avoid it originally by not having to
introduce another tunable, but I understand the need for a stable ABI and
backwards compatability. The way /proc/pid/oom_adj currently works as a
bitshift on the badness score is nearly impossible to tune correctly so
change in scoring is inevitable. Luckily, users who tune either can
ignore the other until such time as oom_adj can be removed.

> There _are_ things we can do though. Detect a write to the old file and
> emit a WARN_ON_ONCE("you suck"). Wait a year, turn it into
> WARN_ON("you really suck"). Wait a year, then remove it.
>

Ok, I'll use WARN_ON_ONCE() to let the user know of the deprecation and
then add an entry to Documentation/feature-removal-schedule.txt.

2010-02-11 23:38:33

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Thu, 11 Feb 2010 15:31:14 -0800 (PST)
David Rientjes <[email protected]> wrote:

> > There _are_ things we can do though. Detect a write to the old file and
> > emit a WARN_ON_ONCE("you suck"). Wait a year, turn it into
> > WARN_ON("you really suck"). Wait a year, then remove it.
> >
>
> Ok, I'll use WARN_ON_ONCE() to let the user know of the deprecation and
> then add an entry to Documentation/feature-removal-schedule.txt.

A printk_once() would be better - WARN() generates a big stack
spew which often is wholly irrelevant.

2010-02-11 23:57:11

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch 1/7 -mm] oom: filter tasks not sharing the same cpuset

On Wed, 10 Feb 2010 08:32:08 -0800 (PST)
David Rientjes <[email protected]> wrote:

> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
>
> Tasks in other cpusets with a disjoint set of mems would be unfairly
> penalized otherwise because of oom conditions elsewhere; an extreme
> example could unfairly kill all other applications on the system if a
> single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> more memory than allowed.
>
> Killing tasks outside of current's cpuset rarely would free memory for
> current anyway.
>
> Signed-off-by: David Rientjes <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2010-02-12 00:03:44

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch 2/7 -mm] oom: sacrifice child with highest badness score for parent

On Wed, 10 Feb 2010 08:32:10 -0800 (PST)
David Rientjes <[email protected]> wrote:

> When a task is chosen for oom kill, the oom killer first attempts to
> sacrifice a child not sharing its parent's memory instead.
> Unfortunately, this often kills in a seemingly random fashion based on
> the ordering of the selected task's child list. Additionally, it is not
> guaranteed at all to free a large amount of memory that we need to
> prevent additional oom killing in the very near future.
>
> Instead, we now only attempt to sacrifice the worst child not sharing its
> parent's memory, if one exists. The worst child is indicated with the
> highest badness() score. This serves two advantages: we kill a
> memory-hogging task more often, and we allow the configurable
> /proc/pid/oom_adj value to be considered as a factor in which child to
> kill.
>
> Reviewers may observe that the previous implementation would iterate
> through the children and attempt to kill each until one was successful
> and then the parent if none were found while the new code simply kills
> the most memory-hogging task or the parent. Note that the only time
> oom_kill_task() fails, however, is when a child does not have an mm or
> has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both
> cases, so the final oom_kill_task() will always succeed.
>
> Signed-off-by: David Rientjes <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

Maybe better than current logic..but I'm not sure why we have to check children ;)

BTW,
==
list_for_each_entry(child, &p->children, sibling) {
task_lock(child);
if (child->mm != mm && child->mm)
points += child->mm->total_vm/2 + 1;
task_unlock(child);
}
==
I wonder this part should be
points += (child->total_vm/2) >> child->signal->oom_adj + 1

If not, in following situation,
==
parent (oom_adj = 0)
-> child (oom_adj=-15, very big memory user)
==
the child may be killd at first, anyway. Today, I have to explain customers
"When you set oom_adj to a process, please set the same value to all ancestors.
Otherwise, your oom_adj value will be ignored."

No ?

Thanks,
-Kame

> ---
> mm/oom_kill.c | 23 +++++++++++++++++------
> 1 files changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -432,7 +432,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> unsigned long points, struct mem_cgroup *mem,
> const char *message)
> {
> + struct task_struct *victim = p;
> struct task_struct *c;
> + unsigned long victim_points = 0;
> + struct timespec uptime;
>
> if (printk_ratelimit())
> dump_header(p, gfp_mask, order, mem);
> @@ -446,17 +449,25 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> return 0;
> }
>
> - printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
> - message, task_pid_nr(p), p->comm, points);
> + pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
> + message, task_pid_nr(p), p->comm, points);
>
> - /* Try to kill a child first */
> + /* Try to sacrifice the worst child first */
> + do_posix_clock_monotonic_gettime(&uptime);
> list_for_each_entry(c, &p->children, sibling) {
> + unsigned long cpoints;
> +
> if (c->mm == p->mm)
> continue;
> - if (!oom_kill_task(c))
> - return 0;
> +
> + /* badness() returns 0 if the thread is unkillable */
> + cpoints = badness(c, uptime.tv_sec);
> + if (cpoints > victim_points) {
> + victim = c;
> + victim_points = cpoints;
> + }
> }
> - return oom_kill_task(p);
> + return oom_kill_task(victim);
> }
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>

2010-02-12 00:15:31

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 2/7 -mm] oom: sacrifice child with highest badness score for parent

On Fri, 12 Feb 2010, KAMEZAWA Hiroyuki wrote:

> Maybe better than current logic..but I'm not sure why we have to check children ;)
>
> BTW,
> ==
> list_for_each_entry(child, &p->children, sibling) {
> task_lock(child);
> if (child->mm != mm && child->mm)
> points += child->mm->total_vm/2 + 1;
> task_unlock(child);
> }
> ==
> I wonder this part should be
> points += (child->total_vm/2) >> child->signal->oom_adj + 1
>
> If not, in following situation,
> ==
> parent (oom_adj = 0)
> -> child (oom_adj=-15, very big memory user)
> ==
> the child may be killd at first, anyway. Today, I have to explain customers
> "When you set oom_adj to a process, please set the same value to all ancestors.
> Otherwise, your oom_adj value will be ignored."
>

This is a different change than the forkbomb detection which is rewritten
in the fourth patch in the series. We must rely on badness() being able
to tell us how beneficial it will be to kill a task, so iterating through
the child list and picking the most beneficial is the goal of this patch.
It reduces the chances of needlessly killing a child using very little
memory for no benefit just because it was forked first.

2010-02-12 00:16:12

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch 7/7 -mm] oom: remove unnecessary code and cleanup

On Wed, 10 Feb 2010 08:32:24 -0800 (PST)
David Rientjes <[email protected]> wrote:

> Remove the redundancy in __oom_kill_task() since:
>
> - init can never be passed to this function: it will never be PF_EXITING
> or selectable from select_bad_process(), and
>
> - it will never be passed a task from oom_kill_task() without an ->mm
> and we're unconcerned about detachment from exiting tasks, there's no
> reason to protect them against SIGKILL or access to memory reserves.
>
> Also moves the kernel log message to a higher level since the verbosity
> is not always emitted here; we need not print an error message if an
> exiting task is given a longer timeslice.
>
> Signed-off-by: David Rientjes <[email protected]>

If you say "never", it's better to add BUG_ON() rather than
if (!p->mm)...

But yes, this patch seesm to remove unnecessary codes.
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>


> ---
> mm/oom_kill.c | 64 ++++++++++++++------------------------------------------
> 1 files changed, 16 insertions(+), 48 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -400,67 +400,35 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
> dump_tasks(mem);
> }
>
> -#define K(x) ((x) << (PAGE_SHIFT-10))
> -
> /*
> - * Send SIGKILL to the selected process irrespective of CAP_SYS_RAW_IO
> - * flag though it's unlikely that we select a process with CAP_SYS_RAW_IO
> - * set.
> + * Give the oom killed task high priority and access to memory reserves so that
> + * it may quickly exit and free its memory.
> */
> -static void __oom_kill_task(struct task_struct *p, int verbose)
> +static void __oom_kill_task(struct task_struct *p)
> {
> - if (is_global_init(p)) {
> - WARN_ON(1);
> - printk(KERN_WARNING "tried to kill init!\n");
> - return;
> - }
> -
> - task_lock(p);
> - if (!p->mm) {
> - WARN_ON(1);
> - printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
> - task_pid_nr(p), p->comm);
> - task_unlock(p);
> - return;
> - }
> -
> - if (verbose)
> - printk(KERN_ERR "Killed process %d (%s) "
> - "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
> - task_pid_nr(p), p->comm,
> - K(p->mm->total_vm),
> - K(get_mm_counter(p->mm, MM_ANONPAGES)),
> - K(get_mm_counter(p->mm, MM_FILEPAGES)));
> - task_unlock(p);
> -
> - /*
> - * We give our sacrificial lamb high priority and access to
> - * all the memory it needs. That way it should be able to
> - * exit() and clear out its resources quickly...
> - */
> p->rt.time_slice = HZ;
> set_tsk_thread_flag(p, TIF_MEMDIE);
> -
> force_sig(SIGKILL, p);
> }
>
> +#define K(x) ((x) << (PAGE_SHIFT-10))
> static int oom_kill_task(struct task_struct *p)
> {
> - /* WARNING: mm may not be dereferenced since we did not obtain its
> - * value from get_task_mm(p). This is OK since all we need to do is
> - * compare mm to q->mm below.
> - *
> - * Furthermore, even if mm contains a non-NULL value, p->mm may
> - * change to NULL at any time since we do not hold task_lock(p).
> - * However, this is of no concern to us.
> - */
> - if (!p->mm || p->signal->oom_adj == OOM_DISABLE)
> + task_lock(p);
> + if (!p->mm || p->signal->oom_adj == OOM_DISABLE) {
> + task_unlock(p);
> return 1;
> + }
> + pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
> + task_pid_nr(p), p->comm, K(p->mm->total_vm),
> + K(get_mm_counter(p->mm, MM_ANONPAGES)),
> + K(get_mm_counter(p->mm, MM_FILEPAGES)));
> + task_unlock(p);
>
> - __oom_kill_task(p, 1);
> -
> + __oom_kill_task(p);
> return 0;
> }
> +#undef K
>
> static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> unsigned int points, unsigned long totalpages,
> @@ -479,7 +447,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> * its children or threads, just set TIF_MEMDIE so it can die quickly
> */
> if (p->flags & PF_EXITING) {
> - __oom_kill_task(p, 0);
> + __oom_kill_task(p);
> return 0;
> }
>
>

2010-02-12 00:22:05

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 7/7 -mm] oom: remove unnecessary code and cleanup

On Fri, 12 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > Remove the redundancy in __oom_kill_task() since:
> >
> > - init can never be passed to this function: it will never be PF_EXITING
> > or selectable from select_bad_process(), and
> >
> > - it will never be passed a task from oom_kill_task() without an ->mm
> > and we're unconcerned about detachment from exiting tasks, there's no
> > reason to protect them against SIGKILL or access to memory reserves.
> >
> > Also moves the kernel log message to a higher level since the verbosity
> > is not always emitted here; we need not print an error message if an
> > exiting task is given a longer timeslice.
> >
> > Signed-off-by: David Rientjes <[email protected]>
>
> If you say "never", it's better to add BUG_ON() rather than
> if (!p->mm)...
>

As the description says, oom_kill_task() never passes __oom_kill_task() a
task, p, where !p->mm, but it doesn't imply that p cannot detach its ->mm
before __oom_kill_task() gets a chance to run. The point is that we don't
really care about giving it access to memory reserves anymore since it's
exiting and won't be allocating anything. Warning about that scenario is
unnecessary and would simply spam the kernel log, a recall to the oom
killer would no longer select this task in case the oom condition persists
anyway.

> But yes, this patch seesm to remove unnecessary codes.
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
>

Thanks!

2010-02-12 00:30:07

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch 5/7 -mm] oom: replace sysctls with quick mode

On Wed, 10 Feb 2010 08:32:17 -0800 (PST)
David Rientjes <[email protected]> wrote:

> Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were
> implemented for very large systems to avoid excessively long tasklist
> scans. The former suppresses helpful diagnostic messages that are
> emitted for each thread group leader that are candidates for oom kill
> including their pid, uid, vm size, rss, oom_adj value, and name; this
> information is very helpful to users in understanding why a particular
> task was chosen for kill over others. The latter simply kills current,
> the task triggering the oom condition, instead of iterating through the
> tasklist looking for the worst offender.
>
> Both of these sysctls are combined into one for use on the aforementioned
> large systems: oom_kill_quick. This disables the now-default
> oom_dump_tasks and kills current whenever the oom killer is called.
>
> The oom killer rewrite is the perfect opportunity to combine both sysctls
> into one instead of carrying around the others for years to come for
> nothing else than legacy purposes.
>
> Signed-off-by: David Rientjes <[email protected]>

seems reasonable..but how old these APIs are ? Replacement is ok ?

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>


> ---
> Documentation/sysctl/vm.txt | 44 +++++-------------------------------------
> include/linux/oom.h | 3 +-
> kernel/sysctl.c | 13 ++---------
> mm/oom_kill.c | 9 +++----
> 4 files changed, 14 insertions(+), 55 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -43,9 +43,8 @@ Currently, these files are in /proc/sys/vm:
> - nr_pdflush_threads
> - nr_trim_pages (only if CONFIG_MMU=n)
> - numa_zonelist_order
> -- oom_dump_tasks
> - oom_forkbomb_thres
> -- oom_kill_allocating_task
> +- oom_kill_quick
> - overcommit_memory
> - overcommit_ratio
> - page-cluster
> @@ -470,27 +469,6 @@ this is causing problems for your system/application.
>
> ==============================================================
>
> -oom_dump_tasks
> -
> -Enables a system-wide task dump (excluding kernel threads) to be
> -produced when the kernel performs an OOM-killing and includes such
> -information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and
> -name. This is helpful to determine why the OOM killer was invoked
> -and to identify the rogue task that caused it.
> -
> -If this is set to zero, this information is suppressed. On very
> -large systems with thousands of tasks it may not be feasible to dump
> -the memory state information for each one. Such systems should not
> -be forced to incur a performance penalty in OOM conditions when the
> -information may not be desired.
> -
> -If this is set to non-zero, this information is shown whenever the
> -OOM killer actually kills a memory-hogging task.
> -
> -The default value is 0.
> -
> -==============================================================
> -
> oom_forkbomb_thres
>
> This value defines how many children with a seperate address space a specific
> @@ -511,22 +489,12 @@ The default value is 1000.
>
> ==============================================================
>
> -oom_kill_allocating_task
> -
> -This enables or disables killing the OOM-triggering task in
> -out-of-memory situations.
> -
> -If this is set to zero, the OOM killer will scan through the entire
> -tasklist and select a task based on heuristics to kill. This normally
> -selects a rogue memory-hogging task that frees up a large amount of
> -memory when killed.
> -
> -If this is set to non-zero, the OOM killer simply kills the task that
> -triggered the out-of-memory condition. This avoids the expensive
> -tasklist scan.
> +oom_kill_quick
>
> -If panic_on_oom is selected, it takes precedence over whatever value
> -is used in oom_kill_allocating_task.
> +When enabled, this will always kill the task that triggered the oom killer, i.e.
> +the task that attempted to allocate memory that could not be found. It also
> +suppresses the tasklist dump to the kernel log whenever the oom killer is
> +called. Typically set on systems with an extremely large number of tasks.
>
> The default value is 0.
>
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -51,8 +51,7 @@ static inline void oom_killer_enable(void)
> }
> /* for sysctl */
> extern int sysctl_panic_on_oom;
> -extern int sysctl_oom_kill_allocating_task;
> -extern int sysctl_oom_dump_tasks;
> +extern int sysctl_oom_kill_quick;
> extern int sysctl_oom_forkbomb_thres;
>
> #endif /* __KERNEL__*/
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -949,16 +949,9 @@ static struct ctl_table vm_table[] = {
> .proc_handler = proc_dointvec,
> },
> {
> - .procname = "oom_kill_allocating_task",
> - .data = &sysctl_oom_kill_allocating_task,
> - .maxlen = sizeof(sysctl_oom_kill_allocating_task),
> - .mode = 0644,
> - .proc_handler = proc_dointvec,
> - },
> - {
> - .procname = "oom_dump_tasks",
> - .data = &sysctl_oom_dump_tasks,
> - .maxlen = sizeof(sysctl_oom_dump_tasks),
> + .procname = "oom_kill_quick",
> + .data = &sysctl_oom_kill_quick,
> + .maxlen = sizeof(sysctl_oom_kill_quick),
> .mode = 0644,
> .proc_handler = proc_dointvec,
> },
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -32,9 +32,8 @@
> #include <linux/security.h>
>
> int sysctl_panic_on_oom;
> -int sysctl_oom_kill_allocating_task;
> -int sysctl_oom_dump_tasks;
> int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
> +int sysctl_oom_kill_quick;
> static DEFINE_SPINLOCK(zone_scan_lock);
>
> /*
> @@ -397,7 +396,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
> dump_stack();
> mem_cgroup_print_oom_info(mem, p);
> show_mem();
> - if (sysctl_oom_dump_tasks)
> + if (!sysctl_oom_kill_quick)
> dump_tasks(mem);
> }
>
> @@ -604,9 +603,9 @@ static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
> struct task_struct *p;
> unsigned int points;
>
> - if (sysctl_oom_kill_allocating_task)
> + if (sysctl_oom_kill_quick)
> if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
> - NULL, "Out of memory (oom_kill_allocating_task)"))
> + NULL, "Out of memory (quick mode)"))
> return;
> retry:
> /*
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>

2010-02-12 01:32:12

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch 6/7 -mm] oom: avoid oom killer for lowmem allocations

On Wed, 10 Feb 2010 08:32:21 -0800 (PST)
David Rientjes <[email protected]> wrote:

> If memory has been depleted in lowmem zones even with the protection
> afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
> killing current users will help. The memory is either reclaimable (or
> migratable) already, in which case we should not invoke the oom killer at
> all, or it is pinned by an application for I/O. Killing such an
> application may leave the hardware in an unspecified state and there is
> no guarantee that it will be able to make a timely exit.
>
> Lowmem allocations are now failed in oom conditions so that the task can
> perhaps recover or try again later. Killing current is an unnecessary
> result for simply making a GFP_DMA or GFP_DMA32 page allocation and no
> lowmem allocations use the now-deprecated __GFP_NOFAIL bit so retrying is
> unnecessary.
>
> Previously, the heuristic provided some protection for those tasks with
> CAP_SYS_RAWIO, but this is no longer necessary since we will not be
> killing tasks for the purposes of ISA allocations.
>
> Signed-off-by: David Rientjes <[email protected]>

>From viewpoint of panic-on-oom lover, this patch seems to cause regression.
please do this check after sysctl_panic_on_oom == 2 test.
I think it's easy. So, temporary Nack to this patch itself.


And I think calling notifier is not very bad in the situation.
==
void out_of_memory()
..snip..
blocking_notifier_call_chain(&oom_notify_list, 0, &freed);


So,

if (sysctl_panic_on_oom == 2) {
dump_header(NULL, gfp_mask, order, NULL);
panic("out of memory. Compulsory panic_on_oom is selected.\n");
}

if (gfp_zone(gfp_mask) < ZONE_NORMAL) /* oom-kill is useless if lowmem is exhausted. */
return;

is better. I think.

Thanks,
-Kame


> ---
> mm/page_alloc.c | 3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1914,6 +1914,9 @@ rebalance:
> * running out of options and have to consider going OOM
> */
> if (!did_some_progress) {
> + /* The oom killer won't necessarily free lowmem */
> + if (high_zoneidx < ZONE_NORMAL)
> + goto nopage;
> if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> if (oom_killer_disabled)
> goto nopage;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2010-02-12 09:58:21

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 5/7 -mm] oom: replace sysctls with quick mode

On Fri, 12 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were
> > implemented for very large systems to avoid excessively long tasklist
> > scans. The former suppresses helpful diagnostic messages that are
> > emitted for each thread group leader that are candidates for oom kill
> > including their pid, uid, vm size, rss, oom_adj value, and name; this
> > information is very helpful to users in understanding why a particular
> > task was chosen for kill over others. The latter simply kills current,
> > the task triggering the oom condition, instead of iterating through the
> > tasklist looking for the worst offender.
> >
> > Both of these sysctls are combined into one for use on the aforementioned
> > large systems: oom_kill_quick. This disables the now-default
> > oom_dump_tasks and kills current whenever the oom killer is called.
> >
> > The oom killer rewrite is the perfect opportunity to combine both sysctls
> > into one instead of carrying around the others for years to come for
> > nothing else than legacy purposes.
> >
> > Signed-off-by: David Rientjes <[email protected]>
>
> seems reasonable..but how old these APIs are ? Replacement is ok ?
>

I'm not concerned about /proc/sys/vm/oom_dump_tasks because it was
disabled by default and is now enabled by default (unless the user sets
this new /proc/sys/vm/oom_kill_quick). So existing users of
oom_dump_tasks will just have their write fail but identical behavior as
before.

/proc/sys/vm/oom_kill_allocating_task is different since it now requires
enabling /proc/sys/vm/oom_kill_quick, but I think there are such few users
(SGI originally requested it a couple years ago when we started scanning
the tasklist for CONSTRAINT_CPUSET in 2.6.24) and the side-effect of not
enabling it is minimal, it's just a long delay at oom kill time because
they must scan the tasklist. Therefore, I don't see it as a major problem
that will cause large disruptions, instead I see it as a great opportunity
to get rid of one more sysctl without taking away functionality.

> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
>

Thanks!

2010-02-12 10:06:56

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 6/7 -mm] oom: avoid oom killer for lowmem allocations

On Fri, 12 Feb 2010, KAMEZAWA Hiroyuki wrote:

> From viewpoint of panic-on-oom lover, this patch seems to cause regression.
> please do this check after sysctl_panic_on_oom == 2 test.
> I think it's easy. So, temporary Nack to this patch itself.
>
>
> And I think calling notifier is not very bad in the situation.
> ==
> void out_of_memory()
> ..snip..
> blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
>
>
> So,
>
> if (sysctl_panic_on_oom == 2) {
> dump_header(NULL, gfp_mask, order, NULL);
> panic("out of memory. Compulsory panic_on_oom is selected.\n");
> }
>
> if (gfp_zone(gfp_mask) < ZONE_NORMAL) /* oom-kill is useless if lowmem is exhausted. */
> return;
>
> is better. I think.
>

I can't agree with that assessment, I don't think it's a desired result to
ever panic the machine regardless of what /proc/sys/vm/panic_on_oom is set
to because a lowmem page allocation fails especially considering, as
mentioned in the changelog, these allocations are never __GFP_NOFAIL and
returning NULL is acceptable.

I've always disliked panicking the machine when a cpuset or mempolicy
allocation fails and panic_on_oom is set to 2. Since both such
constraints now force an iteration of the tasklist when oom_kill_quick is
not enabled and we strictly prohibit the consideration of tasks with
disjoint cpuset mems or mempolicy nodes, I think I'll take this
opportunity to get rid of the panic_on_oom == 2 behavior and ask that
users who really do want to panic the entire machine for cpuset or
mempolicy constrained ooms to simply set all such tasks to OOM_DISABLE.

2010-02-12 13:56:35

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

Hi, David.

On Thu, 2010-02-11 at 01:14 -0800, David Rientjes wrote:
> > > +/*
> > > + * Tasks that fork a very large number of children with seperate address
> > > spaces
> > > + * may be the result of a bug, user error, or a malicious application. The
> > > oom
> > > + * killer assesses a penalty equaling
> >
> > It could also be the result of the system getting many client
> > connections - think of overloaded mail, web or database servers.
> >
>
> True, that's a great example of why child tasks should be sacrificed for
> the parent: if the oom killer is being called then we are truly overloaded
> and there's no shame in killing excessive client connections to recover,
> otherwise we might find the entire server becoming unresponsive. The user
> can easily tune to /proc/sys/vm/oom_forkbomb_thres to define what
> "excessive" is to assess the penalty, if any. I'll add that to the
> comment if we require a second revision.
>

I am worried about opposite case.

If forkbomb parent makes so many children in a short time(ex, 2000 per
second) continuously and we kill a child continuously not parent, system
is almost unresponsible, I think.
I suffered from that case in LTP and no swap system.
It might be a corner case but might happen in real.

I think we could have two types of forkbomb.

Normal forkbomb : apache, DB server and so on.
Buggy forkbomb: It's mistake of user.

We can control normal forkbomb by oom_forkbomb_thres.
But how about handling buggy forkbomb?

If we make sure this task is buggy forkbomb, it would be better to kill
it. But it's hard to make sure it's a buggy forkbomb.

Could we solve this problem by following as?
If OOM selects victim and then the one was selected victim right before
and it's repeatable 5 times for example, then we kill the victim(buggy
forkbom) itself not child of one. It is assumed normal forkbomb is
controlled by admin who uses oom_forkbomb_thres well. So it doesn't
happen selecting victim continuously above five time.


--
Kind regards,
Minchan Kim

2010-02-12 21:00:18

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Fri, 12 Feb 2010, Minchan Kim wrote:

> > True, that's a great example of why child tasks should be sacrificed for
> > the parent: if the oom killer is being called then we are truly overloaded
> > and there's no shame in killing excessive client connections to recover,
> > otherwise we might find the entire server becoming unresponsive. The user
> > can easily tune to /proc/sys/vm/oom_forkbomb_thres to define what
> > "excessive" is to assess the penalty, if any. I'll add that to the
> > comment if we require a second revision.
> >
>
> I am worried about opposite case.
>
> If forkbomb parent makes so many children in a short time(ex, 2000 per
> second) continuously and we kill a child continuously not parent, system
> is almost unresponsible, I think.

The oom killer is not the appropriate place for a kernel forkbomb policy
to be implemented, you'd need to address that concern in the scheduler.
When I've brought that up in the past, the response is that if we aren't
out of memory, then it isn't a problem. It is a problem for buggy
applications because their timeslice is now spread across an egregious
amount of tasks that they are perhaps leaking and is detrimental to their
server's performance. I'm not saying that we need to enforce a hard limit
on how many tasks a server forks, for instance, but the scheduler can
detect forkbombs much easier than the oom killer's tasklist scan by at
least indicating to us with a process flag that it is a likely forkbomb.

> I suffered from that case in LTP and no swap system.
> It might be a corner case but might happen in real.
>

If you look at the patchset overall and not just this one patch, you'll
notice that we now kill the child with the highest badness() score first,
i.e. generally the one consuming the most memory. That is radically
different than the previous behavior and should prevent the system from
becoming unresponsive. The goal is to allow the user to react to the
forkbomb rather than implement a strict detection and handling heuristic
that kills innocent servers and system daemons.

> If we make sure this task is buggy forkbomb, it would be better to kill
> it. But it's hard to make sure it's a buggy forkbomb.
>
> Could we solve this problem by following as?
> If OOM selects victim and then the one was selected victim right before
> and it's repeatable 5 times for example, then we kill the victim(buggy
> forkbom) itself not child of one. It is assumed normal forkbomb is
> controlled by admin who uses oom_forkbomb_thres well. So it doesn't
> happen selecting victim continuously above five time.
>

That doesn't work with Rik's example of a webserver that forks a large
number of threads to handle client connections. It is _always_ better to
kill a child instead of making the entire webserver unresponsive.

In other words, doing anything in the oom killer other than slightly
penalizing these tasks and killing a child is really a non-starter because
there are too many critical use cases (we have many) that would be
unfairly biased against.

2010-02-13 02:45:10

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Sat, Feb 13, 2010 at 6:00 AM, David Rientjes <[email protected]> wrote:
> On Fri, 12 Feb 2010, Minchan Kim wrote:
>
>> > True, that's a great example of why child tasks should be sacrificed for
>> > the parent: if the oom killer is being called then we are truly overloaded
>> > and there's no shame in killing excessive client connections to recover,
>> > otherwise we might find the entire server becoming unresponsive.  The user
>> > can easily tune to /proc/sys/vm/oom_forkbomb_thres to define what
>> > "excessive" is to assess the penalty, if any.  I'll add that to the
>> > comment if we require a second revision.
>> >
>>
>> I am worried about opposite case.
>>
>> If forkbomb parent makes so many children in a short time(ex, 2000 per
>> second) continuously and we kill a child continuously not parent, system
>> is almost unresponsible, I think.
>
> The oom killer is not the appropriate place for a kernel forkbomb policy
> to be implemented, you'd need to address that concern in the scheduler.

I agree. but your's patch try to implement policy(avg rss of children < HZ)
in oom killer as well as detection.
so I pointed out that.
I think if we want to implement it, we also consider above scenario.
As you said, it would be better to detect forkbom in scheduler.
Then, let's remove forkbomb detection in OOM killer.
Afterward, we can implement it in scheduler and can use it in OOM killer.

It makes OOM killer more simple and predictable.

> When I've brought that up in the past, the response is that if we aren't
> out of memory, then it isn't a problem.  It is a problem for buggy

I said the situation out of memory.

> applications because their timeslice is now spread across an egregious
> amount of tasks that they are perhaps leaking and is detrimental to their
> server's performance.  I'm not saying that we need to enforce a hard limit
> on how many tasks a server forks, for instance, but the scheduler can
> detect forkbombs much easier than the oom killer's tasklist scan by at
> least indicating to us with a process flag that it is a likely forkbomb.
>
>> I suffered from that case in LTP and no swap system.
>> It might be a corner case but might happen in real.
>>
>
> If you look at the patchset overall and not just this one patch, you'll
> notice that we now kill the child with the highest badness() score first,
> i.e. generally the one consuming the most memory.  That is radically

It would work well just in case children have big difference badness scores.

> different than the previous behavior and should prevent the system from
> becoming unresponsive.  The goal is to allow the user to react to the
> forkbomb rather than implement a strict detection and handling heuristic
> that kills innocent servers and system daemons.
>
>> If we make sure this task is buggy forkbomb, it would be better to kill
>> it. But it's hard to make sure it's a buggy forkbomb.
>>
>> Could we solve this problem by following as?
>> If OOM selects victim and then the one was selected victim right before
>> and it's repeatable 5 times for example, then we kill the victim(buggy
>> forkbom) itself not child of one. It is assumed normal forkbomb is
>> controlled by admin who uses oom_forkbomb_thres well. So it doesn't
>> happen selecting victim continuously above five time.
>>
>
> That doesn't work with Rik's example of a webserver that forks a large
> number of threads to handle client connections.  It is _always_ better to
> kill a child instead of making the entire webserver unresponsive.

In such case, admin have to handle it by oom_forkbom_thres.
Isn't it your goal?

My suggestion is how handle buggy forkbomb processes which make
system almost hang by user's mistake. :)

>
> In other words, doing anything in the oom killer other than slightly
> penalizing these tasks and killing a child is really a non-starter because
> there are too many critical use cases (we have many) that would be
> unfairly biased against.
>

Tend to agree. So I hope we shouldn't have consider forkbomb in OOM killer.
I guess forkbomb logic in OOM killer could make many issues in future, still.
As you said, it would be better to implement in scheduler and OOM
killer just uses it.

Thanks for quick reply, David.
--
Kind regards,
Minchan Kim

2010-02-13 03:19:52

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 2/7 -mm] oom: sacrifice child with highest badness score for parent

On Thu, Feb 11, 2010 at 1:32 AM, David Rientjes <[email protected]> wrote:
> When a task is chosen for oom kill, the oom killer first attempts to
> sacrifice a child not sharing its parent's memory instead.
> Unfortunately, this often kills in a seemingly random fashion based on
> the ordering of the selected task's child list.  Additionally, it is not
> guaranteed at all to free a large amount of memory that we need to
> prevent additional oom killing in the very near future.
>
> Instead, we now only attempt to sacrifice the worst child not sharing its
> parent's memory, if one exists.  The worst child is indicated with the
> highest badness() score.  This serves two advantages: we kill a
> memory-hogging task more often, and we allow the configurable
> /proc/pid/oom_adj value to be considered as a factor in which child to
> kill.
>
> Reviewers may observe that the previous implementation would iterate
> through the children and attempt to kill each until one was successful
> and then the parent if none were found while the new code simply kills
> the most memory-hogging task or the parent.  Note that the only time
> oom_kill_task() fails, however, is when a child does not have an mm or
> has a /proc/pid/oom_adj of OOM_DISABLE.  badness() returns 0 for both
> cases, so the final oom_kill_task() will always succeed.
>
> Signed-off-by: David Rientjes <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

Regardless of forkbom detection, It does makes sense to me.

--
Kind regards,
Minchan Kim

2010-02-15 00:16:30

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch 6/7 -mm] oom: avoid oom killer for lowmem allocations

On Fri, 12 Feb 2010 02:06:49 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Fri, 12 Feb 2010, KAMEZAWA Hiroyuki wrote:
>
> > From viewpoint of panic-on-oom lover, this patch seems to cause regression.
> > please do this check after sysctl_panic_on_oom == 2 test.
> > I think it's easy. So, temporary Nack to this patch itself.
> >
> >
> > And I think calling notifier is not very bad in the situation.
> > ==
> > void out_of_memory()
> > ..snip..
> > blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> >
> >
> > So,
> >
> > if (sysctl_panic_on_oom == 2) {
> > dump_header(NULL, gfp_mask, order, NULL);
> > panic("out of memory. Compulsory panic_on_oom is selected.\n");
> > }
> >
> > if (gfp_zone(gfp_mask) < ZONE_NORMAL) /* oom-kill is useless if lowmem is exhausted. */
> > return;
> >
> > is better. I think.
> >
>
> I can't agree with that assessment, I don't think it's a desired result to
> ever panic the machine regardless of what /proc/sys/vm/panic_on_oom is set
> to because a lowmem page allocation fails especially considering, as
> mentioned in the changelog, these allocations are never __GFP_NOFAIL and
> returning NULL is acceptable.
>
please add
WARN_ON((high_zoneidx < ZONE_NORMAL) && (gfp_mask & __GFP_NOFAIL))
somewhere. Then, it seems your patch makes sense.

I don't like the "possibility" of inifinte loops.

Thanks,
-Kame


2010-02-15 02:51:36

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 0/7 -mm] oom killer rewrite

> This patchset is a rewrite of the out of memory killer to address several
> issues that have been raised recently. The most notable change is a
> complete rewrite of the badness heuristic that determines which task is
> killed; the goal was to make it as simple and predictable as possible
> while still addressing issues that plague the VM.
>
> This patchset is based on mmotm-2010-02-05-15-06 because of the following
> dependencies:

At first, I'm glad that you tackle this issue. unfortunatelly I'm very busy now.
but I'll make a time for reviewing this patches asap.




>
> [patch 4/7] oom: badness heuristic rewrite:
> mm-count-swap-usage.patch
>
> [patch 5/7] oom: replace sysctls with quick mode:
> sysctl-clean-up-vm-related-variable-delcarations.patch
>
> To apply to mainline, download 2.6.33-rc7 and apply
>
> mm-clean-up-mm_counter.patch
> mm-avoid-false-sharing-of-mm_counter.patch
> mm-avoid-false_sharing-of-mm_counter-checkpatch-fixes.patch
> mm-count-swap-usage.patch
> mm-count-swap-usage-checkpatch-fixes.patch
> mm-introduce-dump_page-and-print-symbolic-flag-names.patch
> sysctl-clean-up-vm-related-variable-declarations.patch
> sysctl-clean-up-vm-related-variable-declarations-fix.patch
>
> from http://userweb.kernel.org/~akpm/mmotm/broken-out.tar.gz first.
> ---
> Documentation/filesystems/proc.txt | 78 ++++---
> Documentation/sysctl/vm.txt | 51 ++---
> fs/proc/base.c | 13 +-
> include/linux/mempolicy.h | 13 +-
> include/linux/oom.h | 18 +-
> kernel/sysctl.c | 15 +-
> mm/mempolicy.c | 39 +++
> mm/oom_kill.c | 455 +++++++++++++++++++-----------------
> mm/page_alloc.c | 3 +
> 9 files changed, 383 insertions(+), 302 deletions(-)
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>


2010-02-15 02:56:39

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 1/7 -mm] oom: filter tasks not sharing the same cpuset

> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
>
> Tasks in other cpusets with a disjoint set of mems would be unfairly
> penalized otherwise because of oom conditions elsewhere; an extreme
> example could unfairly kill all other applications on the system if a
> single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> more memory than allowed.
>
> Killing tasks outside of current's cpuset rarely would free memory for
> current anyway.
>
> Signed-off-by: David Rientjes <[email protected]>

This patch does right thing and looks promissing. but unfortunately
I have to NAK this patch temporary.

This patch is nearly just revert of the commit 7887a3da75. We have to
dig archaeology mail log and find why this reverting don't cause
the old pain again.

However, I personally think we end up to merge this. It is the reason
I've used "temporary".


> ---
> mm/oom_kill.c | 12 +++---------
> 1 files changed, 3 insertions(+), 9 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -35,7 +35,7 @@ static DEFINE_SPINLOCK(zone_scan_lock);
> /* #define DEBUG */
>
> /*
> - * Is all threads of the target process nodes overlap ours?
> + * Do all threads of the target process overlap our allowed nodes?
> */
> static int has_intersects_mems_allowed(struct task_struct *tsk)
> {
> @@ -167,14 +167,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> points /= 4;
>
> /*
> - * If p's nodes don't overlap ours, it may still help to kill p
> - * because p may have allocated or otherwise mapped memory on
> - * this node before. However it will be less likely.
> - */
> - if (!has_intersects_mems_allowed(p))
> - points /= 8;
> -
> - /*
> * Adjust the score by oom_adj.
> */
> if (oom_adj) {
> @@ -266,6 +258,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> continue;
> if (mem && !task_in_mem_cgroup(p, mem))
> continue;
> + if (!has_intersects_mems_allowed(p))
> + continue;
>
> /*
> * This task already has access to memory reserves and is
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>


2010-02-15 03:08:40

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 2/7 -mm] oom: sacrifice child with highest badness score for parent

> When a task is chosen for oom kill, the oom killer first attempts to
> sacrifice a child not sharing its parent's memory instead.
> Unfortunately, this often kills in a seemingly random fashion based on
> the ordering of the selected task's child list. Additionally, it is not
> guaranteed at all to free a large amount of memory that we need to
> prevent additional oom killing in the very near future.
>
> Instead, we now only attempt to sacrifice the worst child not sharing its
> parent's memory, if one exists. The worst child is indicated with the
> highest badness() score. This serves two advantages: we kill a
> memory-hogging task more often, and we allow the configurable
> /proc/pid/oom_adj value to be considered as a factor in which child to
> kill.
>
> Reviewers may observe that the previous implementation would iterate
> through the children and attempt to kill each until one was successful
> and then the parent if none were found while the new code simply kills
> the most memory-hogging task or the parent. Note that the only time
> oom_kill_task() fails, however, is when a child does not have an mm or
> has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both
> cases, so the final oom_kill_task() will always succeed.
>
> Signed-off-by: David Rientjes <[email protected]>

Probably, kamezawa-san talked about right thing. but this patch is
enough small and it have no regression risk. So, we can choice step-by-step
development.
Reviewed-by: KOSAKI Motohiro <[email protected]>



> ---
> mm/oom_kill.c | 23 +++++++++++++++++------
> 1 files changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -432,7 +432,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> unsigned long points, struct mem_cgroup *mem,
> const char *message)
> {
> + struct task_struct *victim = p;
> struct task_struct *c;
> + unsigned long victim_points = 0;
> + struct timespec uptime;
>
> if (printk_ratelimit())
> dump_header(p, gfp_mask, order, mem);
> @@ -446,17 +449,25 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> return 0;
> }
>
> - printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
> - message, task_pid_nr(p), p->comm, points);
> + pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
> + message, task_pid_nr(p), p->comm, points);
>
> - /* Try to kill a child first */
> + /* Try to sacrifice the worst child first */
> + do_posix_clock_monotonic_gettime(&uptime);
> list_for_each_entry(c, &p->children, sibling) {
> + unsigned long cpoints;
> +
> if (c->mm == p->mm)
> continue;
> - if (!oom_kill_task(c))
> - return 0;
> +
> + /* badness() returns 0 if the thread is unkillable */
> + cpoints = badness(c, uptime.tv_sec);
> + if (cpoints > victim_points) {
> + victim = c;
> + victim_points = cpoints;
> + }
> }
> - return oom_kill_task(p);
> + return oom_kill_task(victim);
> }
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>


2010-02-15 05:03:12

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 3/7 -mm] oom: select task from tasklist for mempolicy ooms

> The oom killer presently kills current whenever there is no more memory
> free or reclaimable on its mempolicy's nodes. There is no guarantee that
> current is a memory-hogging task or that killing it will free any
> substantial amount of memory, however.
>
> In such situations, it is better to scan the tasklist for nodes that are
> allowed to allocate on current's set of nodes and kill the task with the
> highest badness() score. This ensures that the most memory-hogging task,
> or the one configured by the user with /proc/pid/oom_adj, is always
> selected in such scenarios.
>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> include/linux/mempolicy.h | 13 +++++++-
> mm/mempolicy.c | 39 +++++++++++++++++++++++
> mm/oom_kill.c | 77 +++++++++++++++++++++++++++-----------------
> 3 files changed, 98 insertions(+), 31 deletions(-)
>
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -202,6 +202,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> unsigned long addr, gfp_t gfp_flags,
> struct mempolicy **mpol, nodemask_t **nodemask);
> extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
> +extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> + const nodemask_t *mask);
> extern unsigned slab_node(struct mempolicy *policy);
>
> extern enum zone_type policy_zone;
> @@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> return node_zonelist(0, gfp_flags);
> }
>
> -static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
> +static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
> +{
> + return false;
> +}
> +
> +static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> + const nodemask_t *mask)
> +{
> + return false;
> +}
>
> static inline int do_migrate_pages(struct mm_struct *mm,
> const nodemask_t *from_nodes,
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1638,6 +1638,45 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> }
> #endif
>
> +/*
> + * mempolicy_nodemask_intersects
> + *
> + * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
> + * policy. Otherwise, check for intersection between mask and the policy
> + * nodemask for 'bind' or 'interleave' policy, or mask to contain the single
> + * node for 'preferred' or 'local' policy.
> + */
> +bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> + const nodemask_t *mask)
> +{
> + struct mempolicy *mempolicy;
> + bool ret = true;
> +
> + mempolicy = tsk->mempolicy;
> + mpol_get(mempolicy);

Why is this refcount increment necessary? mempolicy is grabbed by tsk,
IOW it never be freed in this function.


> + if (!mask || !mempolicy)
> + goto out;
> +
> + switch (mempolicy->mode) {
> + case MPOL_PREFERRED:
> + if (mempolicy->flags & MPOL_F_LOCAL)
> + ret = node_isset(numa_node_id(), *mask);

Um? Is this good heuristic?
The task can migrate various cpus, then "node_isset(numa_node_id(), *mask) == 0"
doesn't mean the task doesn't consume *mask's memory.


> + else
> + ret = node_isset(mempolicy->v.preferred_node,
> + *mask);
> + break;
> + case MPOL_BIND:
> + case MPOL_INTERLEAVE:
> + ret = nodes_intersects(mempolicy->v.nodes, *mask);
> + break;
> + default:
> + BUG();
> + }
> +out:
> + mpol_put(mempolicy);
> + return ret;
> +}
> +
> /* Allocate a page in interleaved policy.
> Own path because it needs to do special accounting. */
> static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -26,6 +26,7 @@
> #include <linux/module.h>
> #include <linux/notifier.h>
> #include <linux/memcontrol.h>
> +#include <linux/mempolicy.h>
> #include <linux/security.h>
>
> int sysctl_panic_on_oom;
> @@ -36,19 +37,35 @@ static DEFINE_SPINLOCK(zone_scan_lock);
>
> /*
> * Do all threads of the target process overlap our allowed nodes?
> + * @tsk: task struct of which task to consider
> + * @mask: nodemask passed to page allocator for mempolicy ooms
> */
> -static int has_intersects_mems_allowed(struct task_struct *tsk)
> +static bool has_intersects_mems_allowed(struct task_struct *tsk,
> + const nodemask_t *mask)
> {
> - struct task_struct *t;
> + struct task_struct *start = tsk;
>
> - t = tsk;
> do {
> - if (cpuset_mems_allowed_intersects(current, t))
> - return 1;
> - t = next_thread(t);
> - } while (t != tsk);
> -
> - return 0;
> + if (mask) {
> + /*
> + * If this is a mempolicy constrained oom, tsk's
> + * cpuset is irrelevant. Only return true if its
> + * mempolicy intersects current, otherwise it may be
> + * needlessly killed.
> + */
> + if (mempolicy_nodemask_intersects(tsk, mask))
> + return true;
> + } else {
> + /*
> + * This is not a mempolicy constrained oom, so only
> + * check the mems of tsk's cpuset.
> + */
> + if (cpuset_mems_allowed_intersects(current, tsk))
> + return true;
> + }
> + tsk = next_thread(tsk);
> + } while (tsk != start);
> + return false;
> }
>
> /**
> @@ -236,7 +253,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> * (not docbooked, we don't want this one cluttering up the manual)
> */
> static struct task_struct *select_bad_process(unsigned long *ppoints,
> - struct mem_cgroup *mem)
> + struct mem_cgroup *mem, enum oom_constraint constraint,
> + const nodemask_t *mask)
> {
> struct task_struct *p;
> struct task_struct *chosen = NULL;
> @@ -258,7 +276,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> continue;
> if (mem && !task_in_mem_cgroup(p, mem))
> continue;
> - if (!has_intersects_mems_allowed(p))
> + if (!has_intersects_mems_allowed(p,
> + constraint == CONSTRAINT_MEMORY_POLICY ? mask :
> + NULL))
> continue;
>
> /*
> @@ -478,7 +498,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>
> read_lock(&tasklist_lock);
> retry:
> - p = select_bad_process(&points, mem);
> + p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
> if (PTR_ERR(p) == -1UL)
> goto out;
>
> @@ -560,7 +580,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
> /*
> * Must be called with tasklist_lock held for read.
> */
> -static void __out_of_memory(gfp_t gfp_mask, int order)
> +static void __out_of_memory(gfp_t gfp_mask, int order,
> + enum oom_constraint constraint, const nodemask_t *mask)
> {
> struct task_struct *p;
> unsigned long points;
> @@ -574,7 +595,7 @@ retry:
> * Rambo mode: Shoot down a process and hope it solves whatever
> * issues we may have.
> */
> - p = select_bad_process(&points, NULL);
> + p = select_bad_process(&points, NULL, constraint, mask);
>
> if (PTR_ERR(p) == -1UL)
> return;
> @@ -615,7 +636,8 @@ void pagefault_out_of_memory(void)
> panic("out of memory from page fault. panic_on_oom is selected.\n");
>
> read_lock(&tasklist_lock);
> - __out_of_memory(0, 0); /* unknown gfp_mask and order */
> + /* unknown gfp_mask and order */
> + __out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
> read_unlock(&tasklist_lock);
>
> /*
> @@ -632,6 +654,7 @@ rest_and_return:
> * @zonelist: zonelist pointer
> * @gfp_mask: memory allocation flags
> * @order: amount of memory being requested as a power of 2
> + * @nodemask: nodemask passed to page allocator
> *
> * If we run out of memory, we have the choice between either
> * killing a random task (bad), letting the system crash (worse)
> @@ -660,24 +683,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> */
> constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> read_lock(&tasklist_lock);
> -
> - switch (constraint) {
> - case CONSTRAINT_MEMORY_POLICY:
> - oom_kill_process(current, gfp_mask, order, 0, NULL,
> - "No available memory (MPOL_BIND)");
> - break;
> -
> - case CONSTRAINT_NONE:
> - if (sysctl_panic_on_oom) {
> + if (unlikely(sysctl_panic_on_oom)) {
> + /*
> + * panic_on_oom only affects CONSTRAINT_NONE, the kernel
> + * should not panic for cpuset or mempolicy induced memory
> + * failures.
> + */
> + if (constraint == CONSTRAINT_NONE) {
> dump_header(NULL, gfp_mask, order, NULL);
> - panic("out of memory. panic_on_oom is selected\n");
> + panic("Out of memory: panic_on_oom is enabled\n");

enabled? Its feature is enabled at boot time. triggered? or fired?


> }
> - /* Fall-through */
> - case CONSTRAINT_CPUSET:
> - __out_of_memory(gfp_mask, order);
> - break;
> }
> -
> + __out_of_memory(gfp_mask, order, constraint, nodemask);
> read_unlock(&tasklist_lock);
>
> /*
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>


2010-02-15 08:05:55

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

> This a complete rewrite of the oom killer's badness() heuristic which is
> used to determine which task to kill in oom conditions. The goal is to
> make it as simple and predictable as possible so the results are better
> understood and we end up killing the task which will lead to the most
> memory freeing while still respecting the fine-tuning from userspace.
>
> The baseline for the heuristic is a proportion of memory that each task
> is currently using in memory plus swap compared to the amount of
> "allowable" memory. "Allowable," in this sense, means the system-wide
> resources for unconstrained oom conditions, the set of mempolicy nodes,
> the mems attached to current's cpuset, or a memory controller's limit.
> The proportion is given on a scale of 0 (never kill) to 1000 (always
> kill), roughly meaning that if a task has a badness() score of 500 that
> the task consumes approximately 50% of allowable memory resident in RAM
> or in swap space.
>
> The proportion is always relative to the amount of "allowable" memory and
> not the total amount of RAM systemwide so that mempolicies and cpusets
> may operate in isolation; they shall not need to know the true size of
> the machine on which they are running if they are bound to a specific set
> of nodes or mems, respectively.
>
> Forkbomb detection is done in a completely different way: a threshold is
> configurable from userspace to determine how many first-generation execve
> children (those with their own address spaces) a task may have before it
> is considered a forkbomb. This can be tuned by altering the value in
> /proc/sys/vm/oom_forkbomb_thres, which defaults to 1000.
>
> When a task has more than 1000 first-generation children with different
> address spaces than itself, a penalty of
>
> (average rss of children) * (# of 1st generation execve children)
> -----------------------------------------------------------------
> oom_forkbomb_thres
>
> is assessed. So, for example, using the default oom_forkbomb_thres of
> 1000, the penalty is twice the average rss of all its execve children if
> there are 2000 such tasks. A task is considered to count toward the
> threshold if its total runtime is less than one second; for 1000 of such
> tasks to exist, the parent process must be forking at an extremely high
> rate either erroneously or maliciously.
>
> Even though a particular task may be designated a forkbomb and selected
> as the victim, the oom killer will still kill the 1st generation execve
> child with the highest badness() score in its place. The avoids killing
> important servers or system daemons.
>
> Root tasks are given 3% extra memory just like __vm_enough_memory()
> provides in LSMs. In the event of two tasks consuming similar amounts of
> memory, it is generally better to save root's task.
>
> Because of the change in the badness() heuristic's baseline, a change
> must also be made to the user interface used to tune it. Instead of a
> scale from -16 to +15 to indicate a bitshift on the point value given to
> a task, which was very difficult to tune accurately or with any degree of
> precision, /proc/pid/oom_adj now ranges from -1000 to +1000. That is, it
> can be used to polarize the heuristic such that certain tasks are never
> considered for oom kill while others are always considered. The value is
> added directly into the badness() score so a value of -500, for example,
> means to discount 50% of its memory consumption in comparison to other
> tasks either on the system, bound to the mempolicy, or in the cpuset.
>
> OOM_ADJUST_MIN and OOM_ADJUST_MAX have been exported to userspace since
> 2006 via include/linux/oom.h. This alters their values from -16 to -1000
> and from +15 to +1000, respectively. OOM_DISABLE is now the equivalent
> of the lowest possible value, OOM_ADJUST_MIN. Adding its value, -1000,
> to any badness score will always return 0.
>
> Although redefining these values may be controversial, it is much easier
> to understand when the units are fully understood as described above.
> In the short-term, there may be userspace breakage for tasks that
> hardcode -17 meaning OOM_DISABLE, for example, but the long-term will
> make the semantics much easier to understand and oom killing much more
> effective.

I NAK this patch as same as many other people. This patch bring to a lot of
compatibility issue. and it isn't necessary.


2010-02-15 08:09:58

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 5/7 -mm] oom: replace sysctls with quick mode

> Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were
> implemented for very large systems to avoid excessively long tasklist
> scans. The former suppresses helpful diagnostic messages that are
> emitted for each thread group leader that are candidates for oom kill
> including their pid, uid, vm size, rss, oom_adj value, and name; this
> information is very helpful to users in understanding why a particular
> task was chosen for kill over others. The latter simply kills current,
> the task triggering the oom condition, instead of iterating through the
> tasklist looking for the worst offender.
>
> Both of these sysctls are combined into one for use on the aforementioned
> large systems: oom_kill_quick. This disables the now-default
> oom_dump_tasks and kills current whenever the oom killer is called.
>
> The oom killer rewrite is the perfect opportunity to combine both sysctls
> into one instead of carrying around the others for years to come for
> nothing else than legacy purposes.

"_quick" is always bad sysctl name. instead, turnning oom_dump_tasks on
by default is better.

plus, this patch makes unnecessary compatibility issue.



>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> Documentation/sysctl/vm.txt | 44 +++++-------------------------------------
> include/linux/oom.h | 3 +-
> kernel/sysctl.c | 13 ++---------
> mm/oom_kill.c | 9 +++----
> 4 files changed, 14 insertions(+), 55 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -43,9 +43,8 @@ Currently, these files are in /proc/sys/vm:
> - nr_pdflush_threads
> - nr_trim_pages (only if CONFIG_MMU=n)
> - numa_zonelist_order
> -- oom_dump_tasks
> - oom_forkbomb_thres
> -- oom_kill_allocating_task
> +- oom_kill_quick
> - overcommit_memory
> - overcommit_ratio
> - page-cluster
> @@ -470,27 +469,6 @@ this is causing problems for your system/application.
>
> ==============================================================
>
> -oom_dump_tasks
> -
> -Enables a system-wide task dump (excluding kernel threads) to be
> -produced when the kernel performs an OOM-killing and includes such
> -information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and
> -name. This is helpful to determine why the OOM killer was invoked
> -and to identify the rogue task that caused it.
> -
> -If this is set to zero, this information is suppressed. On very
> -large systems with thousands of tasks it may not be feasible to dump
> -the memory state information for each one. Such systems should not
> -be forced to incur a performance penalty in OOM conditions when the
> -information may not be desired.
> -
> -If this is set to non-zero, this information is shown whenever the
> -OOM killer actually kills a memory-hogging task.
> -
> -The default value is 0.
> -
> -==============================================================
> -
> oom_forkbomb_thres
>
> This value defines how many children with a seperate address space a specific
> @@ -511,22 +489,12 @@ The default value is 1000.
>
> ==============================================================
>
> -oom_kill_allocating_task
> -
> -This enables or disables killing the OOM-triggering task in
> -out-of-memory situations.
> -
> -If this is set to zero, the OOM killer will scan through the entire
> -tasklist and select a task based on heuristics to kill. This normally
> -selects a rogue memory-hogging task that frees up a large amount of
> -memory when killed.
> -
> -If this is set to non-zero, the OOM killer simply kills the task that
> -triggered the out-of-memory condition. This avoids the expensive
> -tasklist scan.
> +oom_kill_quick
>
> -If panic_on_oom is selected, it takes precedence over whatever value
> -is used in oom_kill_allocating_task.
> +When enabled, this will always kill the task that triggered the oom killer, i.e.
> +the task that attempted to allocate memory that could not be found. It also
> +suppresses the tasklist dump to the kernel log whenever the oom killer is
> +called. Typically set on systems with an extremely large number of tasks.
>
> The default value is 0.
>
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -51,8 +51,7 @@ static inline void oom_killer_enable(void)
> }
> /* for sysctl */
> extern int sysctl_panic_on_oom;
> -extern int sysctl_oom_kill_allocating_task;
> -extern int sysctl_oom_dump_tasks;
> +extern int sysctl_oom_kill_quick;
> extern int sysctl_oom_forkbomb_thres;
>
> #endif /* __KERNEL__*/
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -949,16 +949,9 @@ static struct ctl_table vm_table[] = {
> .proc_handler = proc_dointvec,
> },
> {
> - .procname = "oom_kill_allocating_task",
> - .data = &sysctl_oom_kill_allocating_task,
> - .maxlen = sizeof(sysctl_oom_kill_allocating_task),
> - .mode = 0644,
> - .proc_handler = proc_dointvec,
> - },
> - {
> - .procname = "oom_dump_tasks",
> - .data = &sysctl_oom_dump_tasks,
> - .maxlen = sizeof(sysctl_oom_dump_tasks),
> + .procname = "oom_kill_quick",
> + .data = &sysctl_oom_kill_quick,
> + .maxlen = sizeof(sysctl_oom_kill_quick),
> .mode = 0644,
> .proc_handler = proc_dointvec,
> },
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -32,9 +32,8 @@
> #include <linux/security.h>
>
> int sysctl_panic_on_oom;
> -int sysctl_oom_kill_allocating_task;
> -int sysctl_oom_dump_tasks;
> int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
> +int sysctl_oom_kill_quick;
> static DEFINE_SPINLOCK(zone_scan_lock);
>
> /*
> @@ -397,7 +396,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
> dump_stack();
> mem_cgroup_print_oom_info(mem, p);
> show_mem();
> - if (sysctl_oom_dump_tasks)
> + if (!sysctl_oom_kill_quick)
> dump_tasks(mem);
> }
>
> @@ -604,9 +603,9 @@ static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
> struct task_struct *p;
> unsigned int points;
>
> - if (sysctl_oom_kill_allocating_task)
> + if (sysctl_oom_kill_quick)
> if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
> - NULL, "Out of memory (oom_kill_allocating_task)"))
> + NULL, "Out of memory (quick mode)"))
> return;
> retry:
> /*
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>


2010-02-15 08:29:36

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 6/7 -mm] oom: avoid oom killer for lowmem allocations

> If memory has been depleted in lowmem zones even with the protection
> afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
> killing current users will help. The memory is either reclaimable (or
> migratable) already, in which case we should not invoke the oom killer at
> all, or it is pinned by an application for I/O. Killing such an
> application may leave the hardware in an unspecified state and there is
> no guarantee that it will be able to make a timely exit.
>
> Lowmem allocations are now failed in oom conditions so that the task can
> perhaps recover or try again later. Killing current is an unnecessary
> result for simply making a GFP_DMA or GFP_DMA32 page allocation and no
> lowmem allocations use the now-deprecated __GFP_NOFAIL bit so retrying is
> unnecessary.
>
> Previously, the heuristic provided some protection for those tasks with
> CAP_SYS_RAWIO, but this is no longer necessary since we will not be
> killing tasks for the purposes of ISA allocations.

The main difference of Kamezawasan's patch is, his patch treated DMA
zone is filled by mlocked page too.
but I personally think such case should be solved auto page migration
mechanism. (probably, mel's memory compaction patch provide its base
infrastructure). So this patch seems enough and proper.

Reviewed-by: KOSAKI Motohiro <[email protected]>


>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> mm/page_alloc.c | 3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1914,6 +1914,9 @@ rebalance:
> * running out of options and have to consider going OOM
> */
> if (!did_some_progress) {
> + /* The oom killer won't necessarily free lowmem */
> + if (high_zoneidx < ZONE_NORMAL)
> + goto nopage;
> if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> if (oom_killer_disabled)
> goto nopage;
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>


2010-02-15 08:31:44

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 7/7 -mm] oom: remove unnecessary code and cleanup

> On Fri, 12 Feb 2010, KAMEZAWA Hiroyuki wrote:
>
> > > Remove the redundancy in __oom_kill_task() since:
> > >
> > > - init can never be passed to this function: it will never be PF_EXITING
> > > or selectable from select_bad_process(), and
> > >
> > > - it will never be passed a task from oom_kill_task() without an ->mm
> > > and we're unconcerned about detachment from exiting tasks, there's no
> > > reason to protect them against SIGKILL or access to memory reserves.
> > >
> > > Also moves the kernel log message to a higher level since the verbosity
> > > is not always emitted here; we need not print an error message if an
> > > exiting task is given a longer timeslice.
> > >
> > > Signed-off-by: David Rientjes <[email protected]>
> >
> > If you say "never", it's better to add BUG_ON() rather than
> > if (!p->mm)...
> >
>
> As the description says, oom_kill_task() never passes __oom_kill_task() a
> task, p, where !p->mm, but it doesn't imply that p cannot detach its ->mm
> before __oom_kill_task() gets a chance to run. The point is that we don't
> really care about giving it access to memory reserves anymore since it's
> exiting and won't be allocating anything. Warning about that scenario is
> unnecessary and would simply spam the kernel log, a recall to the oom
> killer would no longer select this task in case the oom condition persists
> anyway.

I agree this description is correct and this code is unnecessary.
Reviewed-by: KOSAKI Motohiro <[email protected]>


>
> > But yes, this patch seesm to remove unnecessary codes.
> > Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> >
>
> Thanks!
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>


2010-02-15 21:54:51

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Sat, 13 Feb 2010, Minchan Kim wrote:

> > The oom killer is not the appropriate place for a kernel forkbomb policy
> > to be implemented, you'd need to address that concern in the scheduler.
>
> I agree. but your's patch try to implement policy(avg rss of children < HZ)
> in oom killer as well as detection.
> so I pointed out that.

That's not what's used, we detect whether a child should be included in
the forkbomb count by checking for two traits: (i) it doesn't share an
->mm with the parent, otherwise it wouldn't free any memory unless the
parent was killed as well, and (ii) its total runtime is less than a
second since threads in forkbomb scenarios don't typically get any
runtime. The _penalization_ is then the average rss of those children
times how many times the count exceeds oom_forkbomb_thres.

> I think if we want to implement it, we also consider above scenario.
> As you said, it would be better to detect forkbom in scheduler.
> Then, let's remove forkbomb detection in OOM killer.
> Afterward, we can implement it in scheduler and can use it in OOM killer.
>

We're not enforcing a global, system-wide forkbomb policy in the oom
killer, but we do need to identify tasks that fork a very large number of
tasks to break ties with other tasks: in other words, it would not be
helpful to kill an application that has been running for weeks because
another application with the same or less memory usage has forked 1000
children and has caused an oom condition. That unfairly penalizes the
former application that is actually doing work.

Again, I'd encourage you to look at this as only a slight penalization
rather than a policy that strictly needs to be enforced. If it were
strictly enforced, it would be a prerequisite for selection if such a task
were to exist; in my implementation, it is part of the heuristic.

> > That doesn't work with Rik's example of a webserver that forks a large
> > number of threads to handle client connections.  It is _always_ better to
> > kill a child instead of making the entire webserver unresponsive.
>
> In such case, admin have to handle it by oom_forkbom_thres.
> Isn't it your goal?
>

oom_forkbomb_thres has a default value, which is 1000, so it should be
enabled by default.

> My suggestion is how handle buggy forkbomb processes which make
> system almost hang by user's mistake. :)
>

I don't think you've given a clear description (or, even better, a patch)
of your suggestion.

2010-02-15 22:01:26

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 6/7 -mm] oom: avoid oom killer for lowmem allocations

On Mon, 15 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > I can't agree with that assessment, I don't think it's a desired result to
> > ever panic the machine regardless of what /proc/sys/vm/panic_on_oom is set
> > to because a lowmem page allocation fails especially considering, as
> > mentioned in the changelog, these allocations are never __GFP_NOFAIL and
> > returning NULL is acceptable.
> >
> please add
> WARN_ON((high_zoneidx < ZONE_NORMAL) && (gfp_mask & __GFP_NOFAIL))
> somewhere. Then, it seems your patch makes sense.
>

high_zoneidx < ZONE_NORMAL is not the only case where this exists: it
exists for __GFP_NOFAIL allocations that are not __GFP_FS as well and has
for years, no special handling is now needed.

There should be no cases of either (GFP_DMA | __GFP_NOFAIL, or
GFP_NOFS | __GFP_NOFAIL) in my audit of the kernel code. And since
__GFP_NOFAIL is not to be added anymore (see Andrew's dab48dab), there's
no real reason to add a WARN_ON() here.

> I don't like the "possibility" of inifinte loops.
>

The possibility of infinite loops has always existed in the page allocator
for __GFP_NOFAIL allocations, that's precisely why it's deprecated and
eventually we seek to remove it.

2010-02-15 22:06:57

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 1/7 -mm] oom: filter tasks not sharing the same cpuset

On Mon, 15 Feb 2010, KOSAKI Motohiro wrote:

> > Tasks that do not share the same set of allowed nodes with the task that
> > triggered the oom should not be considered as candidates for oom kill.
> >
> > Tasks in other cpusets with a disjoint set of mems would be unfairly
> > penalized otherwise because of oom conditions elsewhere; an extreme
> > example could unfairly kill all other applications on the system if a
> > single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> > more memory than allowed.
> >
> > Killing tasks outside of current's cpuset rarely would free memory for
> > current anyway.
> >
> > Signed-off-by: David Rientjes <[email protected]>
>
> This patch does right thing and looks promissing. but unfortunately
> I have to NAK this patch temporary.
>
> This patch is nearly just revert of the commit 7887a3da75. We have to
> dig archaeology mail log and find why this reverting don't cause
> the old pain again.
>

Nick is probably wondering why I cc'd him on this patchset, and this is it
:)

We now determine whether an allocation is constrained by a cpuset by
iterating through the zonelist and checking
cpuset_zone_allowed_softwall(). This checks for the necessary cpuset
restrictions that we need to validate (the GFP_ATOMIC exception is
irrelevant, we don't call into the oom killer for those). We don't need
to kill outside of its cpuset because we're not guaranteed to find any
memory on those nodes, in fact it allows for needless oom killing if a
task sets all of its threads to have OOM_DISABLE in its own cpuset and
then runs out of memory. The oom killer would have killed every other
user task on the system even though the offending application can't
allocate there. That's certainly an undesired result and needs to be
fixed in this manner.

2010-02-15 22:11:52

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 3/7 -mm] oom: select task from tasklist for mempolicy ooms

On Mon, 15 Feb 2010, KOSAKI Motohiro wrote:

> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1638,6 +1638,45 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> > }
> > #endif
> >
> > +/*
> > + * mempolicy_nodemask_intersects
> > + *
> > + * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
> > + * policy. Otherwise, check for intersection between mask and the policy
> > + * nodemask for 'bind' or 'interleave' policy, or mask to contain the single
> > + * node for 'preferred' or 'local' policy.
> > + */
> > +bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> > + const nodemask_t *mask)
> > +{
> > + struct mempolicy *mempolicy;
> > + bool ret = true;
> > +
> > + mempolicy = tsk->mempolicy;
> > + mpol_get(mempolicy);
>
> Why is this refcount increment necessary? mempolicy is grabbed by tsk,
> IOW it never be freed in this function.
>

We need to get a refcount on the mempolicy to ensure it doesn't get freed
from under us, tsk is not necessarily current.

>
> > + if (!mask || !mempolicy)
> > + goto out;
> > +
> > + switch (mempolicy->mode) {
> > + case MPOL_PREFERRED:
> > + if (mempolicy->flags & MPOL_F_LOCAL)
> > + ret = node_isset(numa_node_id(), *mask);
>
> Um? Is this good heuristic?
> The task can migrate various cpus, then "node_isset(numa_node_id(), *mask) == 0"
> doesn't mean the task doesn't consume *mask's memory.
>

For MPOL_F_LOCAL, we need to check whether the task's cpu is on a node
that is allowed by the zonelist passed to the page allocator. In the
second revision of this patchset, this was changed to

node_isset(cpu_to_node(task_cpu(tsk)), *mask)

to check. It would be possible for no memory to have been allocated on
that node and it just happens that the tsk is running on it momentarily,
but it's the best indication we have given the mempolicy of whether
killing a task may lead to future memory freeing.

> > @@ -660,24 +683,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > */
> > constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> > read_lock(&tasklist_lock);
> > -
> > - switch (constraint) {
> > - case CONSTRAINT_MEMORY_POLICY:
> > - oom_kill_process(current, gfp_mask, order, 0, NULL,
> > - "No available memory (MPOL_BIND)");
> > - break;
> > -
> > - case CONSTRAINT_NONE:
> > - if (sysctl_panic_on_oom) {
> > + if (unlikely(sysctl_panic_on_oom)) {
> > + /*
> > + * panic_on_oom only affects CONSTRAINT_NONE, the kernel
> > + * should not panic for cpuset or mempolicy induced memory
> > + * failures.
> > + */
> > + if (constraint == CONSTRAINT_NONE) {
> > dump_header(NULL, gfp_mask, order, NULL);
> > - panic("out of memory. panic_on_oom is selected\n");
> > + panic("Out of memory: panic_on_oom is enabled\n");
>
> enabled? Its feature is enabled at boot time. triggered? or fired?
>

The panic_on_oom sysctl is "enabled" if it is set to non-zero; that's the
word used throughout Documentation/sysctl/vm.txt to describe when a sysctl
is being used or not.

2010-02-15 22:15:51

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 5/7 -mm] oom: replace sysctls with quick mode

On Mon, 15 Feb 2010, KOSAKI Motohiro wrote:

> > Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were
> > implemented for very large systems to avoid excessively long tasklist
> > scans. The former suppresses helpful diagnostic messages that are
> > emitted for each thread group leader that are candidates for oom kill
> > including their pid, uid, vm size, rss, oom_adj value, and name; this
> > information is very helpful to users in understanding why a particular
> > task was chosen for kill over others. The latter simply kills current,
> > the task triggering the oom condition, instead of iterating through the
> > tasklist looking for the worst offender.
> >
> > Both of these sysctls are combined into one for use on the aforementioned
> > large systems: oom_kill_quick. This disables the now-default
> > oom_dump_tasks and kills current whenever the oom killer is called.
> >
> > The oom killer rewrite is the perfect opportunity to combine both sysctls
> > into one instead of carrying around the others for years to come for
> > nothing else than legacy purposes.
>
> "_quick" is always bad sysctl name.

Why? It does exactly what it says: it kills current without doing an
expensive tasklist scan and suppresses the possibly long tasklist dump.
That's the oom killer's "quick mode."

> instead, turnning oom_dump_tasks on
> by default is better.
>

It's now on by default and can be disabled by enabling oom_kill_quick.

> plus, this patch makes unnecessary compatibility issue.
>

It's the perfect opportunity when totally rewriting the oom killer to
combine two sysctls with the exact same users into one. Users will notice
that the tasklist is always dumped now (we're defaulting oom_dump_tasks
to be enabled), so there is no reason why we can't remove oom_dump_tasks,
we're just giving them a new way to disable it. oom_kill_allocating_task
no longer always means what it once did: with the mempolicy-constrained
oom rewrite, we now iterate the tasklist for such cases to kill a task.
So users need to reassess whether this should be set if all tasks on the
system are constrained by mempolicies, a typical configuration for
extremely large systems.

2010-02-16 04:52:11

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 1/7 -mm] oom: filter tasks not sharing the same cpuset

> On Mon, 15 Feb 2010, KOSAKI Motohiro wrote:
>
> > > Tasks that do not share the same set of allowed nodes with the task that
> > > triggered the oom should not be considered as candidates for oom kill.
> > >
> > > Tasks in other cpusets with a disjoint set of mems would be unfairly
> > > penalized otherwise because of oom conditions elsewhere; an extreme
> > > example could unfairly kill all other applications on the system if a
> > > single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> > > more memory than allowed.
> > >
> > > Killing tasks outside of current's cpuset rarely would free memory for
> > > current anyway.
> > >
> > > Signed-off-by: David Rientjes <[email protected]>
> >
> > This patch does right thing and looks promissing. but unfortunately
> > I have to NAK this patch temporary.
> >
> > This patch is nearly just revert of the commit 7887a3da75. We have to
> > dig archaeology mail log and find why this reverting don't cause
> > the old pain again.
> >
>
> Nick is probably wondering why I cc'd him on this patchset, and this is it
> :)

Good decision :)

>
> We now determine whether an allocation is constrained by a cpuset by
> iterating through the zonelist and checking
> cpuset_zone_allowed_softwall(). This checks for the necessary cpuset
> restrictions that we need to validate (the GFP_ATOMIC exception is
> irrelevant, we don't call into the oom killer for those). We don't need
> to kill outside of its cpuset because we're not guaranteed to find any
> memory on those nodes, in fact it allows for needless oom killing if a
> task sets all of its threads to have OOM_DISABLE in its own cpuset and
> then runs out of memory. The oom killer would have killed every other
> user task on the system even though the offending application can't
> allocate there. That's certainly an undesired result and needs to be
> fixed in this manner.

But this explanation is irrelevant and meaningless. CPUSET can change
restricted node dynamically. So, the tsk->mempolicy at oom time doesn't
represent the place of task's usage memory. plus, OOM_DISABLE can
always makes undesirable result. it's not special in this case.

The fact is, both current and your heuristics have a corner case. it's
obvious. (I haven't seen corner caseless heuristics). then talking your
patch's merit doesn't help to merge the patch. The most important thing
is, we keep no regression. personally, I incline your one. but It doesn't
mean we can ignore its demerit.


2010-02-16 05:15:24

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 3/7 -mm] oom: select task from tasklist for mempolicy ooms

> On Mon, 15 Feb 2010, KOSAKI Motohiro wrote:
>
> > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > --- a/mm/mempolicy.c
> > > +++ b/mm/mempolicy.c
> > > @@ -1638,6 +1638,45 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> > > }
> > > #endif
> > >
> > > +/*
> > > + * mempolicy_nodemask_intersects
> > > + *
> > > + * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
> > > + * policy. Otherwise, check for intersection between mask and the policy
> > > + * nodemask for 'bind' or 'interleave' policy, or mask to contain the single
> > > + * node for 'preferred' or 'local' policy.
> > > + */
> > > +bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> > > + const nodemask_t *mask)
> > > +{
> > > + struct mempolicy *mempolicy;
> > > + bool ret = true;
> > > +
> > > + mempolicy = tsk->mempolicy;
> > > + mpol_get(mempolicy);
> >
> > Why is this refcount increment necessary? mempolicy is grabbed by tsk,
> > IOW it never be freed in this function.
>
> We need to get a refcount on the mempolicy to ensure it doesn't get freed
> from under us, tsk is not necessarily current.

Hm.
if you explanation is correct, I think your patch have following race.


CPU0 CPU1
----------------------------------------------
mempolicy_nodemask_intersects()
mempolicy = tsk->mempolicy;
do_exit()
mpol_put(tsk_mempolicy)
mpol_get(mempolicy);




> > > + if (!mask || !mempolicy)
> > > + goto out;
> > > +
> > > + switch (mempolicy->mode) {
> > > + case MPOL_PREFERRED:
> > > + if (mempolicy->flags & MPOL_F_LOCAL)
> > > + ret = node_isset(numa_node_id(), *mask);
> >
> > Um? Is this good heuristic?
> > The task can migrate various cpus, then "node_isset(numa_node_id(), *mask) == 0"
> > doesn't mean the task doesn't consume *mask's memory.
> >
>
> For MPOL_F_LOCAL, we need to check whether the task's cpu is on a node
> that is allowed by the zonelist passed to the page allocator. In the
> second revision of this patchset, this was changed to
>
> node_isset(cpu_to_node(task_cpu(tsk)), *mask)
>
> to check. It would be possible for no memory to have been allocated on
> that node and it just happens that the tsk is running on it momentarily,
> but it's the best indication we have given the mempolicy of whether
> killing a task may lead to future memory freeing.

This calculation is still broken. In general, running cpu and allocation node
is not bound.
We can't know such task use which node memory because MPOL_PREFERRED doesn't
bind allocation node. it only provide allocation hint.

case MPOL_PREFERRED:
ret = true;
break;

is better. (probably we can make some bonus to oom_badness, but it's irrelevant thing).


>
> > > @@ -660,24 +683,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > > */
> > > constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> > > read_lock(&tasklist_lock);
> > > -
> > > - switch (constraint) {
> > > - case CONSTRAINT_MEMORY_POLICY:
> > > - oom_kill_process(current, gfp_mask, order, 0, NULL,
> > > - "No available memory (MPOL_BIND)");
> > > - break;
> > > -
> > > - case CONSTRAINT_NONE:
> > > - if (sysctl_panic_on_oom) {
> > > + if (unlikely(sysctl_panic_on_oom)) {
> > > + /*
> > > + * panic_on_oom only affects CONSTRAINT_NONE, the kernel
> > > + * should not panic for cpuset or mempolicy induced memory
> > > + * failures.
> > > + */
> > > + if (constraint == CONSTRAINT_NONE) {
> > > dump_header(NULL, gfp_mask, order, NULL);
> > > - panic("out of memory. panic_on_oom is selected\n");
> > > + panic("Out of memory: panic_on_oom is enabled\n");
> >
> > enabled? Its feature is enabled at boot time. triggered? or fired?
>
> The panic_on_oom sysctl is "enabled" if it is set to non-zero; that's the
> word used throughout Documentation/sysctl/vm.txt to describe when a sysctl
> is being used or not.

Probably, you changed message meanings. I think the original one doesn't
intend to describe enable or disable. but it isn't big matter. I can accept it.


2010-02-16 05:25:40

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 5/7 -mm] oom: replace sysctls with quick mode

> On Mon, 15 Feb 2010, KOSAKI Motohiro wrote:
>
> > > Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were
> > > implemented for very large systems to avoid excessively long tasklist
> > > scans. The former suppresses helpful diagnostic messages that are
> > > emitted for each thread group leader that are candidates for oom kill
> > > including their pid, uid, vm size, rss, oom_adj value, and name; this
> > > information is very helpful to users in understanding why a particular
> > > task was chosen for kill over others. The latter simply kills current,
> > > the task triggering the oom condition, instead of iterating through the
> > > tasklist looking for the worst offender.
> > >
> > > Both of these sysctls are combined into one for use on the aforementioned
> > > large systems: oom_kill_quick. This disables the now-default
> > > oom_dump_tasks and kills current whenever the oom killer is called.
> > >
> > > The oom killer rewrite is the perfect opportunity to combine both sysctls
> > > into one instead of carrying around the others for years to come for
> > > nothing else than legacy purposes.
> >
> > "_quick" is always bad sysctl name.
>
> Why? It does exactly what it says: it kills current without doing an
> expensive tasklist scan and suppresses the possibly long tasklist dump.
> That's the oom killer's "quick mode."

Because, an administrator think "_quick" implies "please use it always".
plus, "quick" doesn't describe clealy meanings. oom_dump_tasks does.



> > instead, turnning oom_dump_tasks on
> > by default is better.
> >
>
> It's now on by default and can be disabled by enabling oom_kill_quick.
>
> > plus, this patch makes unnecessary compatibility issue.
> >
>
> It's the perfect opportunity when totally rewriting the oom killer to
> combine two sysctls with the exact same users into one. Users will notice
> that the tasklist is always dumped now (we're defaulting oom_dump_tasks
> to be enabled), so there is no reason why we can't remove oom_dump_tasks,
> we're just giving them a new way to disable it. oom_kill_allocating_task
> no longer always means what it once did: with the mempolicy-constrained
> oom rewrite, we now iterate the tasklist for such cases to kill a task.
> So users need to reassess whether this should be set if all tasks on the
> system are constrained by mempolicies, a typical configuration for
> extremely large systems.

No.
Your explanation doesn't answer why this change don't cause any comatibility
issue to _all_ user. Merely "opportunity" doesn't allow we ignore real world user.
I had made some incompatibility patch too, but all one have unavoidable reason.


2010-02-16 06:01:39

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 1/7 -mm] oom: filter tasks not sharing the same cpuset


typo.

> But this explanation is irrelevant and meaningless. CPUSET can change

s/CPUSET/mempolicy/

> restricted node dynamically. So, the tsk->mempolicy at oom time doesn't
> represent the place of task's usage memory. plus, OOM_DISABLE can
> always makes undesirable result. it's not special in this case.
>
> The fact is, both current and your heuristics have a corner case. it's
> obvious. (I haven't seen corner caseless heuristics). then talking your
> patch's merit doesn't help to merge the patch. The most important thing
> is, we keep no regression. personally, I incline your one. but It doesn't
> mean we can ignore its demerit.
>
>
>


2010-02-16 07:03:50

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/7 -mm] oom: filter tasks not sharing the same cpuset

On Tue, Feb 16, 2010 at 01:52:02PM +0900, KOSAKI Motohiro wrote:
> But this explanation is irrelevant and meaningless. CPUSET can change
> restricted node dynamically. So, the tsk->mempolicy at oom time doesn't
> represent the place of task's usage memory. plus, OOM_DISABLE can
> always makes undesirable result. it's not special in this case.
>
> The fact is, both current and your heuristics have a corner case. it's
> obvious. (I haven't seen corner caseless heuristics). then talking your
> patch's merit doesn't help to merge the patch. The most important thing
> is, we keep no regression. personally, I incline your one. but It doesn't
> mean we can ignore its demerit.

Yes we do need to explain the downside of the patch. It is a
heuristic and we can't call either approach perfect.

The fact is that even if 2 tasks are on completely disjoint
memory policies and never _allocate_ from one another's nodes,
you can still have one task pinning memory of the other task's
node.

Most shared and userspace-pinnable resources (pagecache, vfs
caches and fds files sockes etc) are allocated by first-touch
basically.

I don't see much usage of cpusets and oom killer first hand in
my experience, so I am happy to defer to others when it comes
to heuristics. Just so long as we are all aware of the full
story :)

2010-02-16 08:47:05

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 1/7 -mm] oom: filter tasks not sharing the same cpuset

On Tue, 16 Feb 2010, KOSAKI Motohiro wrote:

> > We now determine whether an allocation is constrained by a cpuset by
> > iterating through the zonelist and checking
> > cpuset_zone_allowed_softwall(). This checks for the necessary cpuset
> > restrictions that we need to validate (the GFP_ATOMIC exception is
> > irrelevant, we don't call into the oom killer for those). We don't need
> > to kill outside of its cpuset because we're not guaranteed to find any
> > memory on those nodes, in fact it allows for needless oom killing if a
> > task sets all of its threads to have OOM_DISABLE in its own cpuset and
> > then runs out of memory. The oom killer would have killed every other
> > user task on the system even though the offending application can't
> > allocate there. That's certainly an undesired result and needs to be
> > fixed in this manner.
>
> But this explanation is irrelevant and meaningless. CPUSET can change
> restricted node dynamically. So, the tsk->mempolicy at oom time doesn't
> represent the place of task's usage memory. plus, OOM_DISABLE can
> always makes undesirable result. it's not special in this case.
>

It depends whether memory_migrate is set or not when changing a cpuset's
set of mems. The point is that we cannot penalize tasks in cpusets with a
disjoint set of mems because another cpuset is out of memory. Unless a
candidate task will definitely free memory on a node that the zonelist
allows, we should not consider it because it may needlessly kill that
task, it would be better to kill current. Otherwise, our badness()
heuristic cannot possibly determine the optimal task to kill, anyway.

2010-02-16 08:49:22

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 1/7 -mm] oom: filter tasks not sharing the same cpuset

On Tue, 16 Feb 2010, Nick Piggin wrote:

> Yes we do need to explain the downside of the patch. It is a
> heuristic and we can't call either approach perfect.
>
> The fact is that even if 2 tasks are on completely disjoint
> memory policies and never _allocate_ from one another's nodes,
> you can still have one task pinning memory of the other task's
> node.
>
> Most shared and userspace-pinnable resources (pagecache, vfs
> caches and fds files sockes etc) are allocated by first-touch
> basically.
>
> I don't see much usage of cpusets and oom killer first hand in
> my experience, so I am happy to defer to others when it comes
> to heuristics. Just so long as we are all aware of the full
> story :)
>

Unless you can present a heuristic that will determine how much memory
usage a given task has allocated on nodes in current's zonelist, we must
exclude tasks from cpusets with a disjoint set of nodes, otherwise we
cannot determine the optimal task to kill. There's a strong possibility
that killing a task on a disjoint set of mems will never free memory for
current, making it a needless kill. That's a much more serious
consequence than not having the patch, in my opinion, than rather simply
killing current.

2010-02-16 09:04:28

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/7 -mm] oom: filter tasks not sharing the same cpuset

On Tue, Feb 16, 2010 at 12:49:14AM -0800, David Rientjes wrote:
> On Tue, 16 Feb 2010, Nick Piggin wrote:
>
> > Yes we do need to explain the downside of the patch. It is a
> > heuristic and we can't call either approach perfect.
> >
> > The fact is that even if 2 tasks are on completely disjoint
> > memory policies and never _allocate_ from one another's nodes,
> > you can still have one task pinning memory of the other task's
> > node.
> >
> > Most shared and userspace-pinnable resources (pagecache, vfs
> > caches and fds files sockes etc) are allocated by first-touch
> > basically.
> >
> > I don't see much usage of cpusets and oom killer first hand in
> > my experience, so I am happy to defer to others when it comes
> > to heuristics. Just so long as we are all aware of the full
> > story :)
> >
>
> Unless you can present a heuristic that will determine how much memory
> usage a given task has allocated on nodes in current's zonelist, we must
> exclude tasks from cpusets with a disjoint set of nodes, otherwise we
> cannot determine the optimal task to kill. There's a strong possibility
> that killing a task on a disjoint set of mems will never free memory for
> current, making it a needless kill. That's a much more serious
> consequence than not having the patch, in my opinion, than rather simply
> killing current.

I don't really agree with your black and white view. We equally
can't tell a lot of cases about who is pinning memory where. The
fact is that any task can be pinning memory and the heuristic
was specifically catering for that.

It's not an issue of yes/no, but of more/less probability. Anyway
I wasn't really arguing against your patch.

2010-02-16 09:04:53

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 5/7 -mm] oom: replace sysctls with quick mode

On Tue, 16 Feb 2010, KOSAKI Motohiro wrote:

> > > "_quick" is always bad sysctl name.
> >
> > Why? It does exactly what it says: it kills current without doing an
> > expensive tasklist scan and suppresses the possibly long tasklist dump.
> > That's the oom killer's "quick mode."
>
> Because, an administrator think "_quick" implies "please use it always".
> plus, "quick" doesn't describe clealy meanings. oom_dump_tasks does.
>

The audience for both of these tunables (now that oom_dump_tasks is
default to enabled) is users with extremely long tasklists that want to
avoid those scans, so oom_kill_quick implies that it won't waste any time
and will act how it's documented: simply kill current and move on.

2010-02-16 09:10:54

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 1/7 -mm] oom: filter tasks not sharing the same cpuset

On Tue, 16 Feb 2010, Nick Piggin wrote:

> I don't really agree with your black and white view. We equally
> can't tell a lot of cases about who is pinning memory where. The
> fact is that any task can be pinning memory and the heuristic
> was specifically catering for that.
>

That's a main source of criticism of the current heuristic: it needlessly
kills tasks. There is only one thing we know for certain: current is
trying to allocate memory on its nodes. We can either kill a task that
is allowed that same set or current itself; there's no evidence that
killing anything else will lead to memory freeing that will allow the
allocation to succeed. The heuristic will never perfectly select the task
that it should kill 100% of the time when playing around with mempolicy
nodes or cpuset mems, but relying on their current placement is a good
indicator of what is more likely than not to free memory of interest.

2010-02-16 13:15:08

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Mon, 2010-02-15 at 13:54 -0800, David Rientjes wrote:
> We're not enforcing a global, system-wide forkbomb policy in the oom
> killer, but we do need to identify tasks that fork a very large number of
> tasks to break ties with other tasks: in other words, it would not be
> helpful to kill an application that has been running for weeks because
> another application with the same or less memory usage has forked 1000
> children and has caused an oom condition. That unfairly penalizes the
> former application that is actually doing work.
>
> Again, I'd encourage you to look at this as only a slight penalization
> rather than a policy that strictly needs to be enforced. If it were
> strictly enforced, it would be a prerequisite for selection if such a task
> were to exist; in my implementation, it is part of the heuristic.

Okay. I can think it of slight penalization in this patch.
But in current OOM logic, we try to kill child instead of forkbomb
itself. My concern was that.
Of course, It's not a part of your patch[2/7] which is good.
It has been in there during long time. I hope we could solve that in
this chance. Pz, look at below my example.

>
> > > That doesn't work with Rik's example of a webserver that forks a large
> > > number of threads to handle client connections. It is _always_ better to
> > > kill a child instead of making the entire webserver unresponsive.
> >
> > In such case, admin have to handle it by oom_forkbom_thres.
> > Isn't it your goal?
> >
>
> oom_forkbomb_thres has a default value, which is 1000, so it should be
> enabled by default.
>
> > My suggestion is how handle buggy forkbomb processes which make
> > system almost hang by user's mistake. :)
> >
>
> I don't think you've given a clear description (or, even better, a patch)
> of your suggestion.

I write down my suggestion, again.
My concern is following as.


1. Forkbomb A task makes 2000 children in a second.
2. 2000 children has almost same memory usage. I know another factors
affect oom_score. but in here, I assume all of children have almost same
badness score.
3. Your heuristic penalizes A task so it would be detected as forkbomb.
4. So OOM killer select A task as bad task.
5. oom_kill_process kills high badness one of children, _NOT_ task A
itself. Unfortunately high badness child doesn't has big memory usage
compared to sibling. It means sooner or later we would need OOM again.


My point was 5.

1. oom_kill_process have to take a long time to scan tasklist for
selecting just one high badness task. Okay. It's right since OOM system
hang is much bad and it would be better to kill just first task(ie,
random one) in tasklist.

2. But in above scenario, sibling have almost same memory. So we would
need OOM again sooner or later and OOM logic could do above scenario
repeatably.

Yes. Our system is already unresponsible since time slice is spread out
many child tasks. Then in here, it would be better to kill dumb child
instead of BUGGY forkbomb task A? How long time do we have to wait
system responsible?

I said _BUGGY_ forkbomb task. That's because Rik's example isn't buggy
task. Administrator already knows apache can make many task in a second.
So he can handle it by your oom_forkbomb_thres knob. It's goal of your
knob.

So my suggestion is following as.

I assume normal forkbomb tasks are handled well by admin who use your
oom_forkbom_thres. The remained problem is just BUGGY forkbomb process.
So if your logic selects same victim task as forkbomb by your heuristic
and it's 5th time continuously in 10 second, let's kill forkbomb instead
of child.

tsk = select_victim_task(&cause);
if (tsk == last_victim_tsk && cause == BUGGY_FORKBOMB)
if (++count == 5 && time_since_first_detect_forkbomb <= 10*HZ)
kill(tsk);
else {
last_victim_tsk = NULL; count = 0; time_since... = 0;
kill(tsk's child);
}

It's just example of my concern. It might never good solution.
What I mean is just whether we have to care this.



--
Kind regards,
Minchan Kim

2010-02-16 21:41:45

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Tue, 16 Feb 2010, Minchan Kim wrote:

> > Again, I'd encourage you to look at this as only a slight penalization
> > rather than a policy that strictly needs to be enforced. If it were
> > strictly enforced, it would be a prerequisite for selection if such a task
> > were to exist; in my implementation, it is part of the heuristic.
>
> Okay. I can think it of slight penalization in this patch.
> But in current OOM logic, we try to kill child instead of forkbomb
> itself. My concern was that.

We still do with my rewrite, that is handled in oom_kill_process(). The
forkbomb penalization takes place in badness().

> 1. Forkbomb A task makes 2000 children in a second.
> 2. 2000 children has almost same memory usage. I know another factors
> affect oom_score. but in here, I assume all of children have almost same
> badness score.
> 3. Your heuristic penalizes A task so it would be detected as forkbomb.
> 4. So OOM killer select A task as bad task.
> 5. oom_kill_process kills high badness one of children, _NOT_ task A
> itself. Unfortunately high badness child doesn't has big memory usage
> compared to sibling. It means sooner or later we would need OOM again.
>

Couple points: killing a task with a comparatively small rss and swap
usage to the parent does not imply that we need the call the oom killer
again later, killing the child will allow for future memory freeing that
may be all that is necessary. If the parent continues to fork, that will
continue to be an issue, but the constant killing of its children should
allow the user to intervene without bring the system to a grinding halt.
I'd strongly prefer to kill a child from a forkbombing task, however, than
an innocent application that has been running for days or weeks only to
find that the forkbombing parent will consume its memory as well and then
need have its children killed. Secondly, the forkbomb detection does not
simply require 2000 children to be forked in a second, it requires
oom_forkbomb_thres children that have called execve(), i.e. they have
seperate address spaces, to have a runtime of less than one second.

> My point was 5.
>
> 1. oom_kill_process have to take a long time to scan tasklist for
> selecting just one high badness task. Okay. It's right since OOM system
> hang is much bad and it would be better to kill just first task(ie,
> random one) in tasklist.
>
> 2. But in above scenario, sibling have almost same memory. So we would
> need OOM again sooner or later and OOM logic could do above scenario
> repeatably.
>

In Rik's web server example, this is the preferred outcome: kill a thread
handling a single client connection rather than kill a "legitimate"
forkbombing server to make the entire service unresponsive.

> I said _BUGGY_ forkbomb task. That's because Rik's example isn't buggy
> task. Administrator already knows apache can make many task in a second.
> So he can handle it by your oom_forkbomb_thres knob. It's goal of your
> knob.
>

We can't force all web servers to tune oom_forkbomb_thres.

> So my suggestion is following as.
>
> I assume normal forkbomb tasks are handled well by admin who use your
> oom_forkbom_thres. The remained problem is just BUGGY forkbomb process.
> So if your logic selects same victim task as forkbomb by your heuristic
> and it's 5th time continuously in 10 second, let's kill forkbomb instead
> of child.
>
> tsk = select_victim_task(&cause);
> if (tsk == last_victim_tsk && cause == BUGGY_FORKBOMB)
> if (++count == 5 && time_since_first_detect_forkbomb <= 10*HZ)
> kill(tsk);
> else {
> last_victim_tsk = NULL; count = 0; time_since... = 0;
> kill(tsk's child);
> }
>
> It's just example of my concern. It might never good solution.
> What I mean is just whether we have to care this.
>

This unfairly penalizes tasks that have a large number of execve()
children, we can't possibly know how to define BUGGY_FORKBOMB. In other
words, a system-wide forkbombing policy in the oom killer will always have
a chance of killing a legitimate task, such as a web server, that will be
an undesired result. Setting the parent to OOM_DISABLE isn't really an
option in this case since that value is inherited by children and would
need to explicitly be cleared by each thread prior to execve(); this is
one of the reasons why I proposed /proc/pid/oom_adj_child a few months
ago, but it wasn't well received.

2010-02-16 21:53:01

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 3/7 -mm] oom: select task from tasklist for mempolicy ooms

On Tue, 16 Feb 2010, KOSAKI Motohiro wrote:

> > We need to get a refcount on the mempolicy to ensure it doesn't get freed
> > from under us, tsk is not necessarily current.
>
> Hm.
> if you explanation is correct, I think your patch have following race.
>
>
> CPU0 CPU1
> ----------------------------------------------
> mempolicy_nodemask_intersects()
> mempolicy = tsk->mempolicy;
> do_exit()
> mpol_put(tsk_mempolicy)
> mpol_get(mempolicy);
>

True, good point. It looks like we'll need to include mempolicy
detachment in exit_mm() while under task_lock() and then synchronize with
that. It's a legitimate place to do it since no memory allocation will be
done after its mm is detached, anyway.

> > For MPOL_F_LOCAL, we need to check whether the task's cpu is on a node
> > that is allowed by the zonelist passed to the page allocator. In the
> > second revision of this patchset, this was changed to
> >
> > node_isset(cpu_to_node(task_cpu(tsk)), *mask)
> >
> > to check. It would be possible for no memory to have been allocated on
> > that node and it just happens that the tsk is running on it momentarily,
> > but it's the best indication we have given the mempolicy of whether
> > killing a task may lead to future memory freeing.
>
> This calculation is still broken. In general, running cpu and allocation node
> is not bound.

Not sure what you mean, MPOL_F_LOCAL means that allocations will happen on
the node of the cpu on which it is running. The cpu-to-node mapping
doesn't change, only the cpu on which it is running may change. That may
be restricted by sched_setaffinity() or cpusets, however, so this task may
never allocate on any other node (i.e. it may run on another cpu, but
always one local to a specific node). That's enough of an indication that
it should be a candidate for kill: we're trying to eliminate tasks that
may never allocate on current's nodemask from consideration. In other
words, it would be unfair for two tasks that are isolated to their own
cpus on different physical nodes using MPOL_F_LOCAL for NUMA optimizations
to have the other needlessly killed when current can't allocate there
anyway.

2010-02-17 00:48:19

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 3/7 -mm] oom: select task from tasklist for mempolicy ooms

On Tue, 16 Feb 2010, David Rientjes wrote:

> True, good point. It looks like we'll need to include mempolicy
> detachment in exit_mm() while under task_lock() and then synchronize with
> that. It's a legitimate place to do it since no memory allocation will be
> done after its mm is detached, anyway.
>

Here's the updated version of the patch, what do you think?


oom: select task from tasklist for mempolicy ooms

The oom killer presently kills current whenever there is no more memory
free or reclaimable on its mempolicy's nodes. There is no guarantee that
current is a memory-hogging task or that killing it will free any
substantial amount of memory, however.

In such situations, it is better to scan the tasklist for nodes that are
allowed to allocate on current's set of nodes and kill the task with the
highest badness() score. This ensures that the most memory-hogging task,
or the one configured by the user with /proc/pid/oom_adj, is always
selected in such scenarios.

It is necessary to synchronize the detachment of task->mempolicy with
task_lock(task) to ensure it is not prematurely destroyed while a user is
operating on it.

Signed-off-by: David Rientjes <[email protected]>
---
include/linux/mempolicy.h | 13 +++++++-
kernel/exit.c | 8 ++--
mm/mempolicy.c | 44 +++++++++++++++++++++++++
mm/oom_kill.c | 77 +++++++++++++++++++++++++++-----------------
4 files changed, 107 insertions(+), 35 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -202,6 +202,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
unsigned long addr, gfp_t gfp_flags,
struct mempolicy **mpol, nodemask_t **nodemask);
extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
+extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+ const nodemask_t *mask);
extern unsigned slab_node(struct mempolicy *policy);

extern enum zone_type policy_zone;
@@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
return node_zonelist(0, gfp_flags);
}

-static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
+static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
+{
+ return false;
+}
+
+static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+ const nodemask_t *mask)
+{
+ return false;
+}

static inline int do_migrate_pages(struct mm_struct *mm,
const nodemask_t *from_nodes,
diff --git a/kernel/exit.c b/kernel/exit.c
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -689,6 +689,10 @@ static void exit_mm(struct task_struct * tsk)
enter_lazy_tlb(mm, current);
/* We don't want this task to be frozen prematurely */
clear_freeze_flag(tsk);
+#ifdef CONFIG_NUMA
+ mpol_put(tsk->mempolicy);
+ tsk->mempolicy = NULL;
+#endif
task_unlock(tsk);
mm_update_next_owner(mm);
mmput(mm);
@@ -993,10 +997,6 @@ NORET_TYPE void do_exit(long code)
perf_event_exit_task(tsk);

exit_notify(tsk, group_dead);
-#ifdef CONFIG_NUMA
- mpol_put(tsk->mempolicy);
- tsk->mempolicy = NULL;
-#endif
#ifdef CONFIG_FUTEX
if (unlikely(current->pi_state_cache))
kfree(current->pi_state_cache);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1638,6 +1638,50 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
}
#endif

+/*
+ * mempolicy_nodemask_intersects
+ *
+ * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
+ * policy. Otherwise, check for intersection between mask and the policy
+ * nodemask for 'bind' or 'interleave' policy. For 'perferred' or 'local'
+ * policy, always return true since it may allocate elsewhere on fallback.
+ *
+ * Takes task_lock(tsk) to prevent freeing of its mempolicy.
+ */
+bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+ const nodemask_t *mask)
+{
+ struct mempolicy *mempolicy;
+ bool ret = true;
+
+ if (!mask)
+ return ret;
+ task_lock(tsk);
+ mempolicy = tsk->mempolicy;
+ if (!mempolicy)
+ goto out;
+
+ switch (mempolicy->mode) {
+ case MPOL_PREFERRED:
+ /*
+ * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to
+ * allocate from, they may fallback to other nodes when oom.
+ * Thus, it's possible for tsk to have allocated memory from
+ * nodes in mask.
+ */
+ break;
+ case MPOL_BIND:
+ case MPOL_INTERLEAVE:
+ ret = nodes_intersects(mempolicy->v.nodes, *mask);
+ break;
+ default:
+ BUG();
+ }
+out:
+ task_unlock(tsk);
+ return ret;
+}
+
/* Allocate a page in interleaved policy.
Own path because it needs to do special accounting. */
static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -26,6 +26,7 @@
#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/memcontrol.h>
+#include <linux/mempolicy.h>
#include <linux/security.h>

int sysctl_panic_on_oom;
@@ -36,19 +37,35 @@ static DEFINE_SPINLOCK(zone_scan_lock);

/*
* Do all threads of the target process overlap our allowed nodes?
+ * @tsk: task struct of which task to consider
+ * @mask: nodemask passed to page allocator for mempolicy ooms
*/
-static int has_intersects_mems_allowed(struct task_struct *tsk)
+static bool has_intersects_mems_allowed(struct task_struct *tsk,
+ const nodemask_t *mask)
{
- struct task_struct *t;
+ struct task_struct *start = tsk;

- t = tsk;
do {
- if (cpuset_mems_allowed_intersects(current, t))
- return 1;
- t = next_thread(t);
- } while (t != tsk);
-
- return 0;
+ if (mask) {
+ /*
+ * If this is a mempolicy constrained oom, tsk's
+ * cpuset is irrelevant. Only return true if its
+ * mempolicy intersects current, otherwise it may be
+ * needlessly killed.
+ */
+ if (mempolicy_nodemask_intersects(tsk, mask))
+ return true;
+ } else {
+ /*
+ * This is not a mempolicy constrained oom, so only
+ * check the mems of tsk's cpuset.
+ */
+ if (cpuset_mems_allowed_intersects(current, tsk))
+ return true;
+ }
+ tsk = next_thread(tsk);
+ } while (tsk != start);
+ return false;
}

/**
@@ -236,7 +253,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
* (not docbooked, we don't want this one cluttering up the manual)
*/
static struct task_struct *select_bad_process(unsigned long *ppoints,
- struct mem_cgroup *mem)
+ struct mem_cgroup *mem, enum oom_constraint constraint,
+ const nodemask_t *mask)
{
struct task_struct *p;
struct task_struct *chosen = NULL;
@@ -258,7 +276,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
continue;
if (mem && !task_in_mem_cgroup(p, mem))
continue;
- if (!has_intersects_mems_allowed(p))
+ if (!has_intersects_mems_allowed(p,
+ constraint == CONSTRAINT_MEMORY_POLICY ? mask :
+ NULL))
continue;

/*
@@ -478,7 +498,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)

read_lock(&tasklist_lock);
retry:
- p = select_bad_process(&points, mem);
+ p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
if (PTR_ERR(p) == -1UL)
goto out;

@@ -560,7 +580,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
/*
* Must be called with tasklist_lock held for read.
*/
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, int order,
+ enum oom_constraint constraint, const nodemask_t *mask)
{
struct task_struct *p;
unsigned long points;
@@ -574,7 +595,7 @@ retry:
* Rambo mode: Shoot down a process and hope it solves whatever
* issues we may have.
*/
- p = select_bad_process(&points, NULL);
+ p = select_bad_process(&points, NULL, constraint, mask);

if (PTR_ERR(p) == -1UL)
return;
@@ -615,7 +636,8 @@ void pagefault_out_of_memory(void)
panic("out of memory from page fault. panic_on_oom is selected.\n");

read_lock(&tasklist_lock);
- __out_of_memory(0, 0); /* unknown gfp_mask and order */
+ /* unknown gfp_mask and order */
+ __out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
read_unlock(&tasklist_lock);

/*
@@ -632,6 +654,7 @@ rest_and_return:
* @zonelist: zonelist pointer
* @gfp_mask: memory allocation flags
* @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
*
* If we run out of memory, we have the choice between either
* killing a random task (bad), letting the system crash (worse)
@@ -660,24 +683,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
*/
constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
read_lock(&tasklist_lock);
-
- switch (constraint) {
- case CONSTRAINT_MEMORY_POLICY:
- oom_kill_process(current, gfp_mask, order, 0, NULL,
- "No available memory (MPOL_BIND)");
- break;
-
- case CONSTRAINT_NONE:
- if (sysctl_panic_on_oom) {
+ if (unlikely(sysctl_panic_on_oom)) {
+ /*
+ * panic_on_oom only affects CONSTRAINT_NONE, the kernel
+ * should not panic for cpuset or mempolicy induced memory
+ * failures.
+ */
+ if (constraint == CONSTRAINT_NONE) {
dump_header(NULL, gfp_mask, order, NULL);
- panic("out of memory. panic_on_oom is selected\n");
+ panic("Out of memory: panic_on_oom is enabled\n");
}
- /* Fall-through */
- case CONSTRAINT_CPUSET:
- __out_of_memory(gfp_mask, order);
- break;
}
-
+ __out_of_memory(gfp_mask, order, constraint, nodemask);
read_unlock(&tasklist_lock);

/*

2010-02-17 01:13:20

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 3/7 -mm] oom: select task from tasklist for mempolicy ooms

> On Tue, 16 Feb 2010, David Rientjes wrote:
>
> > True, good point. It looks like we'll need to include mempolicy
> > detachment in exit_mm() while under task_lock() and then synchronize with
> > that. It's a legitimate place to do it since no memory allocation will be
> > done after its mm is detached, anyway.
> >
>
> Here's the updated version of the patch, what do you think?

Looks good. probably we need to discuss about mlocked page issue (as Nick
pointed out) and determine how about care it. but I think this patch
can be treated independently. this is just forward step.

Reviewed-by: KOSAKI Motohiro <[email protected]>


>
>
> oom: select task from tasklist for mempolicy ooms
>
> The oom killer presently kills current whenever there is no more memory
> free or reclaimable on its mempolicy's nodes. There is no guarantee that
> current is a memory-hogging task or that killing it will free any
> substantial amount of memory, however.
>
> In such situations, it is better to scan the tasklist for nodes that are
> allowed to allocate on current's set of nodes and kill the task with the
> highest badness() score. This ensures that the most memory-hogging task,
> or the one configured by the user with /proc/pid/oom_adj, is always
> selected in such scenarios.
>
> It is necessary to synchronize the detachment of task->mempolicy with
> task_lock(task) to ensure it is not prematurely destroyed while a user is
> operating on it.
>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> include/linux/mempolicy.h | 13 +++++++-
> kernel/exit.c | 8 ++--
> mm/mempolicy.c | 44 +++++++++++++++++++++++++
> mm/oom_kill.c | 77 +++++++++++++++++++++++++++-----------------
> 4 files changed, 107 insertions(+), 35 deletions(-)
>
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -202,6 +202,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> unsigned long addr, gfp_t gfp_flags,
> struct mempolicy **mpol, nodemask_t **nodemask);
> extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
> +extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> + const nodemask_t *mask);
> extern unsigned slab_node(struct mempolicy *policy);
>
> extern enum zone_type policy_zone;
> @@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> return node_zonelist(0, gfp_flags);
> }
>
> -static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
> +static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
> +{
> + return false;
> +}
> +
> +static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> + const nodemask_t *mask)
> +{
> + return false;
> +}
>
> static inline int do_migrate_pages(struct mm_struct *mm,
> const nodemask_t *from_nodes,
> diff --git a/kernel/exit.c b/kernel/exit.c
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -689,6 +689,10 @@ static void exit_mm(struct task_struct * tsk)
> enter_lazy_tlb(mm, current);
> /* We don't want this task to be frozen prematurely */
> clear_freeze_flag(tsk);
> +#ifdef CONFIG_NUMA
> + mpol_put(tsk->mempolicy);
> + tsk->mempolicy = NULL;
> +#endif
> task_unlock(tsk);
> mm_update_next_owner(mm);
> mmput(mm);
> @@ -993,10 +997,6 @@ NORET_TYPE void do_exit(long code)
> perf_event_exit_task(tsk);
>
> exit_notify(tsk, group_dead);
> -#ifdef CONFIG_NUMA
> - mpol_put(tsk->mempolicy);
> - tsk->mempolicy = NULL;
> -#endif
> #ifdef CONFIG_FUTEX
> if (unlikely(current->pi_state_cache))
> kfree(current->pi_state_cache);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1638,6 +1638,50 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> }
> #endif
>
> +/*
> + * mempolicy_nodemask_intersects
> + *
> + * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
> + * policy. Otherwise, check for intersection between mask and the policy
> + * nodemask for 'bind' or 'interleave' policy. For 'perferred' or 'local'
> + * policy, always return true since it may allocate elsewhere on fallback.
> + *
> + * Takes task_lock(tsk) to prevent freeing of its mempolicy.
> + */
> +bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> + const nodemask_t *mask)
> +{
> + struct mempolicy *mempolicy;
> + bool ret = true;
> +
> + if (!mask)
> + return ret;
> + task_lock(tsk);
> + mempolicy = tsk->mempolicy;
> + if (!mempolicy)
> + goto out;
> +
> + switch (mempolicy->mode) {
> + case MPOL_PREFERRED:
> + /*
> + * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to
> + * allocate from, they may fallback to other nodes when oom.
> + * Thus, it's possible for tsk to have allocated memory from
> + * nodes in mask.
> + */
> + break;
> + case MPOL_BIND:
> + case MPOL_INTERLEAVE:
> + ret = nodes_intersects(mempolicy->v.nodes, *mask);
> + break;
> + default:
> + BUG();
> + }
> +out:
> + task_unlock(tsk);
> + return ret;
> +}
> +
> /* Allocate a page in interleaved policy.
> Own path because it needs to do special accounting. */
> static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -26,6 +26,7 @@
> #include <linux/module.h>
> #include <linux/notifier.h>
> #include <linux/memcontrol.h>
> +#include <linux/mempolicy.h>
> #include <linux/security.h>
>
> int sysctl_panic_on_oom;
> @@ -36,19 +37,35 @@ static DEFINE_SPINLOCK(zone_scan_lock);
>
> /*
> * Do all threads of the target process overlap our allowed nodes?
> + * @tsk: task struct of which task to consider
> + * @mask: nodemask passed to page allocator for mempolicy ooms
> */
> -static int has_intersects_mems_allowed(struct task_struct *tsk)
> +static bool has_intersects_mems_allowed(struct task_struct *tsk,
> + const nodemask_t *mask)
> {
> - struct task_struct *t;
> + struct task_struct *start = tsk;
>
> - t = tsk;
> do {
> - if (cpuset_mems_allowed_intersects(current, t))
> - return 1;
> - t = next_thread(t);
> - } while (t != tsk);
> -
> - return 0;
> + if (mask) {
> + /*
> + * If this is a mempolicy constrained oom, tsk's
> + * cpuset is irrelevant. Only return true if its
> + * mempolicy intersects current, otherwise it may be
> + * needlessly killed.
> + */
> + if (mempolicy_nodemask_intersects(tsk, mask))
> + return true;
> + } else {
> + /*
> + * This is not a mempolicy constrained oom, so only
> + * check the mems of tsk's cpuset.
> + */
> + if (cpuset_mems_allowed_intersects(current, tsk))
> + return true;
> + }
> + tsk = next_thread(tsk);
> + } while (tsk != start);
> + return false;
> }
>
> /**
> @@ -236,7 +253,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> * (not docbooked, we don't want this one cluttering up the manual)
> */
> static struct task_struct *select_bad_process(unsigned long *ppoints,
> - struct mem_cgroup *mem)
> + struct mem_cgroup *mem, enum oom_constraint constraint,
> + const nodemask_t *mask)
> {
> struct task_struct *p;
> struct task_struct *chosen = NULL;
> @@ -258,7 +276,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> continue;
> if (mem && !task_in_mem_cgroup(p, mem))
> continue;
> - if (!has_intersects_mems_allowed(p))
> + if (!has_intersects_mems_allowed(p,
> + constraint == CONSTRAINT_MEMORY_POLICY ? mask :
> + NULL))
> continue;
>
> /*
> @@ -478,7 +498,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>
> read_lock(&tasklist_lock);
> retry:
> - p = select_bad_process(&points, mem);
> + p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
> if (PTR_ERR(p) == -1UL)
> goto out;
>
> @@ -560,7 +580,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
> /*
> * Must be called with tasklist_lock held for read.
> */
> -static void __out_of_memory(gfp_t gfp_mask, int order)
> +static void __out_of_memory(gfp_t gfp_mask, int order,
> + enum oom_constraint constraint, const nodemask_t *mask)
> {
> struct task_struct *p;
> unsigned long points;
> @@ -574,7 +595,7 @@ retry:
> * Rambo mode: Shoot down a process and hope it solves whatever
> * issues we may have.
> */
> - p = select_bad_process(&points, NULL);
> + p = select_bad_process(&points, NULL, constraint, mask);
>
> if (PTR_ERR(p) == -1UL)
> return;
> @@ -615,7 +636,8 @@ void pagefault_out_of_memory(void)
> panic("out of memory from page fault. panic_on_oom is selected.\n");
>
> read_lock(&tasklist_lock);
> - __out_of_memory(0, 0); /* unknown gfp_mask and order */
> + /* unknown gfp_mask and order */
> + __out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
> read_unlock(&tasklist_lock);
>
> /*
> @@ -632,6 +654,7 @@ rest_and_return:
> * @zonelist: zonelist pointer
> * @gfp_mask: memory allocation flags
> * @order: amount of memory being requested as a power of 2
> + * @nodemask: nodemask passed to page allocator
> *
> * If we run out of memory, we have the choice between either
> * killing a random task (bad), letting the system crash (worse)
> @@ -660,24 +683,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> */
> constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> read_lock(&tasklist_lock);
> -
> - switch (constraint) {
> - case CONSTRAINT_MEMORY_POLICY:
> - oom_kill_process(current, gfp_mask, order, 0, NULL,
> - "No available memory (MPOL_BIND)");
> - break;
> -
> - case CONSTRAINT_NONE:
> - if (sysctl_panic_on_oom) {
> + if (unlikely(sysctl_panic_on_oom)) {
> + /*
> + * panic_on_oom only affects CONSTRAINT_NONE, the kernel
> + * should not panic for cpuset or mempolicy induced memory
> + * failures.
> + */
> + if (constraint == CONSTRAINT_NONE) {
> dump_header(NULL, gfp_mask, order, NULL);
> - panic("out of memory. panic_on_oom is selected\n");
> + panic("Out of memory: panic_on_oom is enabled\n");
> }
> - /* Fall-through */
> - case CONSTRAINT_CPUSET:
> - __out_of_memory(gfp_mask, order);
> - break;
> }
> -
> + __out_of_memory(gfp_mask, order, constraint, nodemask);
> read_unlock(&tasklist_lock);
>
> /*


2010-02-17 07:41:16

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Wed, Feb 17, 2010 at 6:41 AM, David Rientjes <[email protected]> wrote:
> On Tue, 16 Feb 2010, Minchan Kim wrote:
>
>> > Again, I'd encourage you to look at this as only a slight penalization
>> > rather than a policy that strictly needs to be enforced.  If it were
>> > strictly enforced, it would be a prerequisite for selection if such a task
>> > were to exist; in my implementation, it is part of the heuristic.
>>
>> Okay. I can think it of slight penalization in this patch.
>> But in current OOM logic, we try to kill child instead of forkbomb
>> itself. My concern was that.
>
> We still do with my rewrite, that is handled in oom_kill_process().  The
> forkbomb penalization takes place in badness().


I thought this patch is closely related to [patch 2/7].
I can move this discussion to [patch 2/7] if you want.
Another guys already pointed out why we care child.

>
>> 1. Forkbomb A task makes 2000 children in a second.
>> 2. 2000 children has almost same memory usage. I know another factors
>> affect oom_score. but in here, I assume all of children have almost same
>> badness score.
>> 3. Your heuristic penalizes A task so it would be detected as forkbomb.
>> 4. So OOM killer select A task as bad task.
>> 5. oom_kill_process kills high badness one of children, _NOT_ task A
>> itself. Unfortunately high badness child doesn't has big memory usage
>> compared to sibling. It means sooner or later we would need OOM again.
>>
>
> Couple points: killing a task with a comparatively small rss and swap
> usage to the parent does not imply that we need the call the oom killer
> again later, killing the child will allow for future memory freeing that
> may be all that is necessary.  If the parent continues to fork, that will
> continue to be an issue, but the constant killing of its children should
> allow the user to intervene without bring the system to a grinding halt.

I said this scenario is BUGGY forkbomb process. It will fork + exec continuously
if it isn't killed. How does user intervene to fix the system?
System was almost hang due to unresponsive.

For extreme example,
User is writing some important document by OpenOffice and
he decided to execute hackbench 1000000 process 1000000.

Could user save his important office data without halt if we kill
child continuously?
I think this scenario can be happened enough if the user didn't know
parameter of hackbench.

> I'd strongly prefer to kill a child from a forkbombing task, however, than
> an innocent application that has been running for days or weeks only to
> find that the forkbombing parent will consume its memory as well and then
> need have its children killed.  Secondly, the forkbomb detection does not

Okay.
consider my argue related to 2/7, pz.

> simply require 2000 children to be forked in a second, it requires
> oom_forkbomb_thres children that have called execve(), i.e. they have
> seperate address spaces, to have a runtime of less than one second.
>
>> My point was 5.
>>
>> 1. oom_kill_process have to take a long time to scan tasklist for
>> selecting just one high badness task. Okay. It's right since OOM system
>> hang is much bad and it would be better to kill just first task(ie,
>> random one) in tasklist.
>>
>> 2. But in above scenario, sibling have almost same memory. So we would
>> need OOM again sooner or later and OOM logic could do above scenario
>> repeatably.
>>
>
> In Rik's web server example, this is the preferred outcome: kill a thread
> handling a single client connection rather than kill a "legitimate"
> forkbombing server to make the entire service unresponsive.
>
>> I said _BUGGY_ forkbomb task. That's because Rik's example isn't buggy
>> task. Administrator already knows apache can make many task in a second.
>> So he can handle it by your oom_forkbomb_thres knob. It's goal of your
>> knob.
>>
>
> We can't force all web servers to tune oom_forkbomb_thres.
>
>> So my suggestion is following as.
>>
>> I assume normal forkbomb tasks are handled well by admin who use your
>> oom_forkbom_thres. The remained problem is just BUGGY forkbomb process.
>> So if your logic selects same victim task as forkbomb by your heuristic
>> and it's 5th time continuously in 10 second, let's kill forkbomb instead
>> of child.
>>
>> tsk = select_victim_task(&cause);
>> if (tsk == last_victim_tsk && cause == BUGGY_FORKBOMB)
>>       if (++count == 5 && time_since_first_detect_forkbomb <= 10*HZ)
>>               kill(tsk);
>> else {
>>    last_victim_tsk = NULL; count = 0; time_since... = 0;
>>    kill(tsk's child);
>> }
>>
>> It's just example of my concern. It might never good solution.
>> What I mean is just whether we have to care this.
>>
>
> This unfairly penalizes tasks that have a large number of execve()
> children, we can't possibly know how to define BUGGY_FORKBOMB.  In other
> words, a system-wide forkbombing policy in the oom killer will always have
> a chance of killing a legitimate task, such as a web server, that will be
> an undesired result.  Setting the parent to OOM_DISABLE isn't really an
> option in this case since that value is inherited by children and would
> need to explicitly be cleared by each thread prior to execve(); this is
> one of the reasons why I proposed /proc/pid/oom_adj_child a few months
> ago, but it wasn't well received.
>

I don't want to annoy you if others guys don't have any complain.
If it has a problem in future, at that time we could discuss further
in detail with
real example.
I hope we don't received any complain report. :)

Thanks for good discussion, David.

--
Kind regards,
Minchan Kim

2010-02-17 09:23:44

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Wed, 17 Feb 2010, Minchan Kim wrote:

> >> Okay. I can think it of slight penalization in this patch.
> >> But in current OOM logic, we try to kill child instead of forkbomb
> >> itself. My concern was that.
> >
> > We still do with my rewrite, that is handled in oom_kill_process().  The
> > forkbomb penalization takes place in badness().
>
>
> I thought this patch is closely related to [patch 2/7].
> I can move this discussion to [patch 2/7] if you want.
> Another guys already pointed out why we care child.
>

We have _always_ tried to kill a child of the selected task first if it
has a seperate address space, patch 2 doesn't change that. It simply
tries to kill the child with the highest badness() score.

> I said this scenario is BUGGY forkbomb process. It will fork + exec continuously
> if it isn't killed. How does user intervene to fix the system?
> System was almost hang due to unresponsive.
>

The user would need to kill the parent if it should be killed. The
unresponsiveness in this example, however, is not a question of the oom
killer but rather the scheduler to provide interactivity to the user in
forkbomb scenarios. The oom killer should not create a policy that
unfairly biases tasks that fork a large number of tasks, however, to
provide interactivity since that task may be a vital system resource.

> For extreme example,
> User is writing some important document by OpenOffice and
> he decided to execute hackbench 1000000 process 1000000.
>
> Could user save his important office data without halt if we kill
> child continuously?
> I think this scenario can be happened enough if the user didn't know
> parameter of hackbench.
>

So what exactly are you proposing we do in the oom killer to distinguish
between a user's mistake and a vital system resource? I'm personally much
more concerned with protecting system daemons that provide a service under
heavyload than protecting against forkbombs in the oom killer.

2010-02-17 13:08:27

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 4/7 -mm] oom: badness heuristic rewrite

On Wed, 2010-02-17 at 01:23 -0800, David Rientjes wrote:
> On Wed, 17 Feb 2010, Minchan Kim wrote:
>
> > >> Okay. I can think it of slight penalization in this patch.
> > >> But in current OOM logic, we try to kill child instead of forkbomb
> > >> itself. My concern was that.
> > >
> > > We still do with my rewrite, that is handled in oom_kill_process(). The
> > > forkbomb penalization takes place in badness().
> >
> >
> > I thought this patch is closely related to [patch 2/7].
> > I can move this discussion to [patch 2/7] if you want.
> > Another guys already pointed out why we care child.
> >
>
> We have _always_ tried to kill a child of the selected task first if it
> has a seperate address space, patch 2 doesn't change that. It simply
> tries to kill the child with the highest badness() score.

So I mentioned following as.

"Of course, It's not a part of your patch[2/7] which is good.
It has been in there during long time. I hope we could solve that in
this chance."

>
> > I said this scenario is BUGGY forkbomb process. It will fork + exec continuously
> > if it isn't killed. How does user intervene to fix the system?
> > System was almost hang due to unresponsive.
> >
>
> The user would need to kill the parent if it should be killed. The
> unresponsiveness in this example, however, is not a question of the oom
> killer but rather the scheduler to provide interactivity to the user in
> forkbomb scenarios. The oom killer should not create a policy that
> unfairly biases tasks that fork a large number of tasks, however, to
> provide interactivity since that task may be a vital system resource.

As you said, scheduler(or something) can do it with much graceful than
OOM killer. I agreed that.

You wrote "Forkbomb detector" in your patch description. When I saw
that, I thought we need more things to complete forkbomb detection. So I
just suggested my humble idea to fix it in this chance.

>
> > For extreme example,
> > User is writing some important document by OpenOffice and
> > he decided to execute hackbench 1000000 process 1000000.
> >
> > Could user save his important office data without halt if we kill
> > child continuously?
> > I think this scenario can be happened enough if the user didn't know
> > parameter of hackbench.
> >
>
> So what exactly are you proposing we do in the oom killer to distinguish
> between a user's mistake and a vital system resource? I'm personally much
> more concerned with protecting system daemons that provide a service under
> heavyload than protecting against forkbombs in the oom killer.

I don't opposed that. As I said, I just wanted for OOM killer to be more
smart to catch user's mistake. If I understand your opinion,
You said, it's not role of OOM killer but scheduler.

Okay.


--
Kind regards,
Minchan Kim