2010-11-14 05:07:37

by KOSAKI Motohiro

Subject: [PATCH] Revert oom rewrite series

Linus,

Please apply this. This patch reverts the oom changes merged since v2.6.35.

Briefly: "oom: badness heuristic rewrite" was merged by mistake. It never
passed our design or code review, and multiple bug reports have popped up
since. I believe every patch should pass a use-case review and a code
review :-/

The problem is that DavidR's patches don't reflect real-world use cases at
all and break them. He can argue that userland is wrong, but such an excuse
doesn't solve real-world issues; it makes no sense.

I hope every developer keeps development honest. Googlers are NOT an
exception.


David, at least the rss-based oom score did pass our design review. So if
you resubmit that part, we will ack it; please remember that. Also, I can
accept the oom_score_adj feature if you remove the incompatibility issue.
OK?

Linus, if you want to check the patch, please use the following:
% git diff a63d83f427fbce97a6cea0db2e64b0eb8435cd10^ mm/oom_kill.c include/linux/oom.h fs/proc/base.c


Thanks.

--------------------------------------------------------------------------
Subject: [PATCH] Revert oom rewrite series

This reverts the following commits. They broke an ABI and generated
multiple end-user complaints.

9c28ab662a8e3d19d07077ac0a8931c015e8afec Revert "oom: badness heuristic rewrite"
74cd8c6cb3e093c4d67ac3eb3581e246e4981dad Revert "oom: deprecate oom_adj tunable"
79a0bd5796e754c4b4e22071c4edddef3517d010 Revert "memcg: use find_lock_task_mm() in memory cgroups oom"
a465ef80c2a9fe73c85029fcea5c68ffee8dbb69 Revert "oom: always return a badness score of non-zero for eligible tas
516fcbb0c45d943df1b739d3be3d417aee2275f3 Revert "oom: filter unkillable tasks from tasklist dump"
b1c98f95a7954c450dadd809280f86863ea9d05d Revert "oom: add per-mm oom disable count"
fd79f3f47c82a0af5288afe7556905dd171bfc43 Revert "oom: avoid killing a task if a thread sharing its mm cannot be
2d72175528870dcef577db4a2a0b49d819c6eaff Revert "oom: kill all threads sharing oom killed task's mm"
be212960618ddcdb9526ce2cb73fd081fd3e90ea Revert "oom: rewrite error handling for oom_adj and oom_score_adj tunab
1b17c41599c594c7d11ef415a92d47c205fe89ea Revert "oom: fix locking for oom_adj and oom_score_adj"

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
Documentation/feature-removal-schedule.txt | 25 ---
Documentation/filesystems/proc.txt | 97 ++++-----
fs/exec.c | 5 -
fs/proc/base.c | 176 ++--------------
include/linux/memcontrol.h | 8 -
include/linux/mm_types.h | 2 -
include/linux/oom.h | 19 +--
include/linux/sched.h | 3 +-
kernel/exit.c | 3 -
kernel/fork.c | 16 +--
mm/memcontrol.c | 28 +---
mm/oom_kill.c | 323 ++++++++++++++--------------
12 files changed, 227 insertions(+), 478 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index d8f36f9..9af16b9 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -166,31 +166,6 @@ Who: Eric Biederman <[email protected]>

---------------------------

-What: /proc/<pid>/oom_adj
-When: August 2012
-Why: /proc/<pid>/oom_adj allows userspace to influence the oom killer's
- badness heuristic used to determine which task to kill when the kernel
- is out of memory.
-
- The badness heuristic has since been rewritten since the introduction of
- this tunable such that its meaning is deprecated. The value was
- implemented as a bitshift on a score generated by the badness()
- function that did not have any precise units of measure. With the
- rewrite, the score is given as a proportion of available memory to the
- task allocating pages, so using a bitshift which grows the score
- exponentially is, thus, impossible to tune with fine granularity.
-
- A much more powerful interface, /proc/<pid>/oom_score_adj, was
- introduced with the oom killer rewrite that allows users to increase or
- decrease the badness() score linearly. This interface will replace
- /proc/<pid>/oom_adj.
-
- A warning will be emitted to the kernel log if an application uses this
- deprecated interface. After it is printed once, future warnings will be
- suppressed until the kernel is rebooted.
-
----------------------------
-
What: remove EXPORT_SYMBOL(kernel_thread)
When: August 2006
Files: arch/*/kernel/*_ksyms.c
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index e73df27..030e3a1 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -33,8 +33,7 @@ Table of Contents
2 Modifying System Parameters

3 Per-Process Parameters
- 3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
- score
+ 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
3.2 /proc/<pid>/oom_score - Display current oom-killer score
3.3 /proc/<pid>/io - Display the IO accounting fields
3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
@@ -1246,64 +1245,42 @@ of the kernel.
CHAPTER 3: PER-PROCESS PARAMETERS
------------------------------------------------------------------------------

-3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score
---------------------------------------------------------------------------------
-
-These file can be used to adjust the badness heuristic used to select which
-process gets killed in out of memory conditions.
-
-The badness heuristic assigns a value to each candidate task ranging from 0
-(never kill) to 1000 (always kill) to determine which process is targeted. The
-units are roughly a proportion along that range of allowed memory the process
-may allocate from based on an estimation of its current memory and swap use.
-For example, if a task is using all allowed memory, its badness score will be
-1000. If it is using half of its allowed memory, its score will be 500.
-
-There is an additional factor included in the badness score: root
-processes are given 3% extra memory over other tasks.
-
-The amount of "allowed" memory depends on the context in which the oom killer
-was called. If it is due to the memory assigned to the allocating task's cpuset
-being exhausted, the allowed memory represents the set of mems assigned to that
-cpuset. If it is due to a mempolicy's node(s) being exhausted, the allowed
-memory represents the set of mempolicy nodes. If it is due to a memory
-limit (or swap limit) being reached, the allowed memory is that configured
-limit. Finally, if it is due to the entire system being out of memory, the
-allowed memory represents all allocatable resources.
-
-The value of /proc/<pid>/oom_score_adj is added to the badness score before it
-is used to determine which task to kill. Acceptable values range from -1000
-(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows userspace to
-polarize the preference for oom killing either by always preferring a certain
-task or completely disabling it. The lowest possible value, -1000, is
-equivalent to disabling oom killing entirely for that task since it will always
-report a badness score of 0.
-
-Consequently, it is very simple for userspace to define the amount of memory to
-consider for each task. Setting a /proc/<pid>/oom_score_adj value of +500, for
-example, is roughly equivalent to allowing the remainder of tasks sharing the
-same system, cpuset, mempolicy, or memory controller resources to use at least
-50% more memory. A value of -500, on the other hand, would be roughly
-equivalent to discounting 50% of the task's allowed memory from being considered
-as scoring against the task.
-
-For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
-be used to tune the badness score. Its acceptable values range from -16
-(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
-(OOM_DISABLE) to disable oom killing entirely for that task. Its value is
-scaled linearly with /proc/<pid>/oom_score_adj.
-
-Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
-other with its scaled value.
-
-NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
-Documentation/feature-removal-schedule.txt.
-
-Caveat: when a parent task is selected, the oom killer will sacrifice any first
-generation children with seperate address spaces instead, if possible. This
-avoids servers and important system daemons from being killed and loses the
-minimal amount of work.
-
+3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
+------------------------------------------------------
+
+This file can be used to adjust the score used to select which processes
+should be killed in an out-of-memory situation. Giving it a high score will
+increase the likelihood of this process being killed by the oom-killer. Valid
+values are in the range -16 to +15, plus the special value -17, which disables
+oom-killing altogether for this process.
+
+The process to be killed in an out-of-memory situation is selected among all others
+based on its badness score. This value equals the original memory size of the process
+and is then updated according to its CPU time (utime + stime) and the
+run time (uptime - start time). The longer it runs the smaller is the score.
+Badness score is divided by the square root of the CPU time and then by
+the double square root of the run time.
+
+Swapped out tasks are killed first. Half of each child's memory size is added to
+the parent's score if they do not share the same memory. Thus forking servers
+are the prime candidates to be killed. Having only one 'hungry' child will make
+parent less preferable than the child.
+
+/proc/<pid>/oom_score shows process' current badness score.
+
+The following heuristics are then applied:
+ * if the task was reniced, its score doubles
+ * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
+ or CAP_SYS_RAWIO) have their score divided by 4
+ * if oom condition happened in one cpuset and checked process does not belong
+ to it, its score is divided by 8
+ * the resulting score is multiplied by two to the power of oom_adj, i.e.
+ points <<= oom_adj when it is positive and
+ points >>= -(oom_adj) otherwise
+
+The task with the highest badness score is then selected and its children
+are killed, process itself will be killed in an OOM situation when it does
+not have children or some of them disabled oom like described above.

3.2 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------
diff --git a/fs/exec.c b/fs/exec.c
index 99d33a1..47986fb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -54,7 +54,6 @@
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
#include <linux/pipe_fs_i.h>
-#include <linux/oom.h>

#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -766,10 +765,6 @@ static int exec_mmap(struct mm_struct *mm)
tsk->mm = mm;
tsk->active_mm = mm;
activate_mm(active_mm, mm);
- if (old_mm && tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
- atomic_dec(&old_mm->oom_disable_count);
- atomic_inc(&tsk->mm->oom_disable_count);
- }
task_unlock(tsk);
arch_pick_mmap_layout(mm);
if (old_mm) {
diff --git a/fs/proc/base.c b/fs/proc/base.c
index f3d02ca..ed7d18e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -63,7 +63,6 @@
#include <linux/namei.h>
#include <linux/mnt_namespace.h>
#include <linux/mm.h>
-#include <linux/swap.h>
#include <linux/rcupdate.h>
#include <linux/kallsyms.h>
#include <linux/stacktrace.h>
@@ -431,11 +430,12 @@ static const struct file_operations proc_lstats_operations = {
static int proc_oom_score(struct task_struct *task, char *buffer)
{
unsigned long points = 0;
+ struct timespec uptime;

+ do_posix_clock_monotonic_gettime(&uptime);
read_lock(&tasklist_lock);
if (pid_alive(task))
- points = oom_badness(task, NULL, NULL,
- totalram_pages + total_swap_pages);
+ points = badness(task, NULL, NULL, uptime.tv_sec);
read_unlock(&tasklist_lock);
return sprintf(buffer, "%lu\n", points);
}
@@ -1025,74 +1025,36 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
memset(buffer, 0, sizeof(buffer));
if (count > sizeof(buffer) - 1)
count = sizeof(buffer) - 1;
- if (copy_from_user(buffer, buf, count)) {
- err = -EFAULT;
- goto out;
- }
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;

err = strict_strtol(strstrip(buffer), 0, &oom_adjust);
if (err)
- goto out;
+ return -EINVAL;
if ((oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX) &&
- oom_adjust != OOM_DISABLE) {
- err = -EINVAL;
- goto out;
- }
+ oom_adjust != OOM_DISABLE)
+ return -EINVAL;

task = get_proc_task(file->f_path.dentry->d_inode);
- if (!task) {
- err = -ESRCH;
- goto out;
- }
-
- task_lock(task);
- if (!task->mm) {
- err = -EINVAL;
- goto err_task_lock;
- }
-
+ if (!task)
+ return -ESRCH;
if (!lock_task_sighand(task, &flags)) {
- err = -ESRCH;
- goto err_task_lock;
+ put_task_struct(task);
+ return -ESRCH;
}

if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
- err = -EACCES;
- goto err_sighand;
- }
-
- if (oom_adjust != task->signal->oom_adj) {
- if (oom_adjust == OOM_DISABLE)
- atomic_inc(&task->mm->oom_disable_count);
- if (task->signal->oom_adj == OOM_DISABLE)
- atomic_dec(&task->mm->oom_disable_count);
+ unlock_task_sighand(task, &flags);
+ put_task_struct(task);
+ return -EACCES;
}

- /*
- * Warn that /proc/pid/oom_adj is deprecated, see
- * Documentation/feature-removal-schedule.txt.
- */
- printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
- "please use /proc/%d/oom_score_adj instead.\n",
- current->comm, task_pid_nr(current),
- task_pid_nr(task), task_pid_nr(task));
task->signal->oom_adj = oom_adjust;
- /*
- * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
- * value is always attainable.
- */
- if (task->signal->oom_adj == OOM_ADJUST_MAX)
- task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
- else
- task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) /
- -OOM_DISABLE;
-err_sighand:
+
unlock_task_sighand(task, &flags);
-err_task_lock:
- task_unlock(task);
put_task_struct(task);
-out:
- return err < 0 ? err : count;
+
+ return count;
}

static const struct file_operations proc_oom_adjust_operations = {
@@ -1101,106 +1063,6 @@ static const struct file_operations proc_oom_adjust_operations = {
.llseek = generic_file_llseek,
};

-static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
- size_t count, loff_t *ppos)
-{
- struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
- char buffer[PROC_NUMBUF];
- int oom_score_adj = OOM_SCORE_ADJ_MIN;
- unsigned long flags;
- size_t len;
-
- if (!task)
- return -ESRCH;
- if (lock_task_sighand(task, &flags)) {
- oom_score_adj = task->signal->oom_score_adj;
- unlock_task_sighand(task, &flags);
- }
- put_task_struct(task);
- len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj);
- return simple_read_from_buffer(buf, count, ppos, buffer, len);
-}
-
-static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
- size_t count, loff_t *ppos)
-{
- struct task_struct *task;
- char buffer[PROC_NUMBUF];
- unsigned long flags;
- long oom_score_adj;
- int err;
-
- memset(buffer, 0, sizeof(buffer));
- if (count > sizeof(buffer) - 1)
- count = sizeof(buffer) - 1;
- if (copy_from_user(buffer, buf, count)) {
- err = -EFAULT;
- goto out;
- }
-
- err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
- if (err)
- goto out;
- if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
- oom_score_adj > OOM_SCORE_ADJ_MAX) {
- err = -EINVAL;
- goto out;
- }
-
- task = get_proc_task(file->f_path.dentry->d_inode);
- if (!task) {
- err = -ESRCH;
- goto out;
- }
-
- task_lock(task);
- if (!task->mm) {
- err = -EINVAL;
- goto err_task_lock;
- }
-
- if (!lock_task_sighand(task, &flags)) {
- err = -ESRCH;
- goto err_task_lock;
- }
-
- if (oom_score_adj < task->signal->oom_score_adj &&
- !capable(CAP_SYS_RESOURCE)) {
- err = -EACCES;
- goto err_sighand;
- }
-
- if (oom_score_adj != task->signal->oom_score_adj) {
- if (oom_score_adj == OOM_SCORE_ADJ_MIN)
- atomic_inc(&task->mm->oom_disable_count);
- if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
- atomic_dec(&task->mm->oom_disable_count);
- }
- task->signal->oom_score_adj = oom_score_adj;
- /*
- * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
- * always attainable.
- */
- if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
- task->signal->oom_adj = OOM_DISABLE;
- else
- task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) /
- OOM_SCORE_ADJ_MAX;
-err_sighand:
- unlock_task_sighand(task, &flags);
-err_task_lock:
- task_unlock(task);
- put_task_struct(task);
-out:
- return err < 0 ? err : count;
-}
-
-static const struct file_operations proc_oom_score_adj_operations = {
- .read = oom_score_adj_read,
- .write = oom_score_adj_write,
- .llseek = default_llseek,
-};
-
#ifdef CONFIG_AUDITSYSCALL
#define TMPBUFLEN 21
static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
@@ -2779,7 +2641,6 @@ static const struct pid_entry tgid_base_stuff[] = {
#endif
INF("oom_score", S_IRUGO, proc_oom_score),
REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
- REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
#ifdef CONFIG_AUDITSYSCALL
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
@@ -3115,7 +2976,6 @@ static const struct pid_entry tid_base_stuff[] = {
#endif
INF("oom_score", S_IRUGO, proc_oom_score),
REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
- REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
#ifdef CONFIG_AUDITSYSCALL
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUSR, proc_sessionid_operations),
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 159a076..b13fc2a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -124,8 +124,6 @@ static inline bool mem_cgroup_disabled(void)
void mem_cgroup_update_file_mapped(struct page *page, int val);
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask);
-u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
-
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

@@ -305,12 +303,6 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
return 0;
}

-static inline
-u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
-{
- return 0;
-}
-
#endif /* CONFIG_CGROUP_MEM_CONT */

#endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bb7288a..cb57d65 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -310,8 +310,6 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
- /* How many tasks sharing this mm are OOM_DISABLE */
- atomic_t oom_disable_count;
};

/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5e3aa83..40e5e3a 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -1,27 +1,14 @@
#ifndef __INCLUDE_LINUX_OOM_H
#define __INCLUDE_LINUX_OOM_H

-/*
- * /proc/<pid>/oom_adj is deprecated, see
- * Documentation/feature-removal-schedule.txt.
- *
- * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
- */
+/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
#define OOM_DISABLE (-17)
/* inclusive */
#define OOM_ADJUST_MIN (-16)
#define OOM_ADJUST_MAX 15

-/*
- * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
- * pid.
- */
-#define OOM_SCORE_ADJ_MIN (-1000)
-#define OOM_SCORE_ADJ_MAX 1000
-
#ifdef __KERNEL__

-#include <linux/sched.h>
#include <linux/types.h>
#include <linux/nodemask.h>

@@ -40,8 +27,6 @@ enum oom_constraint {
CONSTRAINT_MEMCG,
};

-extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long totalpages);
extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

@@ -66,8 +51,6 @@ static inline void oom_killer_enable(void)
extern unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
const nodemask_t *nodemask, unsigned long uptime);

-extern struct task_struct *find_lock_task_mm(struct task_struct *p);
-
/* sysctls */
extern int sysctl_oom_dump_tasks;
extern int sysctl_oom_kill_allocating_task;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d0036e5..a35acb6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -624,8 +624,7 @@ struct signal_struct {
struct tty_audit_buf *tty_audit_buf;
#endif

- int oom_adj; /* OOM kill score adjustment (bit shift) */
- int oom_score_adj; /* OOM kill score adjustment */
+ int oom_adj; /* OOM kill score adjustment (bit shift) */

struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
diff --git a/kernel/exit.c b/kernel/exit.c
index 21aa7b3..c806406 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -50,7 +50,6 @@
#include <linux/perf_event.h>
#include <trace/events/sched.h>
#include <linux/hw_breakpoint.h>
-#include <linux/oom.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -696,8 +695,6 @@ static void exit_mm(struct task_struct * tsk)
enter_lazy_tlb(mm, current);
/* We don't want this task to be frozen prematurely */
clear_freeze_flag(tsk);
- if (tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
- atomic_dec(&mm->oom_disable_count);
task_unlock(tsk);
mm_update_next_owner(mm);
mmput(mm);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..cca5e8b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -65,7 +65,6 @@
#include <linux/perf_event.h>
#include <linux/posix-timers.h>
#include <linux/user-return-notifier.h>
-#include <linux/oom.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -489,7 +488,6 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
mm->cached_hole_size = ~0UL;
mm_init_aio(mm);
mm_init_owner(mm, p);
- atomic_set(&mm->oom_disable_count, 0);

if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
@@ -743,8 +741,6 @@ good_mm:
/* Initializing for Swap token stuff */
mm->token_priority = 0;
mm->last_interval = 0;
- if (tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
- atomic_inc(&mm->oom_disable_count);

tsk->mm = mm;
tsk->active_mm = mm;
@@ -906,7 +902,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
tty_audit_fork(sig);

sig->oom_adj = current->signal->oom_adj;
- sig->oom_score_adj = current->signal->oom_score_adj;

mutex_init(&sig->cred_guard_mutex);

@@ -1305,13 +1300,8 @@ bad_fork_cleanup_io:
bad_fork_cleanup_namespaces:
exit_task_namespaces(p);
bad_fork_cleanup_mm:
- if (p->mm) {
- task_lock(p);
- if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
- atomic_dec(&p->mm->oom_disable_count);
- task_unlock(p);
+ if (p->mm)
mmput(p->mm);
- }
bad_fork_cleanup_signal:
if (!(clone_flags & CLONE_THREAD))
free_signal_struct(p->signal);
@@ -1704,10 +1694,6 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
active_mm = current->active_mm;
current->mm = new_mm;
current->active_mm = new_mm;
- if (current->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
- atomic_dec(&mm->oom_disable_count);
- atomic_inc(&new_mm->oom_disable_count);
- }
activate_mm(active_mm, new_mm);
new_mm = mm;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9a99cfa..c628370 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -47,7 +47,6 @@
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
-#include <linux/oom.h>
#include "internal.h"

#include <asm/uaccess.h>
@@ -917,13 +916,10 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
{
int ret;
struct mem_cgroup *curr = NULL;
- struct task_struct *p;

- p = find_lock_task_mm(task);
- if (!p)
- return 0;
- curr = try_get_mem_cgroup_from_mm(p->mm);
- task_unlock(p);
+ task_lock(task);
+ curr = try_get_mem_cgroup_from_mm(task->mm);
+ task_unlock(task);
if (!curr)
return 0;
/*
@@ -1297,24 +1293,6 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
}

/*
- * Return the memory (and swap, if configured) limit for a memcg.
- */
-u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
-{
- u64 limit;
- u64 memsw;
-
- limit = res_counter_read_u64(&memcg->res, RES_LIMIT) +
- total_swap_pages;
- memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
- /*
- * If memsw is finite and limits the amount of swap space available
- * to this memcg, return that limit.
- */
- return min(limit, memsw);
-}
-
-/*
* Visit the first child (need not be the first child as per the ordering
* of the cgroup list, since we track last_scanned_child) of @mem and use
* that to reclaim free pages from.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7dcca55..f251ddb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -4,8 +4,6 @@
* Copyright (C) 1998,2000 Rik van Riel
* Thanks go out to Claus Fischer for some serious inspiration and
* for goading me into coding this file...
- * Copyright (C) 2010 Google, Inc.
- * Rewritten by David Rientjes
*
* The routines in this file are used to kill a process when
* we're seriously out of memory. This gets called from __alloc_pages()
@@ -36,6 +34,7 @@ int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;
static DEFINE_SPINLOCK(zone_scan_lock);
+/* #define DEBUG */

#ifdef CONFIG_NUMA
/**
@@ -106,7 +105,7 @@ static void boost_dying_task_prio(struct task_struct *p,
* pointer. Return p, or any of its subthreads with a valid ->mm, with
* task_lock() held.
*/
-struct task_struct *find_lock_task_mm(struct task_struct *p)
+static struct task_struct *find_lock_task_mm(struct task_struct *p)
{
struct task_struct *t = p;

@@ -121,8 +120,8 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
}

/* return true if the task is not adequate as candidate victim task. */
-static bool oom_unkillable_task(struct task_struct *p,
- const struct mem_cgroup *mem, const nodemask_t *nodemask)
+static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
+ const nodemask_t *nodemask)
{
if (is_global_init(p))
return true;
@@ -141,82 +140,137 @@ static bool oom_unkillable_task(struct task_struct *p,
}

/**
- * oom_badness - heuristic function to determine which candidate task to kill
+ * badness - calculate a numeric value for how bad this task has been
* @p: task struct of which task we should calculate
- * @totalpages: total present RAM allowed for page allocation
+ * @uptime: current uptime in seconds
*
- * The heuristic for determining which task to kill is made to be as simple and
- * predictable as possible. The goal is to return the highest value for the
- * task consuming the most memory to avoid subsequent oom failures.
+ * The formula used is relatively simple and documented inline in the
+ * function. The main rationale is that we want to select a good task
+ * to kill when we run out of memory.
+ *
+ * Good in this context means that:
+ * 1) we lose the minimum amount of work done
+ * 2) we recover a large amount of memory
+ * 3) we don't kill anything innocent of eating tons of memory
+ * 4) we want to kill the minimum amount of processes (one)
+ * 5) we try to kill the process the user expects us to kill, this
+ * algorithm has been meticulously tuned to meet the principle
+ * of least surprise ... (be careful when you change it)
*/
-unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long totalpages)
+unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
+ const nodemask_t *nodemask, unsigned long uptime)
{
- int points;
+ unsigned long points, cpu_time, run_time;
+ struct task_struct *child;
+ struct task_struct *c, *t;
+ int oom_adj = p->signal->oom_adj;
+ struct task_cputime task_time;
+ unsigned long utime;
+ unsigned long stime;

if (oom_unkillable_task(p, mem, nodemask))
return 0;
+ if (oom_adj == OOM_DISABLE)
+ return 0;

p = find_lock_task_mm(p);
if (!p)
return 0;

/*
- * Shortcut check for a thread sharing p->mm that is OOM_SCORE_ADJ_MIN
- * so the entire heuristic doesn't need to be executed for something
- * that cannot be killed.
+ * The memory size of the process is the basis for the badness.
*/
- if (atomic_read(&p->mm->oom_disable_count)) {
- task_unlock(p);
- return 0;
- }
+ points = p->mm->total_vm;
+ task_unlock(p);

/*
- * When the PF_OOM_ORIGIN bit is set, it indicates the task should have
- * priority for oom killing.
+ * swapoff can easily use up all memory, so kill those first.
*/
- if (p->flags & PF_OOM_ORIGIN) {
- task_unlock(p);
- return 1000;
- }
+ if (p->flags & PF_OOM_ORIGIN)
+ return ULONG_MAX;

/*
- * The memory controller may have a limit of 0 bytes, so avoid a divide
- * by zero, if necessary.
+ * Processes which fork a lot of child processes are likely
+ * a good choice. We add half the vmsize of the children if they
+ * have an own mm. This prevents forking servers to flood the
+ * machine with an endless amount of children. In case a single
+ * child is eating the vast majority of memory, adding only half
+ * to the parents will make the child our kill candidate of choice.
*/
- if (!totalpages)
- totalpages = 1;
+ t = p;
+ do {
+ list_for_each_entry(c, &t->children, sibling) {
+ child = find_lock_task_mm(c);
+ if (child) {
+ if (child->mm != p->mm)
+ points += child->mm->total_vm/2 + 1;
+ task_unlock(child);
+ }
+ }
+ } while_each_thread(p, t);

/*
- * The baseline for the badness score is the proportion of RAM that each
- * task's rss and swap space use.
+ * CPU time is in tens of seconds and run time is in thousands
+ * of seconds. There is no particular reason for this other than
+ * that it turned out to work very well in practice.
*/
- points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 /
- totalpages;
- task_unlock(p);
+ thread_group_cputime(p, &task_time);
+ utime = cputime_to_jiffies(task_time.utime);
+ stime = cputime_to_jiffies(task_time.stime);
+ cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
+
+
+ if (uptime >= p->start_time.tv_sec)
+ run_time = (uptime - p->start_time.tv_sec) >> 10;
+ else
+ run_time = 0;
+
+ if (cpu_time)
+ points /= int_sqrt(cpu_time);
+ if (run_time)
+ points /= int_sqrt(int_sqrt(run_time));

/*
- * Root processes get 3% bonus, just like the __vm_enough_memory()
- * implementation used by LSMs.
+ * Niced processes are most likely less important, so double
+ * their badness points.
*/
- if (has_capability_noaudit(p, CAP_SYS_ADMIN))
- points -= 30;
+ if (task_nice(p) > 0)
+ points *= 2;

/*
- * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
- * either completely disable oom killing or always prefer a certain
- * task.
+ * Superuser processes are usually more important, so we make it
+ * less likely that we kill those.
*/
- points += p->signal->oom_score_adj;
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+ has_capability_noaudit(p, CAP_SYS_RESOURCE))
+ points /= 4;

/*
- * Never return 0 for an eligible task that may be killed since it's
- * possible that no single user task uses more than 0.1% of memory and
- * no single admin tasks uses more than 3.0%.
+ * We don't want to kill a process with direct hardware access.
+ * Not only could that mess up the hardware, but usually users
+ * tend to only have this flag set on applications they think
+ * of as important.
*/
- if (points <= 0)
- return 1;
- return (points < 1000) ? points : 1000;
+ if (has_capability_noaudit(p, CAP_SYS_RAWIO))
+ points /= 4;
+
+ /*
+ * Adjust the score by oom_adj.
+ */
+ if (oom_adj) {
+ if (oom_adj > 0) {
+ if (!points)
+ points = 1;
+ points <<= oom_adj;
+ } else
+ points >>= -(oom_adj);
+ }
+
+#ifdef DEBUG
+ printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
+ p->pid, p->comm, points);
+#endif
+ return points;
}

/*
@@ -224,20 +278,12 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
*/
#ifdef CONFIG_NUMA
static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask, nodemask_t *nodemask,
- unsigned long *totalpages)
+ gfp_t gfp_mask, nodemask_t *nodemask)
{
struct zone *zone;
struct zoneref *z;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
- bool cpuset_limited = false;
- int nid;
-
- /* Default to all available memory */
- *totalpages = totalram_pages + total_swap_pages;

- if (!zonelist)
- return CONSTRAINT_NONE;
/*
* Reach here only when __GFP_NOFAIL is used. So, we should avoid
* to kill current.We have to random task kill in this case.
@@ -247,37 +293,26 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
return CONSTRAINT_NONE;

/*
- * This is not a __GFP_THISNODE allocation, so a truncated nodemask in
- * the page allocator means a mempolicy is in effect. Cpuset policy
- * is enforced in get_page_from_freelist().
+ * The nodemask here is a nodemask passed to alloc_pages(). Now,
+ * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
+ * feature. mempolicy is an only user of nodemask here.
+ * check mempolicy's nodemask contains all N_HIGH_MEMORY
*/
- if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
- *totalpages = total_swap_pages;
- for_each_node_mask(nid, *nodemask)
- *totalpages += node_spanned_pages(nid);
+ if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
return CONSTRAINT_MEMORY_POLICY;
- }

/* Check this allocation failure is caused by cpuset's wall function */
for_each_zone_zonelist_nodemask(zone, z, zonelist,
high_zoneidx, nodemask)
if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
- cpuset_limited = true;
+ return CONSTRAINT_CPUSET;

- if (cpuset_limited) {
- *totalpages = total_swap_pages;
- for_each_node_mask(nid, cpuset_current_mems_allowed)
- *totalpages += node_spanned_pages(nid);
- return CONSTRAINT_CPUSET;
- }
return CONSTRAINT_NONE;
}
#else
static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask, nodemask_t *nodemask,
- unsigned long *totalpages)
+ gfp_t gfp_mask, nodemask_t *nodemask)
{
- *totalpages = totalram_pages + total_swap_pages;
return CONSTRAINT_NONE;
}
#endif
@@ -288,16 +323,17 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
*
* (not docbooked, we don't want this one cluttering up the manual)
*/
-static struct task_struct *select_bad_process(unsigned int *ppoints,
- unsigned long totalpages, struct mem_cgroup *mem,
- const nodemask_t *nodemask)
+static struct task_struct *select_bad_process(unsigned long *ppoints,
+ struct mem_cgroup *mem, const nodemask_t *nodemask)
{
struct task_struct *p;
struct task_struct *chosen = NULL;
+ struct timespec uptime;
*ppoints = 0;

+ do_posix_clock_monotonic_gettime(&uptime);
for_each_process(p) {
- unsigned int points;
+ unsigned long points;

if (oom_unkillable_task(p, mem, nodemask))
continue;
@@ -329,11 +365,11 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
return ERR_PTR(-1UL);

chosen = p;
- *ppoints = 1000;
+ *ppoints = ULONG_MAX;
}

- points = oom_badness(p, mem, nodemask, totalpages);
- if (points > *ppoints) {
+ points = badness(p, mem, nodemask, uptime.tv_sec);
+ if (points > *ppoints || !chosen) {
chosen = p;
*ppoints = points;
}
@@ -345,24 +381,27 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
/**
* dump_tasks - dump current memory state of all system tasks
* @mem: current's memory controller, if constrained
- * @nodemask: nodemask passed to page allocator for mempolicy ooms
*
- * Dumps the current memory state of all eligible tasks. Tasks not in the same
- * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes
- * are not shown.
+ * Dumps the current memory state of all system tasks, excluding kernel threads.
* State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
- * value, oom_score_adj value, and name.
+ * score, and name.
+ *
+ * If the actual is non-NULL, only tasks that are a member of the mem_cgroup are
+ * shown.
*
* Call with tasklist_lock read-locked.
*/
-static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
+static void dump_tasks(const struct mem_cgroup *mem)
{
struct task_struct *p;
struct task_struct *task;

- pr_info("[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name\n");
+ printk(KERN_INFO "[ pid ] uid tgid total_vm rss cpu oom_adj "
+ "name\n");
for_each_process(p) {
- if (oom_unkillable_task(p, mem, nodemask))
+ if (p->flags & PF_KTHREAD)
+ continue;
+ if (mem && !task_in_mem_cgroup(p, mem))
continue;

task = find_lock_task_mm(p);
@@ -375,69 +414,43 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
continue;
}

- pr_info("[%5d] %5d %5d %8lu %8lu %3u %3d %5d %s\n",
+ pr_info("[%5d] %5d %5d %8lu %8lu %3u %3d %s\n",
task->pid, task_uid(task), task->tgid,
task->mm->total_vm, get_mm_rss(task->mm),
- task_cpu(task), task->signal->oom_adj,
- task->signal->oom_score_adj, task->comm);
+ task_cpu(task), task->signal->oom_adj, task->comm);
task_unlock(task);
}
}

static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
- struct mem_cgroup *mem, const nodemask_t *nodemask)
+ struct mem_cgroup *mem)
{
task_lock(current);
pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
- "oom_adj=%d, oom_score_adj=%d\n",
- current->comm, gfp_mask, order, current->signal->oom_adj,
- current->signal->oom_score_adj);
+ "oom_adj=%d\n",
+ current->comm, gfp_mask, order, current->signal->oom_adj);
cpuset_print_task_mems_allowed(current);
task_unlock(current);
dump_stack();
mem_cgroup_print_oom_info(mem, p);
show_mem();
if (sysctl_oom_dump_tasks)
- dump_tasks(mem, nodemask);
+ dump_tasks(mem);
}

#define K(x) ((x) << (PAGE_SHIFT-10))
static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
{
- struct task_struct *q;
- struct mm_struct *mm;
-
p = find_lock_task_mm(p);
if (!p)
return 1;

- /* mm cannot be safely dereferenced after task_unlock(p) */
- mm = p->mm;
-
pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
task_pid_nr(p), p->comm, K(p->mm->total_vm),
K(get_mm_counter(p->mm, MM_ANONPAGES)),
K(get_mm_counter(p->mm, MM_FILEPAGES)));
task_unlock(p);

- /*
- * Kill all processes sharing p->mm in other thread groups, if any.
- * They don't get access to memory reserves or a higher scheduler
- * priority, though, to avoid depletion of all memory or task
- * starvation. This prevents mm->mmap_sem livelock when an oom killed
- * task cannot exit because it requires the semaphore and its contended
- * by another thread trying to allocate memory itself. That thread will
- * now get access to memory reserves since it has a pending fatal
- * signal.
- */
- for_each_process(q)
- if (q->mm == mm && !same_thread_group(q, p)) {
- task_lock(q); /* Protect ->comm from prctl() */
- pr_err("Kill process %d (%s) sharing same memory\n",
- task_pid_nr(q), q->comm);
- task_unlock(q);
- force_sig(SIGKILL, q);
- }

set_tsk_thread_flag(p, TIF_MEMDIE);
force_sig(SIGKILL, p);
@@ -454,17 +467,17 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
#undef K

static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
- unsigned int points, unsigned long totalpages,
- struct mem_cgroup *mem, nodemask_t *nodemask,
- const char *message)
+ unsigned long points, struct mem_cgroup *mem,
+ nodemask_t *nodemask, const char *message)
{
struct task_struct *victim = p;
struct task_struct *child;
struct task_struct *t = p;
- unsigned int victim_points = 0;
+ unsigned long victim_points = 0;
+ struct timespec uptime;

if (printk_ratelimit())
- dump_header(p, gfp_mask, order, mem, nodemask);
+ dump_header(p, gfp_mask, order, mem);

/*
* If the task is already exiting, don't alarm the sysadmin or kill
@@ -477,7 +490,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
}

task_lock(p);
- pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+ pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
message, task_pid_nr(p), p->comm, points);
task_unlock(p);

@@ -487,15 +500,14 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
* parent. This attempts to lose the minimal amount of work done while
* still freeing memory.
*/
+ do_posix_clock_monotonic_gettime(&uptime);
do {
list_for_each_entry(child, &t->children, sibling) {
- unsigned int child_points;
+ unsigned long child_points;

- /*
- * oom_badness() returns 0 if the thread is unkillable
- */
- child_points = oom_badness(child, mem, nodemask,
- totalpages);
+ /* badness() returns 0 if the thread is unkillable */
+ child_points = badness(child, mem, nodemask,
+ uptime.tv_sec);
if (child_points > victim_points) {
victim = child;
victim_points = child_points;
@@ -510,7 +522,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
- int order, const nodemask_t *nodemask)
+ int order)
{
if (likely(!sysctl_panic_on_oom))
return;
@@ -524,7 +536,7 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
return;
}
read_lock(&tasklist_lock);
- dump_header(NULL, gfp_mask, order, NULL, nodemask);
+ dump_header(NULL, gfp_mask, order, NULL);
read_unlock(&tasklist_lock);
panic("Out of memory: %s panic_on_oom is enabled\n",
sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
@@ -533,19 +545,17 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
{
- unsigned long limit;
- unsigned int points = 0;
+ unsigned long points = 0;
struct task_struct *p;

- check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0, NULL);
- limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT;
+ check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0);
read_lock(&tasklist_lock);
retry:
- p = select_bad_process(&points, limit, mem, NULL);
+ p = select_bad_process(&points, mem, NULL);
if (!p || PTR_ERR(p) == -1UL)
goto out;

- if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, NULL,
+ if (oom_kill_process(p, gfp_mask, 0, points, mem, NULL,
"Memory cgroup out of memory"))
goto retry;
out:
@@ -669,11 +679,9 @@ static void clear_system_oom(void)
void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
int order, nodemask_t *nodemask)
{
- const nodemask_t *mpol_mask;
struct task_struct *p;
- unsigned long totalpages;
unsigned long freed = 0;
- unsigned int points;
+ unsigned long points;
enum oom_constraint constraint = CONSTRAINT_NONE;
int killed = 0;

@@ -697,40 +705,41 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
- constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
- &totalpages);
- mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
- check_panic_on_oom(constraint, gfp_mask, order, mpol_mask);
+ if (zonelist)
+ constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+ check_panic_on_oom(constraint, gfp_mask, order);

read_lock(&tasklist_lock);
if (sysctl_oom_kill_allocating_task &&
!oom_unkillable_task(current, NULL, nodemask) &&
- current->mm && !atomic_read(&current->mm->oom_disable_count)) {
+ (current->signal->oom_adj != OOM_DISABLE)) {
/*
* oom_kill_process() needs tasklist_lock held. If it returns
* non-zero, current could not be killed so we must fallback to
* the tasklist scan.
*/
- if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
- NULL, nodemask,
+ if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
+ nodemask,
"Out of memory (oom_kill_allocating_task)"))
goto out;
}

retry:
- p = select_bad_process(&points, totalpages, NULL, mpol_mask);
+ p = select_bad_process(&points, NULL,
+ constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
+ NULL);
if (PTR_ERR(p) == -1UL)
goto out;

/* Found nothing?!?! Either we hang forever, or we panic. */
if (!p) {
- dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
+ dump_header(NULL, gfp_mask, order, NULL);
read_unlock(&tasklist_lock);
panic("Out of memory and no killable processes...\n");
}

- if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
- nodemask, "Out of memory"))
+ if (oom_kill_process(p, gfp_mask, order, points, NULL, nodemask,
+ "Out of memory"))
goto retry;
killed = 1;
out:
--
1.6.5.2



2010-11-14 19:33:30

by Linus Torvalds

Subject: Re: [PATCH] Revert oom rewrite series

2010/11/13 KOSAKI Motohiro <[email protected]>:
>
> Please apply this. This patch reverts the oom changes merged since v2.6.35.

I'm not getting involved in this whole flame-war. You need to convince
Andrew, who has been the person everything went through.

Linus

2010-11-14 21:58:59

by David Rientjes

Subject: Re: [PATCH] Revert oom rewrite series

On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:

> Linus,
>
> Please apply this. This patch reverts the oom changes merged since v2.6.35.
>
> Briefly: "oom: badness heuristic rewrite" was merged by mistake. It never
> passed our design or code review, and multiple bug reports have popped up
> since. I believe every patch should pass a use-case review and a code
> review :-/
>

That's inaccurate: there haven't been multiple bug reports popping up
since the rewrite; in fact, there hasn't been a single bug report.

There have been two changes to the oom killer since the rewrite:

- we now kill all threads sharing the oom killed task that share the ->mm
since we can't free any memory without them exiting as well, and

- we count threads that are immune from oom kill attached to an ->mm so
we can avoid needlessly killing tasks that aren't immune themselves but
have other threads sharing the ->mm that are.

Both of those changes were needed in the old oom killer as well, they have
nothing to do with the rewrite.

Also, stating that the new heuristic doesn't handle CAP_SYS_RESOURCE
appropriately isn't a bug report; it's the desired behavior. I eliminated
all of the arbitrary heuristics in the old badness() that we had to remove
internally as well, so that it is as predictable as possible and achieves
the oom killer's sole goal: to kill the most memory-hogging eligible task
so that memory allocations in the current context can succeed.
CAP_SYS_RESOURCE threads have full control over their oom killing priority
via /proc/pid/oom_score_adj and need no special consideration in the
heuristic by default, since that would otherwise raise the probability that
multiple tasks need to be killed when a CAP_SYS_RESOURCE thread uses an
egregious amount of memory.
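
For concreteness, here is a rough stand-alone user-space sketch of the
proportional calculation the rewritten oom_badness() performs (mirroring the
lines removed by the revert at the top of this thread); the machine and task
sizes in main() are made up:

--------------------------------------------------------
#include <stdio.h>

/* rss, swapents and totalpages are all in pages */
static long badness_sketch(long rss, long swapents, long totalpages,
                           int is_root, int oom_score_adj)
{
        long points;

        if (!totalpages)
                totalpages = 1;

        /* baseline: share of allowed memory in use, scaled to 0..1000 */
        points = (rss + swapents) * 1000 / totalpages;

        /* root tasks get a 3% bonus, like __vm_enough_memory() */
        if (is_root)
                points -= 30;

        /* userspace bias from /proc/<pid>/oom_score_adj */
        points += oom_score_adj;

        /* eligible tasks never report 0 */
        if (points <= 0)
                return 1;
        return points < 1000 ? points : 1000;
}

int main(void)
{
        long total = 1048576;   /* pretend 4 GB machine: 1M pages of 4 KB */
        long rss = 262144;      /* task using 1 GB */

        printf("plain task:         %ld\n", badness_sketch(rss, 0, total, 0, 0));
        printf("same task as root:  %ld\n", badness_sketch(rss, 0, total, 1, 0));
        printf("oom_score_adj -250: %ld\n", badness_sketch(rss, 0, total, 0, -250));
        printf("oom_score_adj +500: %ld\n", badness_sketch(rss, 0, total, 0, 500));
        return 0;
}
--------------------------------------------------------

The 0..1000 range here is also the "resolution" that comes up later in the
thread.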

> The problem is that DavidR's patches don't reflect real-world use cases at
> all and break them. He can argue that userland is wrong, but such an excuse
> doesn't solve real-world issues; it makes no sense.
>

As mentioned just a few minutes ago in another thread, there is no
userspace breakage with the rewrite, and you're only complaining here about
the deprecation of /proc/pid/oom_adj over a period of two years. Until
it's removed in 2012 or later, it maps to the linear scale that
oom_score_adj uses rather than its old exponential scale, which was
unusable for prioritization because of (1) the extremely low resolution,
and (2) the arbitrary heuristics that preceded it.
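
For reference, the scaling being discussed is visible in the fs/proc/base.c
hunks of the revert above; the following stand-alone sketch just replays that
arithmetic (an illustration, not code copied from the kernel):

--------------------------------------------------------
#include <stdio.h>

#define OOM_DISABLE             (-17)
#define OOM_ADJUST_MAX          15
#define OOM_SCORE_ADJ_MIN       (-1000)
#define OOM_SCORE_ADJ_MAX       1000

/* writing oom_adj stores a linearly scaled oom_score_adj */
static int oom_adj_to_score_adj(int oom_adj)
{
        if (oom_adj == OOM_ADJUST_MAX)
                return OOM_SCORE_ADJ_MAX;       /* keep the maximum attainable */
        return oom_adj * OOM_SCORE_ADJ_MAX / -OOM_DISABLE;
}

/* writing oom_score_adj scales back onto the old -17..15 range */
static int score_adj_to_oom_adj(int oom_score_adj)
{
        if (oom_score_adj == OOM_SCORE_ADJ_MIN)
                return OOM_DISABLE;             /* keep OOM_DISABLE attainable */
        return oom_score_adj * OOM_ADJUST_MAX / OOM_SCORE_ADJ_MAX;
}

int main(void)
{
        int adj;

        for (adj = OOM_DISABLE; adj <= OOM_ADJUST_MAX; adj++)
                printf("oom_adj %3d -> oom_score_adj %5d\n",
                       adj, oom_adj_to_score_adj(adj));
        printf("oom_score_adj 300 -> oom_adj %d\n", score_adj_to_oom_adj(300));
        return 0;
}
--------------------------------------------------------

Each oom_adj step corresponds to roughly 59 oom_score_adj units, which is the
low resolution mentioned above.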

You've proposed various forms of your revert (this is the fifth one) and
I've responded in a very respectful and technical way each time even
though you have repeatedly called me stupid. Linus is under the
impression that this is some kind of flamewar when in reality it's only a
desperate attempt of yours to start one; this kind of thing just really
bounces off of me on a personal level. I will, however, continue to
remain professional.

2010-11-15 00:54:20

by KOSAKI Motohiro

Subject: Re: [PATCH] Revert oom rewrite series

> 2010/11/13 KOSAKI Motohiro <[email protected]>:
> >
> > Please apply this. This patch reverts the oom changes merged since v2.6.35.
>
> I'm not getting involved in this whole flame-war. You need to convince
> Andrew, who has been the person everything went through.

I wonder why he is so deeply silent. But _I_ strongly don't want to ignore bug
reports and userland complaints. I hope to fix any bug as far as my development
time allows.


2010-11-15 02:18:52

by Andrew Morton

Subject: Re: [PATCH] Revert oom rewrite series

On Mon, 15 Nov 2010 09:54:14 +0900 (JST) KOSAKI Motohiro <[email protected]> wrote:

> > 2010/11/13 KOSAKI Motohiro <[email protected]>:
> > >
> > > Please apply this. This patch reverts the oom changes merged since v2.6.35.
> >
> > I'm not getting involved in this whole flame-war. You need to convince
> > Andrew, who has been the person everything went through.
>
> > I wonder why he is so deeply silent.

Nothing to say, really. Seems each time we're told about a bug or a
regression, David either fixes the bug or points out why it wasn't a
bug or why it wasn't a regression or how it was a deliberate behaviour
change for the better.

I just haven't seen any solid reason to be concerned about the state of
the current oom-killer, sorry.

I'm concerned that you're concerned! A lot. When someone such as
yourself is unhappy with part of MM then I sit up and pay attention.
But after all this time I simply don't understand the technical issues
which you're seeing here.

2010-11-15 04:42:18

by Figo.zhang

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

>Nothing to say, really. Seems each time we're told about a bug or a
>regression, David either fixes the bug or points out why it wasn't a
>bug or why it wasn't a regression or how it was a deliberate behaviour
>change for the better.

>I just haven't seen any solid reason to be concerned about the state of
>the current oom-killer, sorry.

>I'm concerned that you're concerned! A lot. When someone such as
>yourself is unhappy with part of MM then I sit up and pay attention.
>But after all this time I simply don't understand the technical issues
>which you're seeing here.

We are just talking about oom-killer technical issues here.

I have doubts about a rewrite whose author cannot provide any evidence or
experimental results. Why did you do it that way? What is the prominent
change in your new algorithm?

As KOSAKI Motohiro said, "you removed CAP_SYS_RESOURCE condition with
ZERO explanation".

David just said to please use the userspace tunable oom_score_adj for
protection. But may I ask a few questions:


1. What is the innovation in your new algorithm? The old one already offered
the same kind of user tunable with oom_adj.

2. If servers like DB servers or financial servers have hugely important
processes (such as root/hardware-access processes) that need protection, you
make the administrator find out which processes should be protected. You
will drive the financial-server administrator crazy!! and lose so much
money!! ^~^

3. I saw your email on LKML, where you just said:
"I have repeatedly said that the oom killer no longer kills KDE when run
on my desktop in the presence of a memory hogging task that was written
specifically to oom the machine."
http://thread.gmane.org/gmane.linux.kernel.mm/48998


So you only tested your new oom-killer algorithm on your desktop with KDE.
Have you provided details of how you did the test? Can anyone else repeat
the experiment and get the same result as your comment?


As KOSAKI Motohiro said, in the real world we can roughly split workloads
into five or six kinds: embedded, desktop, web server, DB server, HPC, and
finance. Different workloads certainly make a big impact. Have you done
those experiments?


I think that technology should be based on experiment, not on imagination.


Best,
Figo.zhang

2010-11-15 06:57:40

by KOSAKI Motohiro

Subject: Re: [PATCH] Revert oom rewrite series

> On Mon, 15 Nov 2010 09:54:14 +0900 (JST) KOSAKI Motohiro <[email protected]> wrote:
>
> > > 2010/11/13 KOSAKI Motohiro <[email protected]>:
> > > >
> > > > Please apply this. This patch reverts the oom changes merged since v2.6.35.
> > >
> > > I'm not getting involved in this whole flame-war. You need to convince
> > > Andrew, who has been the person everything went through.
> >
> > I wonder why he is so deeply silent.
>
> Nothing to say, really. Seems each time we're told about a bug or a
> regression, David either fixes the bug or points out why it wasn't a
> bug or why it wasn't a regression or how it was a deliberate behaviour
> change for the better.

Of course I deny that. He seems to think the number of emails matters more
than what they say, but that's incorrect and makes no sense. Also, he has to
argue logically; "hey, I think it's not a bug" makes no sense. Such a claim
doesn't solve anything. Userland is still unhappy. I want quick action.

I would like to suggest they join and contribute to a distro kernel
maintenance team. Many community-based distributions welcome developers,
and bugfix work teaches a lot: which use cases are frequently exercised,
which bug reports are frequently raised, etc.

That said, if anyone wants to change a userland ABI, be careful. They have
to investigate userland use cases carefully and avoid breaking them. If
someone thinks "hey, it's no big deal, rewriting userland can solve the
issue", I strongly disagree; they don't understand why forcing all userland
applications to be rewritten is harmful.



> I just haven't seen any solid reason to be concerned about the state of
> the current oom-killer, sorry.

You can't say "I haven't seen"; I always CCed you.


> I'm concerned that you're concerned! A lot. When someone such as
> yourself is unhappy with part of MM then I sit up and pay attention.
> But after all this time I simply don't understand the technical issues
> which you're seeing here.

You should have read the patch descriptions and the e-mails that I sent.


1) About two months ago, Dave Hansen observed a strange OOM issue: he has a
big machine and none of its processes are particularly big, so eventually
every process got oom_score=0 and the oom-killer didn't work.

https://kerneltrap.org/mailarchive/linux-driver-devel/2010/9/9/6886383

DavidR changed the oom score to +1 in that situation.

http://kerneltrap.org/mailarchive/linux-kernel/2010/9/9/4617455

But that is completely bogus. If all processes have score=1, the oom-killer
falls back to being a purely random killer. I anticipated and explained this
problem with his patch half a year ago, but he still hasn't fixed it. (A
rough, made-up arithmetic sketch after point 3 below shows the rounding.)

2) Also, half a year ago I explained that oom_adj is used by multiple
applications and that we can't break them. DavidR didn't fix that either.

3) Also, about four months ago kamezawa-san and I pointed out that his patch
doesn't work with memcg. That hasn't been fixed either.
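
To show what I mean by that fallback in point 1, here is a back-of-the-envelope
sketch (the machine and task sizes are made up) of how the rewritten baseline
rounds every modest task down to the same score:

--------------------------------------------------------
#include <stdio.h>

int main(void)
{
        /* pretend 1 TB machine with 4 KB pages: 2^28 pages */
        long long totalpages = 1LL << 28;
        long long task_mb[] = { 64, 256, 512, 900 };
        int i;

        for (i = 0; i < 4; i++) {
                long long pages = task_mb[i] * 256;             /* MB -> 4 KB pages */
                long long points = pages * 1000 / totalpages;   /* rewritten baseline */

                if (points <= 0)
                        points = 1;     /* "non-zero for eligible tasks" */
                printf("%4lld MB task -> oom score %lld\n", task_mb[i], points);
        }
        return 0;
}
--------------------------------------------------------

Every one of these tasks ends up with score 1, so the choice between them is
effectively arbitrary.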


On the other hand, you can't explain what worth the OOM rewrite has,
because there is none. It is only "powerful"(TM) for Google; it has zero
worth for everyone else. This is a purely technical issue. Bah.



And I just don't understand why some people try to remove or obsolete
oom_adj. It's just eight lines of code and it's used by multiple
applications. There is no reason to break userland at all.
--------------------------------------------------------
        /*
         * Adjust the score by oom_adj.
         */
        if (oom_adj) {
                if (oom_adj > 0) {
                        if (!points)
                                points = 1;
                        points <<= oom_adj;
                } else
                        points >>= -(oom_adj);
        }
--------------------------------------------------------
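
If it helps, the effect of those shifts is easy to see with a tiny stand-alone
program (the starting score is arbitrary):

--------------------------------------------------------
#include <stdio.h>

int main(void)
{
        unsigned long base = 1000;      /* arbitrary badness before adjustment */
        int oom_adj;

        for (oom_adj = -16; oom_adj <= 15; oom_adj += 4) {
                unsigned long points = base;

                if (oom_adj > 0)
                        points <<= oom_adj;     /* each +1 doubles the score */
                else
                        points >>= -oom_adj;    /* each -1 halves it */
                printf("oom_adj %3d -> points %lu\n", oom_adj, points);
        }
        return 0;
}
--------------------------------------------------------

Each unit of oom_adj is a factor of two (and -17, OOM_DISABLE, skips the task
entirely); that exponential scale is what oom_score_adj replaces with a linear
one.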


If you still have questions, please ask me; maybe I can answer all of
them.


2010-11-15 10:34:53

by David Rientjes

Subject: Re: [PATCH] Revert oom rewrite series

On Mon, 15 Nov 2010, KOSAKI Motohiro wrote:

> Of course I deny that. He seems to think the number of emails matters more
> than what they say, but that's incorrect and makes no sense. Also, he has to
> argue logically; "hey, I think it's not a bug" makes no sense. Such a claim
> doesn't solve anything. Userland is still unhappy. I want quick action.
>

If there are pending complaints or bugs that I haven't addressed, please
bring them to my attention. To date, I know of no issues that have been
raised that I have not addressed; you're always free to disagree with my
position, but in the end you may find that, when the kernel moves in a
different direction, you should begin to accept it.

> That said, if anyone wants to change a userland ABI, be careful. They have
> to investigate userland use cases carefully and avoid breaking them. If
> someone thinks "hey, it's no big deal, rewriting userland can solve the
> issue", I strongly disagree; they don't understand why forcing all userland
> applications to be rewritten is harmful.
>

You may remember that the initial version of my rewrite replaced oom_adj
entirely with the new oom_score_adj semantics. Others suggested that it
be seperated into a new tunable and the old tunable deprecated for a
lengthy period of time. I accepted that criticism and understood the
drawbacks of replacing the tunable immediately and followed those
suggestions. I disagree with you that the deprecation of oom_adj for a
period of two years is as dramatic as you imply and I disagree that users
are experiencing problems with the linear scale that it now operates on
versus the old exponential scale.

> 1) About two month ago, Dave hansen observed strange OOM issue because he
> has a big machine and ALL process are not so big. thus, eventually all
> process got oom-score=0 and oom-killer didn't work.
>
> https://kerneltrap.org/mailarchive/linux-driver-devel/2010/9/9/6886383
>
> DavidR changed oom-score to +1 in such situation.
>
> http://kerneltrap.org/mailarchive/linux-kernel/2010/9/9/4617455
>
> But it is completely bognus. If all process have score=1, oom-killer fall
> back to purely random killer. I expected and explained his patch has
> its problem at half years ago. but he didn't fix yet.
>

The resolution with which the oom killer considers memory is at 0.1% of
system RAM at its highest (smaller when you have a memory controller,
cpuset, or mempolicy constrained oom). It considers a task within 0.1% of
memory of another task to have equal "badness" to kill, we don't break
ties in between that resolution -- it all depends on which one shows up in
the tasklist first. If you disagree with that resolution, which I support
as being high enough, then you may certainly propose a patch to make it
even finer at 0.01%, 0.001%, etc. It would only change oom_badness() to
range between [0,10000], [0,100000], etc.
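
A small illustration of the resolution point (a userspace sketch with an
invented helper and made-up numbers, assuming a 4 GB machine with 4 KB pages,
not oom_badness() itself): two tasks whose usage differs by less than 0.1% of
RAM tie on the [0,1000] scale but are distinguished on a [0,10000] scale.
--------------------------------------------------------
#include <stdio.h>

static unsigned long scaled_badness(unsigned long long task_pages,
                                    unsigned long long allowed_pages,
                                    unsigned long long scale)
{
        return (unsigned long)(task_pages * scale / allowed_pages);
}

int main(void)
{
        unsigned long long allowed = 1048576;           /* 4 GB in 4 KB pages */
        unsigned long long a = 499200, b = 500100;      /* ~3.5 MB apart */

        printf("scale  1000: A=%lu B=%lu\n",            /* 476 vs 476: a tie */
               scaled_badness(a, allowed, 1000),
               scaled_badness(b, allowed, 1000));
        printf("scale 10000: A=%lu B=%lu\n",            /* 4760 vs 4769 */
               scaled_badness(a, allowed, 10000),
               scaled_badness(b, allowed, 10000));
        return 0;
}
--------------------------------------------------------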

> 2) Also half years ago, I did explained oom_adj is used from multiple
> applications. And we can't break them. But DavidR didn't fix.
>

And we didn't. oom_adj is still there and maps linearly to oom_score_adj;
you just can't show a single application where that mapping breaks because
it was based on an actual calculation.

If you would like to cite these "multiple" applications that need to be
converted to use oom_score_adj (I know of udev), please let me know and
if they're open-source applications then I will commit to submitting
patches for them myself. I believe the two year window is sufficient for
everyone else, though.

> 3) Also about four month ago, I and kamezawa-san pointed out his patch
> don't work on memcg. It also haven't been fixed.
>

I don't know what you're referring to here, sorry.

> In the other hand, You can't explain what worth OOM-rewritten patch has.
> Because there is nothing. It is only "powerful"(TM) for Google. but
> instead It has zero worth for every other people. Here is just technical
> issue. Bah.
>

Please see my reply to Figo.zhang where I enumerate the four reasons why
the new userspace tunable is more powerful than oom_adj.


At this point, I can only speculate that your distaste for the new oom
killer is one of disposition; it seems like every time you reply to an
email (or, more regularly, just repost your revert) that you come into it
with the attitude that my response cannot possibly be correct and that the
way you see things is exactly as they should be. If you were to consider
other people's opinions, however, you may find some common ground that can
be met. I certainly did that when I introduced oom_score_adj instead of
replacing oom_adj immediately. I also did it when I removed the forkbomb
detector from the rewrite. I also did it when considering swap in the
heuristic when it initially was only rss. Andrew is in the position where
he has to make a judgment call on what should be included and what
shouldn't and it should be pretty darn clear after you post your revert
the first time, then the second time, then the third time, then the fourth
time, and now the fifth time.

2010-11-15 23:33:49

by Bodo Eggert

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

On Sun, 14 Nov 2010, David Rientjes wrote:

> Also, stating that the new heuristic doesn't address CAP_SYS_RESOURCE
> appropriately isn't a bug report, it's the desired behavior. I eliminated
> all of the arbitrary heuristics in the old heuristic that we had to
> remove internally as well so that it is as predictable as possible and achieves
> the oom killer's sole goal: to kill the most memory-hogging task that is
> eligible to allow memory allocations in the current context to succeed.

> CAP_SYS_RESOURCE threads have full control over their oom killing priority
> by /proc/pid/oom_score_adj

, but unless they were written in the last few months, designed for Linux,
and the author took some time to research each external process invocation,
they cannot be aware of this possibility.

Besides that, if each process is supposed to change the default, the
default is wrong.

> and need no consideration in the heuristic by
> default since it otherwise allows for the probability that multiple tasks
> will need to be killed when a CAP_SYS_RESOURCE thread uses an egregious
> amount of memory.

If it happens to use an egregious amount of memory, it SHOULD score
enough to get killed.

>> The problem is, DavidR patches don't refrect real world usecase at all
>> and breaking them. He can talk about the userland is wrong. but such
>> excuse doesn't solve real world issue. it makes no sense.
>
> As mentioned just a few minutes ago in another thread, there is no
> userspace breakage with the rewrite and you're only complaining here about
> the deprecation of /proc/pid/oom_adj for a period of two years. Until
> it's removed in 2012 or later, it maps to the linear scale that
> oom_score_adj uses rather than its old exponential scale that was
> unusable for prioritization because of (1) the extremely low resolution,
> and (2) the arbitrary heuristics that preceeded it.

1) The exponential scale did have a low resolution.

2) The heuristics were developed using much brain power and much
trial-and-error. You are going back to basics, and some people
are not convinced that this is better. I googled and I did not
find a discussion about how and why the new score was designed
this way.
looking at the output of:
cd /proc; for a in [0-9]*; do
echo `cat $a/oom_score` $a `perl -pes/'\0.*$'// < $a/cmdline`;
done|grep -v ^0|sort -n |less
, I'm not convinced either.

PS) Mapping an exponential value to a linear score is bad. E.g. an
oom_adj of 8 should make a 1-MB process as likely to be killed as
a 256-MB process with oom_adj=0.
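
(A back-of-the-envelope check of that claim against the bitshift code quoted
earlier in the thread; the numbers are illustrative only:)
--------------------------------------------------------
#include <stdio.h>

int main(void)
{
        unsigned long small_kb = 1024;          /* 1 MB task */
        unsigned long big_kb = 256 * 1024;      /* 256 MB task */

        /* points <<= oom_adj with oom_adj = 8 multiplies the score by 2^8 */
        printf("1 MB task at oom_adj=8:   %lu\n", small_kb << 8);
        printf("256 MB task at oom_adj=0: %lu\n", big_kb);
        return 0;
}
--------------------------------------------------------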

PS2) Because I saw this in your presentation PDF: (@udev-people)
The -17 score of udevd is wrong, since it will even prevent
the OOM killer from working correctly if it grows to 100 MB:

Its default OOM score is 13, while root's shell is at 190
and some KDE processes are at 200 000. It will not get killed
under normal circumstances.

If udevd grows enough to score 190 as well, it has a bug
that causes it to eat memory and it needs to be killed. Having
a -17 oom_adj, it will cause the system to fail instead.
Considering udevd's size, an adj of -1 or -2 should be enough on
embedded systems, while desktop systems should not need it.
If you are worried about udevd getting killed, protect it using
a wrapper.

2010-11-15 23:43:21

by Jesper Juhl

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

On Mon, 15 Nov 2010, David Rientjes wrote:

[...]
> If you would like to cite these "multiple" applications that need to be
> converted to use oom_score_adj (I know of udev), please let me know and
> if they're open-source applications then I will commit to submitting
> patches for them myself. I believe the two year window is sufficient for
> everyone else, though.
[...]

I'm not going into the debate about whether or not deprecating one tunable
for two years is sufficient or not. I'm simply going to mention one app
that I know of that needs to be converted to use "oom_score_adj" on my
box :

[jj@dragon ~]$ uname -a
Linux dragon 2.6.37-rc1-ARCH-00542-g0143832-dirty #1 SMP PREEMPT Mon Nov 15 22:01:52 CET 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz GenuineIntel GNU/Linux
[jj@dragon ~]$ dmesg | grep oom_adj
start_kdeinit (1502): /proc/1502/oom_adj is deprecated, please use /proc/1502/oom_score_adj instead.
[jj@dragon ~]$ /usr/lib/kde4/libexec/start_kdeinit --version

Qt: 4.7.1
KDE: 4.5.3 (KDE 4.5.3)



--
Jesper Juhl <[email protected]> http://www.chaosbits.net/
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please.

2010-11-15 23:50:33

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

On Tue, 16 Nov 2010, Bodo Eggert wrote:

> > CAP_SYS_RESOURCE threads have full control over their oom killing priority
> > by /proc/pid/oom_score_adj
>
> , but unless they are written in the last months and designed for linux
> and if the author took some time to research each external process invocation,
> they can not be aware of this possibility.
>

You're clearly wrong, CAP_SYS_RESOURCE has been required to modify oom_adj
for over five years (as long as the git history). 8fb4fc68, merged into
2.6.20, allowed tasks to raise their own oom_adj but not decrease it.
That is unchanged by the rewrite.

> Besides that, if each process is supposed to change the default, the default
> is wrong.
>

That doesn't make any sense; if you want to protect a thread from the oom
killer you're going to need to modify oom_score_adj, the kernel can't know
what you perceive as being vital. Having CAP_SYS_RESOURCE alone does not
imply that, it only allows unbounded access to resources. That's
completely orthogonal to the goal of the oom killer heuristic, which is to
find the most memory-hogging task to kill.

> 1) The exponential scale did have a low resolution.
>
> 2) The heuristics were developed using much brain power and much
> trial-and-error. You are going back to basics, and some people
> are not convinced that this is better. I googled and I did not
> find a discussion about how and why the new score was designed
> this way.
> looking at the output of:
> cd /proc; for a in [0-9]*; do
> echo `cat $a/oom_score` $a `perl -pes/'\0.*$'// < $a/cmdline`;
> done|grep -v ^0|sort -n |less
> , I 'm not convinced, too.
>

The old heuristics were a mixture of arbitrary values that didn't adjust
scores based on a unit and would often cause the incorrect task to be
targeted because there was no clear goal being achieved. The new
heuristic has a solid goal: to identify and kill the most memory-hogging
task that is eligible given the context in which the oom occurs. If you
disagree with that goal and want any of the old heuristics reintroduced,
please show that it makes sense in the oom killer.

> PS) Mapping an exponential value to a linear score is bad. E.g. A
> oom_adj of 8 should make an 1-MB-process as likely to kill as
> a 256-MB-process with oom_adj=0.
>

To show that, you would have to show that an application that exists today
uses an oom_adj for something other than polarization and is based on a
calculation of allowable memory usage. It simply doesn't exist.

> PS2) Because I saw this in your presentation PDF: (@udev-people)
> The -17 score of udevd is wrong, since it will even prevent
> the OOM killer from working correctly if it grows to 100 MB:
>

Threads with CAP_SYS_RESOURCE are free to lower the oom_score_adj of any
thread they deem fit and that includes applications that lower its own
oom_score_adj. The kernel isn't going to prohibit users from setting
their own oom_score_adj.

2010-11-16 00:06:28

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

On Tue, 16 Nov 2010, Jesper Juhl wrote:

> [jj@dragon ~]$ uname -a
> Linux dragon 2.6.37-rc1-ARCH-00542-g0143832-dirty #1 SMP PREEMPT Mon Nov 15 22:01:52 CET 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz GenuineIntel GNU/Linux
> [jj@dragon ~]$ dmesg | grep oom_adj
> start_kdeinit (1502): /proc/1502/oom_adj is deprecated, please use /proc/1502/oom_score_adj instead.
> [jj@dragon ~]$ /usr/lib/kde4/libexec/start_kdeinit --version
>
> Qt: 4.7.1
> KDE: 4.5.3 (KDE 4.5.3)
>

Thanks for the report! I'll get involved with kde-devel and send a patch
to remove this dependency on newer kernels to expedite the process.

[ Others with reports of deprecated use of oom_adj can contact me
privately and I'll find the parties of interest to avoid topics
unrelated to the kernel itself on LKML. ]

2010-11-16 00:13:38

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

On Tue, 16 Nov 2010 00:31:00 +0100, Jesper Juhl said:

> I'm not going into the debate about whether or not deprecating one tunable
> for two years is sufficient or not. I'm simply going to mention one app
> that I know of that needs to be converted to use "oom_score_adj" on my
> box :
>
> [jj@dragon ~]$ uname -a
> Linux dragon 2.6.37-rc1-ARCH-00542-g0143832-dirty #1 SMP PREEMPT Mon Nov 15 22:01:52 CET 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz GenuineIntel GNU/Linux
> [jj@dragon ~]$ dmesg | grep oom_adj
> start_kdeinit (1502): /proc/1502/oom_adj is deprecated, please use /proc/1502/oom_score_adj instead.

Make that 2 common apps:

% uname -a
Linux turing-police.cc.vt.edu 2.6.37-rc1-mmotm1109 #1 SMP PREEMPT Wed Nov 10 12:30:17 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
% dmesg | grep oom
[ 89.981594] sshd (4168): /proc/4168/oom_adj is deprecated, please use /proc/4168/oom_score_adj instead.
% rpm -q openssh
openssh-5.6p1-16.fc15.x86_64

5.6p1 is the latest-n-greatest released version on http://www.openssh.org, so somebody
probably needs to rattle their chain...



2010-11-16 06:43:40

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

On Mon, 15 Nov 2010, [email protected] wrote:

> Make that 2 common apps:
>
> % uname -a
> Linux turing-police.cc.vt.edu 2.6.37-rc1-mmotm1109 #1 SMP PREEMPT Wed Nov 10 12:30:17 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
> % dmesg | grep oom
> [ 89.981594] sshd (4168): /proc/4168/oom_adj is deprecated, please use /proc/4168/oom_score_adj instead.
> % rpm -q openssh
> openssh-5.6p1-16.fc15.x86_64
>
> 5.6p1 is the latest-n-greatest released version on http://www.openssh.org, so somebody
> probably needs to rattle their chain...
>

Thanks, Darren Tucker fixed this a few hours after I reported it on the
openssh bugzilla, the patch is at
https://bugzilla.mindrot.org/show_bug.cgi?id=1838 -- it uses oom_score_adj
if it exists and then falls back to oom_adj if running on an older kernel.
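
For reference, a sketch of that fallback pattern (this is not the actual
openssh patch; the helper name is invented, and -1000/-17 are simply the
documented "disable" values of the two interfaces):
--------------------------------------------------------
#include <errno.h>
#include <stdio.h>

static int oom_protect_self(void)
{
        FILE *f = fopen("/proc/self/oom_score_adj", "w");

        if (f) {                                /* 2.6.36+ interface */
                fprintf(f, "%d\n", -1000);      /* OOM_SCORE_ADJ_MIN */
                return fclose(f);
        }
        if (errno != ENOENT)
                return -1;

        f = fopen("/proc/self/oom_adj", "w");   /* legacy interface */
        if (!f)
                return -1;
        fprintf(f, "%d\n", -17);                /* OOM_DISABLE */
        return fclose(f);
}

int main(void)
{
        return oom_protect_self() ? 1 : 0;
}
--------------------------------------------------------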

2010-11-16 10:04:50

by Martin Knoblauch

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

CC trimmed for sanity ...

----- Original Message ----

> From: David Rientjes <[email protected]>
> To: Jesper Juhl <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>; Andrew Morton
><[email protected]>; Linus Torvalds <[email protected]>;
>LKML <[email protected]>; Ying Han <[email protected]>; Bodo Eggert
><[email protected]>; [email protected]
> Sent: Tue, November 16, 2010 1:06:21 AM
> Subject: Re: [PATCH] Revert oom rewrite series
>
>
> Thanks for the report! I'll get involved with kde-devel and send a patch
> to remove this dependency on newer kernels to expedite the process.
>
> [ Others with reports of deprecated use of oom_adj can contact me
> privately and I'll find the parties of interest to avoid topics
> unrelated to the kernel itself on LKML. ]
David,

another one for your collection. You asked for it :-) This is CentOS-5.5
running on top of kernel 2.6.36, likely out of initrd:

$ dmesg | grep deprecated
[ 2.430330] nash-hotplug (67): /proc/67/oom_adj is deprecated, please use
/proc/67/oom_score_adj instead.

Cheers
Martin

2010-11-16 10:34:02

by Alessandro Suardi

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

On Tue, Nov 16, 2010 at 11:04 AM, Martin Knoblauch
<[email protected]> wrote:
> CC trimmed for sanity ...
>
> ----- Original Message ----
>
>> From: David Rientjes <[email protected]>
>> To: Jesper Juhl <[email protected]>
>> Cc: KOSAKI Motohiro <[email protected]>; Andrew Morton
>><[email protected]>; Linus Torvalds <[email protected]>;
>>LKML <[email protected]>; Ying Han <[email protected]>; Bodo Eggert
>><[email protected]>; [email protected]
>> Sent: Tue, November 16, 2010 1:06:21 AM
>> Subject: Re: [PATCH] Revert oom rewrite series
>>
>>
>> Thanks for the report! I'll get involved with kde-devel and send a patch
>> to remove this dependency on newer kernels to expedite the process.
>>
>> [ Others with reports of deprecated use of oom_adj can contact me
>> privately and I'll find the parties of interest to avoid topics
>> unrelated to the kernel itself on LKML. ]
> David,
>
> another one for your collection. You asked for it :-) This is CentOS-5.5
> running on top of kernel 2.6.36, likely out of initrd:
>
> $ dmesg | grep deprecated
> [    2.430330] nash-hotplug (67): /proc/67/oom_adj is deprecated, please use
> /proc/67/oom_score_adj instead.

...and another, on Fedora 14, 2.6.37-rc1-git11:

auditd (2583): /proc/2583/oom_adj is deprecated, please use
/proc/2583/oom_score_adj instead.

Cheers,

--alessandro

?"There's always a siren singing you to shipwreck"

?? (Radiohead, "There There")

2010-11-16 11:04:37

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

> 5.6p1 is the latest-n-greatest released version on http://www.openssh.org, so somebody
> probably needs to rattle their chain...

But current openssh needs to support old kernels.

This is why this kind of obsoleting doesn't work well. It's not "update
your app" so much as "drop support for older stuff or start doing
complicated crap dependent on version"

and it's why for tiny amounts of code it is the *wrong* thing to force
obsolete stuff especially when it still doesn't seem to have been
properly marked for deprecation in the first place.

2010-11-16 13:04:12

by Florian Mickler

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

On Tue, 16 Nov 2010 11:03:10 +0000
Alan Cox <[email protected]> wrote:

> > 5.6p1 is the latest-n-greatest released version on http://www.openssh.org, so somebody
> > probably needs to rattle their chain...
>
> But current openssh needs to support old kernels.
>
> This is why this kind of obsoleting doesn't work well. It's not "update
> your app" so much as "drop support for older stuff or start doing
> complicated crap dependant on version"
>
> and it's why for tiny amounts of code it is the *wrong* thing to force
> obsolete stuff especially when it still doesn't seem to have been
> properly marked for deprecation in the first place.
>

How does one mark it appropriately?
The commit 51b1bd2 (oom: deprecate oom_adj tunable, see below)
added it to feature-removal-schedule.txt, a patch for
Documentation/ABI has also been provided in the meantime, if I'm not
mistaken.

And there is already a patch for openssh:
https://bugzilla.mindrot.org/show_bug.cgi?id=1838

Regards,
Flo

commit 51b1bd2ace1595b72956224deda349efa880b693
Author: David Rientjes <[email protected]>
Date: Mon Aug 9 17:19:47 2010 -0700

oom: deprecate oom_adj tunable

/proc/pid/oom_adj is now deprecated so that that it may eventually be
removed. The target date for removal is August 2012.

A warning will be printed to the kernel log if a task attempts to use this
interface. Future warning will be suppressed until the kernel is rebooted
to prevent spamming the kernel log.

Signed-off-by: David Rientjes <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2010-11-16 14:57:20

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

> How does one mark it apropriately?
> The commit 51b1bd2 (oom: deprecate oom_adj tunable, see below)
> added it to feature-removal-schedule.txt, a patch for
> Documentation/ABI has also been provided in the meantime, if i'm not
> mistaken.

Yes - so why is it spewing crap, annoying users and trying to irritate
application authors. It's not 2012 yet.

Subject: Re: [PATCH] Revert oom rewrite series

On Mon, 15 Nov 2010 19:13:15 -0500,
[email protected] wrote:

> On Tue, 16 Nov 2010 00:31:00 +0100, Jesper Juhl said:
>
> > I'm not going into the debate about whether or not deprecating one tunable
> > for two years is sufficient or not. I'm simply going to mention one app
> > that I know of that needs to be converted to use "oom_score_adj" on my
> > box :
> >
> > [jj@dragon ~]$ uname -a
> > Linux dragon 2.6.37-rc1-ARCH-00542-g0143832-dirty #1 SMP PREEMPT Mon Nov 15 22:01:52 CET 2010 x86_64 Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz GenuineIntel GNU/Linux
> > [jj@dragon ~]$ dmesg | grep oom_adj
> > start_kdeinit (1502): /proc/1502/oom_adj is deprecated, please use /proc/1502/oom_score_adj instead.
>
> Make that 2 common apps:
>
> % uname -a
> Linux turing-police.cc.vt.edu 2.6.37-rc1-mmotm1109 #1 SMP PREEMPT Wed Nov 10 12:30:17 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
> % dmesg | grep oom
> [ 89.981594] sshd (4168): /proc/4168/oom_adj is deprecated, please use /proc/4168/oom_score_adj instead.
> % rpm -q openssh
> openssh-5.6p1-16.fc15.x86_64
>
> 5.6p1 is the latest-n-greatest released version on http://www.openssh.org, so somebody
> probably needs to rattle their chain...

$ dmesg | grep deprecated
[ 1.473365] udevd (662): /proc/662/oom_adj is deprecated, please use /proc/662/oom_score_adj instead.
$ apt-cache policy udev
udev:
Installed: 151-12.3
Candidate: 151-12.3
Version table:
*** 151-12.3 0
500 http://es.archive.ubuntu.com/ubuntu/ lucid-proposed/main Packages
100 /var/lib/dpkg/status
151-12.2 0
500 http://es.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
151-12 0
500 http://es.archive.ubuntu.com/ubuntu/ lucid/main Packages
$ uname -a
Linux varda 2.6.36-00001-g90d39e9 #145 SMP PREEMPT Wed Oct 20 23:27:44 CEST 2010 x86_64 GNU/Linux

Ubuntu 10.04 LTS

>



2010-11-16 20:57:32

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

On Tue, 16 Nov 2010, Alan Cox wrote:

> Yes - so why is it spewing crap, annoying users and trying to irritate
> application authors. It's not 2012 yet.
>

It's a WARN_ON_ONCE() so it will only spew a single line as a reminder
that the application needs to be updated; would you prefer that to be
suppressed until a year before removal, for example?
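
(For illustration only, a userspace sketch of the once-only behaviour being
described; the kernel side, visible in the fs/proc/base.c hunk of the revert
later in this thread, uses printk_once for this warning:)
--------------------------------------------------------
#include <stdio.h>

static void warn_deprecated_oom_adj(const char *comm, int pid)
{
        static int warned;              /* analogous to the once-only flag */

        if (!warned) {
                warned = 1;
                fprintf(stderr,
                        "%s (%d): /proc/%d/oom_adj is deprecated, "
                        "please use /proc/%d/oom_score_adj instead.\n",
                        comm, pid, pid, pid);
        }
}

int main(void)
{
        warn_deprecated_oom_adj("sshd", 4168);  /* prints the warning */
        warn_deprecated_oom_adj("sshd", 4168);  /* silent from now on */
        return 0;
}
--------------------------------------------------------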

2010-11-16 21:01:13

by Fabio Comolli

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

[CC: list trimmed again for sanity]

Another one:

[ 34.709156] chromium-browse (1439): /proc/1480/oom_adj is
deprecated, please use /proc/1480/oom_score_adj instead.

2.6.37-rc2 - archlinux - package chromium-browser-ppa from AUR




On Tue, Nov 16, 2010 at 9:57 PM, David Rientjes <[email protected]> wrote:
> On Tue, 16 Nov 2010, Alan Cox wrote:
>
>> Yes - so why is it spewing crap, annoying users and trying to irritate
>> application authors. It's not 2012 yet.
>>
>
> It's a WARN_ON_ONCE() so it will only spew a single line as a reminder
> that the application needs to be updated; would you prefer that to be
> suppressed until a year before removal, for example?

2010-11-17 00:06:10

by Bodo Eggert

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

On Mon, 15 Nov 2010, David Rientjes wrote:
> On Tue, 16 Nov 2010, Bodo Eggert wrote:

> > > CAP_SYS_RESOURCE threads have full control over their oom killing priority
> > > by /proc/pid/oom_score_adj
> >
> > , but unless they are written in the last months and designed for linux
> > and if the author took some time to research each external process invocation,
> > they can not be aware of this possibility.
> >
>
> You're clearly wrong, CAP_SYS_RESOURCE has been required to modify oom_adj
> for over five years (as long as the git history). 8fb4fc68, merged into
> 2.6.20, allowed tasks to raise their own oom_adj but not decrease it.
> That is unchanged by the rewrite.

You are misunderstanding me. It was allowed to do this, but it did not need
to do it yet. It was enough to be a well-written POSIX application without
linux-specific OOM hacks for some specific kernel versions.

> > Besides that, if each process is supposed to change the default, the default
> > is wrong.
>
> That doesn't make any sense, if want to protect a thread from the oom
> killer you're going to need to modify oom_score_adj, the kernel can't know
> what you perceive as being vital. Having CAP_SYS_RESOURCE alone does not
> imply that, it only allows unbounded access to resources. That's
> completely orthogonal to the goal of the oom killer heuristic, which is to
> find the most memory-hogging task to kill.

The old oom killer's task was to guess the best victim to kill. For me, it
did a good job (but the system kept thrashing for too long until it kicked
the offender). Looking at CAP_SYS_RESOURCE was one way to recognize
important processes.

> > 1) The exponential scale did have a low resolution.
> >
> > 2) The heuristics were developed using much brain power and much
> > trial-and-error. You are going back to basics, and some people
> > are not convinced that this is better. I googled and I did not
> > find a discussion about how and why the new score was designed
> > this way.
> > looking at the output of:
> > cd /proc; for a in [0-9]*; do
> > echo `cat $a/oom_score` $a `perl -pes/'\0.*$'// < $a/cmdline`;
> > done|grep -v ^0|sort -n |less
> > , I 'm not convinced, too.
> >
>
> The old heuristics were a mixture of arbitrary values that didn't adjust
> scores based on a unit and would often cause the incorrect task to be
> targeted because there was no clear goal being achieved. The new
> heuristic has a solid goal: to identify and kill the most memory-hogging
> task that is eligible given the context in which the oom occurs. If you
> disagree with that goal and want any of the old heursitics reintroduced,
> please show that it makes sense in the oom killer.

The first old OOM killer did the same as you promise the current one does,
except for your bugfixes. That's why it killed the wrong applications and
all the heuristics were added until the complaints stopped.

Of course I did not yet test your OOM killer, maybe it really is better.
Heuristics tend to rot and you did much work to make it right.

I don't want the old OOM killer back, but I don't want you to fall
into the same pits as the pre-old OOM killer used to do.

> > PS) Mapping an exponential value to a linear score is bad. E.g. A
> > oom_adj of 8 should make an 1-MB-process as likely to kill as
> > a 256-MB-process with oom_adj=0.
> >
>
> To show that, you would have to show that an application that exists today
> uses an oom_adj for something other than polarization and is based on a
> calculation of allowable memory usage. It simply doesn't exist.

No such application should exist because the OOM killer should DTRT.
oom_adj was supposed to let the sysadmin lower his mission-critical
DB's score to be just lower than the less-important tasks, or to
point the kernel to his ever-faulty and easily-restarted browser.

> > PS2) Because I saw this in your presentation PDF: (@udev-people)
> > The -17 score of udevd is wrong, since it will even prevent
> > the OOM killer from working correctly if it grows to 100 MB:
> >
>
> Threads with CAP_SYS_RESOURCE are free to lower the oom_score_adj of any
> thread they deem fit and that includes applications that lower its own
> oom_score_adj. The kernel isn't going to prohibit users from setting
> their own oom_score_adj.

My point is: The udev people should not prevent the OOM killer
unconditionally, it has an important task in case something goes wrong.
I just didn't want to start a new thread at that time of day.
--
How do I set my laser printer on stun?

2010-11-17 00:25:46

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

On Wed, 17 Nov 2010, Bodo Eggert wrote:

> The old oom killer's task was to guess the best victim to kill. For me, it
> did a good job (but the system kept thrashing for too long until it kicked
> the offender). Looking at CAP_SYS_RESOURCE was one way to recognize
> important processes.
>

CAP_SYS_RESOURCE does not imply the task is important.

There's a problem when the kernel is oom; killing a thread that is getting
work done is one of the most serious remedies the kernel will ever do to
allow forward progress. In almost all scenarios (except in some cpuset or
memcg configurations), it's a userspace configuration issue that exhausts
memory and the VM finds no other alternative. CAP_SYS_RESOURCE threads
have access to unbounded amounts of resources and thus can use an
extremely large amount of memory very quickly, to the detriment of other
threads that may be as important or more important. Considering them any
different is an unsubstantiated and undefined behavior that should not be
considered in the heuristic _unless_ the administrator or the task itself
tells the kernel via oom_score_adj of its priority.

> > The old heuristics were a mixture of arbitrary values that didn't adjust
> > scores based on a unit and would often cause the incorrect task to be
> > targeted because there was no clear goal being achieved. The new
> > heuristic has a solid goal: to identify and kill the most memory-hogging
> > task that is eligible given the context in which the oom occurs. If you
> > disagree with that goal and want any of the old heursitics reintroduced,
> > please show that it makes sense in the oom killer.
>
> The first old OOM killer did the same as you promise the current one does,
> except for your bugfixes. That's why it killed the wrong applications and
> all the heuristics were added until the complaints stopped.
>

No, the old oom killer did not always kill the application that used the
most amount of memory; it considered other factors with arbitrary point
deductions such as nice level, runtime, CAP_SYS_RAWIO, CAP_SYS_RESOURCE,
etc. We had to remove those heuristics internally in older kernels as
well because it would often allow a task to run away using a massive amount
of memory because of leaks and kill everything else on the system before
targeting the appropriate task. At that point, it left the system with
barely anything running and no work was getting done.
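
(A hedged sketch, not the real badness(), of the kind of arbitrary adjustments
being described; the shapes follow the old Documentation/filesystems/proc.txt
text that the revert at the end of this thread restores. Build with -lm.)
--------------------------------------------------------
#include <math.h>
#include <stdio.h>

static unsigned long old_style_badness(unsigned long total_vm_kb,
                                       unsigned long cpu_time_s,
                                       unsigned long run_time_s,
                                       int reniced, int cap_sys_rawio,
                                       int oom_adj)
{
        unsigned long points = total_vm_kb;     /* start from memory size */

        if (cpu_time_s)                         /* long-running tasks score less */
                points /= (unsigned long)sqrt((double)cpu_time_s);
        if (run_time_s)
                points /= (unsigned long)sqrt(sqrt((double)run_time_s));
        if (reniced)                            /* reniced: score doubles */
                points *= 2;
        if (cap_sys_rawio)                      /* raw hardware access: score / 4 */
                points /= 4;
        if (oom_adj > 0)                        /* the oom_adj bitshift */
                points <<= oom_adj;
        else if (oom_adj < 0)
                points >>= -oom_adj;
        return points;
}

int main(void)
{
        /* 1 GB task, 100 s of CPU, 1 h of runtime, not reniced, has rawio */
        printf("%lu\n", old_style_badness(1048576, 100, 3600, 0, 1, 0));
        return 0;
}
--------------------------------------------------------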

> Off cause I did not yet test your OOM killer, maybe it really is better.
> Heuristics tend to rot and you did much work to make it right.
>
> I don't want the old OOM killer back, but I don't want you to fall
> into the same pits as the pre-old OOM killer used to do.
>

Thanks, and that's why I'm trying to avoid additional heuristics such as
CAP_SYS_RESOURCE where the priority is _implied_ rather than _proven_. If
CAP_SYS_RESOURCE was defined to be more preferred to stay alive, then I'd
have no argument; it isn't.

> > > PS) Mapping an exponential value to a linear score is bad. E.g. A
> > > oom_adj of 8 should make an 1-MB-process as likely to kill as
> > > a 256-MB-process with oom_adj=0.
> > >
> >
> > To show that, you would have to show that an application that exists today
> > uses an oom_adj for something other than polarization and is based on a
> > calculation of allowable memory usage. It simply doesn't exist.
>
> No such application should exist because the OOM killer should DTRT.
> oom_adj was supposed to let the sysadmin lower his mission-critical
> DB's score to be just lower than the less-important tasks, or to
> point the kernel to his ever-faulty and easily-restarted browser.
>

oom_score_adj allows us to define when an application is using more
memory than expected and is often helpful in cpuset, memcg, or mempolicy
constrained cases as well. We'd like to be able to say that 30% of
available memory should be discounted from a particular task that is
expected to use 30% more memory than others without getting preferred.
oom_score_adj can do that, oom_adj could not.
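
(A worked example of that 30% discount under the rewrite's documented scoring;
a userspace sketch only, with the root bonus and other details left out:)
--------------------------------------------------------
#include <stdio.h>

static long score(unsigned long long used_pages,
                  unsigned long long allowed_pages, int oom_score_adj)
{
        long points = (long)(used_pages * 1000 / allowed_pages) + oom_score_adj;

        return points < 0 ? 0 : points;
}

int main(void)
{
        unsigned long long allowed = 1048576;   /* 4 GB in 4 KB pages */

        /* a task expected to use ~30% more memory than its peers */
        printf("peer at 20%%:        %ld\n", score(209715, allowed, 0));
        printf("big task at 50%%:    %ld\n", score(524288, allowed, 0));
        printf("big task, adj=-300: %ld\n", score(524288, allowed, -300));
        return 0;
}
--------------------------------------------------------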

2010-11-17 00:49:11

by Mandeep Singh Baines

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

Bodo Eggert ([email protected]) wrote:
> On Mon, 15 Nov 2010, David Rientjes wrote:
> > On Tue, 16 Nov 2010, Bodo Eggert wrote:
>
> > > > CAP_SYS_RESOURCE threads have full control over their oom killing priority
> > > > by /proc/pid/oom_score_adj
> > >
> > > , but unless they are written in the last months and designed for linux
> > > and if the author took some time to research each external process invocation,
> > > they can not be aware of this possibility.
> > >
> >
> > You're clearly wrong, CAP_SYS_RESOURCE has been required to modify oom_adj
> > for over five years (as long as the git history). 8fb4fc68, merged into
> > 2.6.20, allowed tasks to raise their own oom_adj but not decrease it.
> > That is unchanged by the rewrite.
>
> You are misunderstanding me. It was allowed to do this, but it did not need
> to do it yet. It was enough to be a well-written POSIX application without
> linux-specific OOM hacks for some specific kernel versions.
>
> > > Besides that, if each process is supposed to change the default, the default
> > > is wrong.
> >
> > That doesn't make any sense, if want to protect a thread from the oom
> > killer you're going to need to modify oom_score_adj, the kernel can't know
> > what you perceive as being vital. Having CAP_SYS_RESOURCE alone does not
> > imply that, it only allows unbounded access to resources. That's
> > completely orthogonal to the goal of the oom killer heuristic, which is to
> > find the most memory-hogging task to kill.
>
> The old oom killer's task was to guess the best victim to kill. For me, it
> did a good job (but the system kept thrashing for too long until it kicked

Here's a patch I've been working on to control thrashing.

http://lkml.org/lkml/2010/10/28/289

It works well for our app: web browser. We'd rather OOM quickly and kill
a browser tab than thrash for a few minutes and then OOM. It works well for
us but I'm working on a more generally useful solution.

> the offender). Looking at CAP_SYS_RESOURCE was one way to recognize
> important processes.
>
> > > 1) The exponential scale did have a low resolution.
> > >
> > > 2) The heuristics were developed using much brain power and much
> > > trial-and-error. You are going back to basics, and some people
> > > are not convinced that this is better. I googled and I did not
> > > find a discussion about how and why the new score was designed
> > > this way.
> > > looking at the output of:
> > > cd /proc; for a in [0-9]*; do
> > > echo `cat $a/oom_score` $a `perl -pes/'\0.*$'// < $a/cmdline`;
> > > done|grep -v ^0|sort -n |less
> > > , I 'm not convinced, too.
> > >
> >
> > The old heuristics were a mixture of arbitrary values that didn't adjust
> > scores based on a unit and would often cause the incorrect task to be
> > targeted because there was no clear goal being achieved. The new
> > heuristic has a solid goal: to identify and kill the most memory-hogging
> > task that is eligible given the context in which the oom occurs. If you
> > disagree with that goal and want any of the old heursitics reintroduced,
> > please show that it makes sense in the oom killer.
>
> The first old OOM killer did the same as you promise the current one does,
> except for your bugfixes. That's why it killed the wrong applications and
> all the heuristics were added until the complaints stopped.
>
> Off cause I did not yet test your OOM killer, maybe it really is better.
> Heuristics tend to rot and you did much work to make it right.
>
> I don't want the old OOM killer back, but I don't want you to fall
> into the same pits as the pre-old OOM killer used to do.
>
> > > PS) Mapping an exponential value to a linear score is bad. E.g. A
> > > oom_adj of 8 should make an 1-MB-process as likely to kill as
> > > a 256-MB-process with oom_adj=0.
> > >
> >
> > To show that, you would have to show that an application that exists today
> > uses an oom_adj for something other than polarization and is based on a
> > calculation of allowable memory usage. It simply doesn't exist.
>
> No such application should exist because the OOM killer should DTRT.
> oom_adj was supposed to let the sysadmin lower his mission-critical
> DB's score to be just lower than the less-important tasks, or to
> point the kernel to his ever-faulty and easily-restarted browser.
>
> > > PS2) Because I saw this in your presentation PDF: (@udev-people)
> > > The -17 score of udevd is wrong, since it will even prevent
> > > the OOM killer from working correctly if it grows to 100 MB:
> > >
> >
> > Threads with CAP_SYS_RESOURCE are free to lower the oom_score_adj of any
> > thread they deem fit and that includes applications that lower its own
> > oom_score_adj. The kernel isn't going to prohibit users from setting
> > their own oom_score_adj.
>
> My point is: The udev people should not prevent the OOM killer
> unconditionally, it has an important task in case something goes wrong.
> I just didn't want to start a new thread at that time of day.
> --
> How do I set my laser printer on stun?

2010-11-17 04:05:10

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

On Tue, 16 Nov 2010 14:55:51 GMT, Alan Cox said:
> > How does one mark it apropriately?
> > The commit 51b1bd2 (oom: deprecate oom_adj tunable, see below)
> > added it to feature-removal-schedule.txt, a patch for
> > Documentation/ABI has also been provided in the meantime, if i'm not
> > mistaken.
>
> Yes - so why is it spewing crap, annoying users and trying to irritate
> application authors. It's not 2012 yet.

Aug 2012 is only 6 kernel releases or so away....

Presumably the whinging is so we start tracking down the offending userspace
and getting it fixed before 2012 gets here. Sticking the warning in just one
or two kernel releases before it becomes official leads to "I can't run the new
kernel because my userspace isn't patched yet". We really can't win here,
we don't whinge and stuff doesn't get tracked down and fixed, we do whinge
and that gets people upset too.



2010-11-23 07:17:20

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

> If you still have a question, please ask me. maybe I can answer all of
> your question.

Zero questions? If so, I'll resend the revert to Linus.


Actually, I don't intend to listen to the shouting. That isn't discussion;
it's only crappy shouting. The Googlers have to think about why not a single
person agrees with their claim. ZERO, even though >20 people discussed it with
them. DavidR seems to keep fanning the flames, but I don't care. He has to
learn that flaming doesn't solve ANYTHING.

And they have to learn the correct way to discuss things, and the difference
between discussion and shouting, and why we have to learn about userland
workloads and avoid any breakage. I'm angry that Googlers frequently break the
kernel and frequently ignore userland claims.


2010-11-23 07:18:37

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

Sorry for the delay.

> On Mon, 15 Nov 2010, KOSAKI Motohiro wrote:
>
> > Of cource, I denied. He seems to think number of email is meaningful than
> > how talk about. but it's incorrect and makes no sense. Why not? Also, He
> > have to talk about logically. "Hey, I think it's not bug" makes no sense.
> > Such claim don't solve anything. userland is still unhappy. Why not?
> > I want to quickly action.
>
> If there are pending complaints or bugs that I haven't addressed, please
> bring them to my attention. To date, I know of no issues that have been
> raised that I have not addressed; you're always free to disagree with my
> position, but in the end you may find that when the kernel moves in a
> different direction that you should begin to accept it.

I can't understand this. Why do I need to ignore userland folks? WHY?
I have no reason to ignore userland complaints. I would rather spare userland
folks the pain than spare kernel developers.


>
> > That said, If anyone want to change userland ABI, Be carefully. They have
> > to investigate userland usecase carefully and avoid to break them carefully
> > again. If someone think "hey, It's no big matter. userland rewritten can solve
> > an issue", I strongly disagree. they don't understand why all of userland
> > applications rewritten is harmful.
>
> You may remember that the initial version of my rewrite replaced oom_adj
> entirely with the new oom_score_adj semantics. Others suggested that it
> be seperated into a new tunable and the old tunable deprecated for a
> lengthy period of time. I accepted that criticism and understood the
> drawbacks of replacing the tunable immediately and followed those
> suggestions. I disagree with you that the deprecation of oom_adj for a
> period of two years is as dramatic as you imply and I disagree that users
> are experiencing problems with the linear scale that it now operates on
> versus the old exponential scale.

Yes and no. People wanted it separated AND the old one not broken.


>
> > 1) About two month ago, Dave hansen observed strange OOM issue because he
> > has a big machine and ALL process are not so big. thus, eventually all
> > process got oom-score=0 and oom-killer didn't work.
> >
> > https://kerneltrap.org/mailarchive/linux-driver-devel/2010/9/9/6886383
> >
> > DavidR changed oom-score to +1 in such situation.
> >
> > http://kerneltrap.org/mailarchive/linux-kernel/2010/9/9/4617455
> >
> > But it is completely bognus. If all process have score=1, oom-killer fall
> > back to purely random killer. I expected and explained his patch has
> > its problem at half years ago. but he didn't fix yet.
> >
>
> The resolution with which the oom killer considers memory is at 0.1% of
> system RAM at its highest (smaller when you have a memory controller,
> cpuset, or mempolicy constrained oom). It considers a task within 0.1% of
> memory of another task to have equal "badness" to kill, we don't break
> ties in between that resolution -- it all depends on which one shows up in
> the tasklist first. If you disagree with that resolution, which I support
> as being high enough, then you may certainly propose a patch to make it
> even finer at 0.01%, 0.001%, etc. It would only change oom_badness() to
> range between [0,10000], [0,100000], etc.

No.
Think of Moore's Law: a ratio-based value will not keep working in the future
anyway. Ten years ago I used a desktop machine with 20 MB of memory and now I'm
using 2 GB. The amount of memory keeps growing, while the size of bash does not
grow nearly as fast.


>
> > 2) Also half years ago, I did explained oom_adj is used from multiple
> > applications. And we can't break them. But DavidR didn't fix.
> >
>
> And we didn't. oom_adj is still there and maps linearly to oom_score_adj;
> you just can't show a single application where that mapping breaks because
> it was based on an actual calculation.
>
> If you would like to cite these "multiple" applications that need to be
> converted to use oom_score_adj (I know of udev), please let me know and
> if they're open-source applications then I will commit to submitting
> patches for them myself. I believe the two year window is sufficient for
> everyone else, though.

If you want that, you have to change userland first, and by yourself. Don't
claim anyone else should be doing the work for you.


> > 3) Also about four month ago, I and kamezawa-san pointed out his patch
> > don't work on memcg. It also haven't been fixed.
>
> I don't know what you're referring to here, sorry.

You should have read my patch. Even though you don't use memcg, we do.



> As kamezawa-san pointed out, This break cgroup and lxr environment.
> He said,
> > Assume 2 proceses A, B which has oom_score_adj of 300 and 0
> > And A uses 200M, B uses 1G of memory under 4G system
> >
> > Under the system.
> > A's socre = (200M *1000)/4G + 300 = 350
> > B's score = (1G * 1000)/4G = 250.
> >
> > In the cpuset, it has 2G of memory.
> > A's score = (200M * 1000)/2G + 300 = 400
> > B's socre = (1G * 1000)/2G = 500
> >
> > This priority-inversion don't happen in current system.
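
(A tiny sketch reproducing kamezawa-san's arithmetic with the rewrite's
proportional scoring, using the same decimal round numbers; everything else,
such as the root bonus, is omitted. It shows the rank swap once the oom happens
inside the smaller cpuset:)
--------------------------------------------------------
#include <stdio.h>

static long score(long used_mb, long allowed_mb, int oom_score_adj)
{
        return used_mb * 1000 / allowed_mb + oom_score_adj;
}

int main(void)
{
        /* Task A: 200 MB, oom_score_adj=300.  Task B: 1000 MB, adj=0. */
        printf("whole system (4000 MB): A=%ld B=%ld\n",
               score(200, 4000, 300), score(1000, 4000, 0));   /* 350 vs 250 */
        printf("cpuset       (2000 MB): A=%ld B=%ld\n",
               score(200, 2000, 300), score(1000, 2000, 0));   /* 400 vs 500 */
        return 0;
}
--------------------------------------------------------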



>
> > In the other hand, You can't explain what worth OOM-rewritten patch has.
> > Because there is nothing. It is only "powerful"(TM) for Google. but
> > instead It has zero worth for every other people. Here is just technical
> > issue. Bah.
> >
>
> Please see my reply to Figo.zhang where I enumerate the four reasons why
> the new userspace tunable is more powerful than oom_adj.

I'm NOT interested in *powerful* crap. Please DON'T talk about which is more
powerful. I can only say that it's useful only for you.



> At this point, I can only speculate that your distaste for the new oom
> killer is one of disposition; it seems like everytime you reply to an
> email (or, more regularly, just repost your revert) that you come into it
> with the attitude that my response cannot possibly be correct and that the
> way you see things is exactly as they should be. If you were to consider
> other people's opinions, however, you may find some common ground that can
> be met. I certainly did that when I introduced oom_score_adj instead of
> replacing oom_adj immediatley. I also did it when I removed the forkbomb
> detector from the rewrite. I also did it when considering swap in the
> heuristic when it initially was only rss. Andrew is in the position where
> he has to make a judgment call on what should be included and what
> shouldn't and it should be pretty darn clear after you post your revert
> the first time, then the second time, then the third time, then the fourth
> time, and now the fifth time.


2010-11-23 23:51:32

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

> 2010/11/13 KOSAKI Motohiro <[email protected]>:
> >
> > Please apply this. this patch revert commits of oom changes since v2.6.35.
>
> I'm not getting involved in this whole flame-war. You need to convince
> Andrew, who has been the person everything went through.

I did.

Therefore, I will resend the patch to you. Thanks.


--------------------------------------------------------------------------
Subject: [PATCH] Revert oom rewrite series

This reverts following commits. They has broke an ABI and made multiple
enduser claim.

9c28ab662a8e3d19d07077ac0a8931c015e8afec Revert "oom: badness heuristic rewrite"
74cd8c6cb3e093c4d67ac3eb3581e246e4981dad Revert "oom: deprecate oom_adj tunable"
79a0bd5796e754c4b4e22071c4edddef3517d010 Revert "memcg: use find_lock_task_mm() in memory cgroups oom"
a465ef80c2a9fe73c85029fcea5c68ffee8dbb69 Revert "oom: always return a badness score of non-zero for eligible tas
516fcbb0c45d943df1b739d3be3d417aee2275f3 Revert "oom: filter unkillable tasks from tasklist dump"
b1c98f95a7954c450dadd809280f86863ea9d05d Revert "oom: add per-mm oom disable count"
fd79f3f47c82a0af5288afe7556905dd171bfc43 Revert "oom: avoid killing a task if a thread sharing its mm cannot be
2d72175528870dcef577db4a2a0b49d819c6eaff Revert "oom: kill all threads sharing oom killed task's mm"
be212960618ddcdb9526ce2cb73fd081fd3e90ea Revert "oom: rewrite error handling for oom_adj and oom_score_adj tunab
1b17c41599c594c7d11ef415a92d47c205fe89ea Revert "oom: fix locking for oom_adj and oom_score_adj"

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
Documentation/feature-removal-schedule.txt | 25 ---
Documentation/filesystems/proc.txt | 97 ++++-----
fs/exec.c | 5 -
fs/proc/base.c | 176 ++--------------
include/linux/memcontrol.h | 8 -
include/linux/mm_types.h | 2 -
include/linux/oom.h | 19 +--
include/linux/sched.h | 3 +-
kernel/exit.c | 3 -
kernel/fork.c | 16 +--
mm/memcontrol.c | 28 +---
mm/oom_kill.c | 323 ++++++++++++++--------------
12 files changed, 227 insertions(+), 478 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index d8f36f9..9af16b9 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -166,31 +166,6 @@ Who: Eric Biederman <[email protected]>

---------------------------

-What: /proc/<pid>/oom_adj
-When: August 2012
-Why: /proc/<pid>/oom_adj allows userspace to influence the oom killer's
- badness heuristic used to determine which task to kill when the kernel
- is out of memory.
-
- The badness heuristic has since been rewritten since the introduction of
- this tunable such that its meaning is deprecated. The value was
- implemented as a bitshift on a score generated by the badness()
- function that did not have any precise units of measure. With the
- rewrite, the score is given as a proportion of available memory to the
- task allocating pages, so using a bitshift which grows the score
- exponentially is, thus, impossible to tune with fine granularity.
-
- A much more powerful interface, /proc/<pid>/oom_score_adj, was
- introduced with the oom killer rewrite that allows users to increase or
- decrease the badness() score linearly. This interface will replace
- /proc/<pid>/oom_adj.
-
- A warning will be emitted to the kernel log if an application uses this
- deprecated interface. After it is printed once, future warnings will be
- suppressed until the kernel is rebooted.
-
----------------------------
-
What: remove EXPORT_SYMBOL(kernel_thread)
When: August 2006
Files: arch/*/kernel/*_ksyms.c
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index e73df27..030e3a1 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -33,8 +33,7 @@ Table of Contents
2 Modifying System Parameters

3 Per-Process Parameters
- 3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
- score
+ 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
3.2 /proc/<pid>/oom_score - Display current oom-killer score
3.3 /proc/<pid>/io - Display the IO accounting fields
3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
@@ -1246,64 +1245,42 @@ of the kernel.
CHAPTER 3: PER-PROCESS PARAMETERS
------------------------------------------------------------------------------

-3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score
---------------------------------------------------------------------------------
-
-These file can be used to adjust the badness heuristic used to select which
-process gets killed in out of memory conditions.
-
-The badness heuristic assigns a value to each candidate task ranging from 0
-(never kill) to 1000 (always kill) to determine which process is targeted. The
-units are roughly a proportion along that range of allowed memory the process
-may allocate from based on an estimation of its current memory and swap use.
-For example, if a task is using all allowed memory, its badness score will be
-1000. If it is using half of its allowed memory, its score will be 500.
-
-There is an additional factor included in the badness score: root
-processes are given 3% extra memory over other tasks.
-
-The amount of "allowed" memory depends on the context in which the oom killer
-was called. If it is due to the memory assigned to the allocating task's cpuset
-being exhausted, the allowed memory represents the set of mems assigned to that
-cpuset. If it is due to a mempolicy's node(s) being exhausted, the allowed
-memory represents the set of mempolicy nodes. If it is due to a memory
-limit (or swap limit) being reached, the allowed memory is that configured
-limit. Finally, if it is due to the entire system being out of memory, the
-allowed memory represents all allocatable resources.
-
-The value of /proc/<pid>/oom_score_adj is added to the badness score before it
-is used to determine which task to kill. Acceptable values range from -1000
-(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows userspace to
-polarize the preference for oom killing either by always preferring a certain
-task or completely disabling it. The lowest possible value, -1000, is
-equivalent to disabling oom killing entirely for that task since it will always
-report a badness score of 0.
-
-Consequently, it is very simple for userspace to define the amount of memory to
-consider for each task. Setting a /proc/<pid>/oom_score_adj value of +500, for
-example, is roughly equivalent to allowing the remainder of tasks sharing the
-same system, cpuset, mempolicy, or memory controller resources to use at least
-50% more memory. A value of -500, on the other hand, would be roughly
-equivalent to discounting 50% of the task's allowed memory from being considered
-as scoring against the task.
-
-For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
-be used to tune the badness score. Its acceptable values range from -16
-(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
-(OOM_DISABLE) to disable oom killing entirely for that task. Its value is
-scaled linearly with /proc/<pid>/oom_score_adj.
-
-Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
-other with its scaled value.
-
-NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
-Documentation/feature-removal-schedule.txt.
-
-Caveat: when a parent task is selected, the oom killer will sacrifice any first
-generation children with seperate address spaces instead, if possible. This
-avoids servers and important system daemons from being killed and loses the
-minimal amount of work.
-
+3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
+------------------------------------------------------
+
+This file can be used to adjust the score used to select which processes
+should be killed in an out-of-memory situation. Giving it a high score will
+increase the likelihood of this process being killed by the oom-killer. Valid
+values are in the range -16 to +15, plus the special value -17, which disables
+oom-killing altogether for this process.
+
+The process to be killed in an out-of-memory situation is selected among all others
+based on its badness score. This value equals the original memory size of the process
+and is then updated according to its CPU time (utime + stime) and the
+run time (uptime - start time). The longer it runs the smaller is the score.
+Badness score is divided by the square root of the CPU time and then by
+the double square root of the run time.
+
+Swapped out tasks are killed first. Half of each child's memory size is added to
+the parent's score if they do not share the same memory. Thus forking servers
+are the prime candidates to be killed. Having only one 'hungry' child will make
+parent less preferable than the child.
+
+/proc/<pid>/oom_score shows process' current badness score.
+
+The following heuristics are then applied:
+ * if the task was reniced, its score doubles
+ * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
+ or CAP_SYS_RAWIO) have their score divided by 4
+ * if oom condition happened in one cpuset and checked process does not belong
+ to it, its score is divided by 8
+ * the resulting score is multiplied by two to the power of oom_adj, i.e.
+ points <<= oom_adj when it is positive and
+ points >>= -(oom_adj) otherwise
+
+The task with the highest badness score is then selected and its children
+are killed, process itself will be killed in an OOM situation when it does
+not have children or some of them disabled oom like described above.

3.2 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------
diff --git a/fs/exec.c b/fs/exec.c
index 99d33a1..47986fb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -54,7 +54,6 @@
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
#include <linux/pipe_fs_i.h>
-#include <linux/oom.h>

#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -766,10 +765,6 @@ static int exec_mmap(struct mm_struct *mm)
tsk->mm = mm;
tsk->active_mm = mm;
activate_mm(active_mm, mm);
- if (old_mm && tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
- atomic_dec(&old_mm->oom_disable_count);
- atomic_inc(&tsk->mm->oom_disable_count);
- }
task_unlock(tsk);
arch_pick_mmap_layout(mm);
if (old_mm) {
diff --git a/fs/proc/base.c b/fs/proc/base.c
index f3d02ca..ed7d18e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -63,7 +63,6 @@
#include <linux/namei.h>
#include <linux/mnt_namespace.h>
#include <linux/mm.h>
-#include <linux/swap.h>
#include <linux/rcupdate.h>
#include <linux/kallsyms.h>
#include <linux/stacktrace.h>
@@ -431,11 +430,12 @@ static const struct file_operations proc_lstats_operations = {
static int proc_oom_score(struct task_struct *task, char *buffer)
{
unsigned long points = 0;
+ struct timespec uptime;

+ do_posix_clock_monotonic_gettime(&uptime);
read_lock(&tasklist_lock);
if (pid_alive(task))
- points = oom_badness(task, NULL, NULL,
- totalram_pages + total_swap_pages);
+ points = badness(task, NULL, NULL, uptime.tv_sec);
read_unlock(&tasklist_lock);
return sprintf(buffer, "%lu\n", points);
}
@@ -1025,74 +1025,36 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
memset(buffer, 0, sizeof(buffer));
if (count > sizeof(buffer) - 1)
count = sizeof(buffer) - 1;
- if (copy_from_user(buffer, buf, count)) {
- err = -EFAULT;
- goto out;
- }
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;

err = strict_strtol(strstrip(buffer), 0, &oom_adjust);
if (err)
- goto out;
+ return -EINVAL;
if ((oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX) &&
- oom_adjust != OOM_DISABLE) {
- err = -EINVAL;
- goto out;
- }
+ oom_adjust != OOM_DISABLE)
+ return -EINVAL;

task = get_proc_task(file->f_path.dentry->d_inode);
- if (!task) {
- err = -ESRCH;
- goto out;
- }
-
- task_lock(task);
- if (!task->mm) {
- err = -EINVAL;
- goto err_task_lock;
- }
-
+ if (!task)
+ return -ESRCH;
if (!lock_task_sighand(task, &flags)) {
- err = -ESRCH;
- goto err_task_lock;
+ put_task_struct(task);
+ return -ESRCH;
}

if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
- err = -EACCES;
- goto err_sighand;
- }
-
- if (oom_adjust != task->signal->oom_adj) {
- if (oom_adjust == OOM_DISABLE)
- atomic_inc(&task->mm->oom_disable_count);
- if (task->signal->oom_adj == OOM_DISABLE)
- atomic_dec(&task->mm->oom_disable_count);
+ unlock_task_sighand(task, &flags);
+ put_task_struct(task);
+ return -EACCES;
}

- /*
- * Warn that /proc/pid/oom_adj is deprecated, see
- * Documentation/feature-removal-schedule.txt.
- */
- printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
- "please use /proc/%d/oom_score_adj instead.\n",
- current->comm, task_pid_nr(current),
- task_pid_nr(task), task_pid_nr(task));
task->signal->oom_adj = oom_adjust;
- /*
- * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
- * value is always attainable.
- */
- if (task->signal->oom_adj == OOM_ADJUST_MAX)
- task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
- else
- task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) /
- -OOM_DISABLE;
-err_sighand:
+
unlock_task_sighand(task, &flags);
-err_task_lock:
- task_unlock(task);
put_task_struct(task);
-out:
- return err < 0 ? err : count;
+
+ return count;
}

static const struct file_operations proc_oom_adjust_operations = {
@@ -1101,106 +1063,6 @@ static const struct file_operations proc_oom_adjust_operations = {
.llseek = generic_file_llseek,
};

-static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
- size_t count, loff_t *ppos)
-{
- struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
- char buffer[PROC_NUMBUF];
- int oom_score_adj = OOM_SCORE_ADJ_MIN;
- unsigned long flags;
- size_t len;
-
- if (!task)
- return -ESRCH;
- if (lock_task_sighand(task, &flags)) {
- oom_score_adj = task->signal->oom_score_adj;
- unlock_task_sighand(task, &flags);
- }
- put_task_struct(task);
- len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj);
- return simple_read_from_buffer(buf, count, ppos, buffer, len);
-}
-
-static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
- size_t count, loff_t *ppos)
-{
- struct task_struct *task;
- char buffer[PROC_NUMBUF];
- unsigned long flags;
- long oom_score_adj;
- int err;
-
- memset(buffer, 0, sizeof(buffer));
- if (count > sizeof(buffer) - 1)
- count = sizeof(buffer) - 1;
- if (copy_from_user(buffer, buf, count)) {
- err = -EFAULT;
- goto out;
- }
-
- err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
- if (err)
- goto out;
- if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
- oom_score_adj > OOM_SCORE_ADJ_MAX) {
- err = -EINVAL;
- goto out;
- }
-
- task = get_proc_task(file->f_path.dentry->d_inode);
- if (!task) {
- err = -ESRCH;
- goto out;
- }
-
- task_lock(task);
- if (!task->mm) {
- err = -EINVAL;
- goto err_task_lock;
- }
-
- if (!lock_task_sighand(task, &flags)) {
- err = -ESRCH;
- goto err_task_lock;
- }
-
- if (oom_score_adj < task->signal->oom_score_adj &&
- !capable(CAP_SYS_RESOURCE)) {
- err = -EACCES;
- goto err_sighand;
- }
-
- if (oom_score_adj != task->signal->oom_score_adj) {
- if (oom_score_adj == OOM_SCORE_ADJ_MIN)
- atomic_inc(&task->mm->oom_disable_count);
- if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
- atomic_dec(&task->mm->oom_disable_count);
- }
- task->signal->oom_score_adj = oom_score_adj;
- /*
- * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
- * always attainable.
- */
- if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
- task->signal->oom_adj = OOM_DISABLE;
- else
- task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) /
- OOM_SCORE_ADJ_MAX;
-err_sighand:
- unlock_task_sighand(task, &flags);
-err_task_lock:
- task_unlock(task);
- put_task_struct(task);
-out:
- return err < 0 ? err : count;
-}
-
-static const struct file_operations proc_oom_score_adj_operations = {
- .read = oom_score_adj_read,
- .write = oom_score_adj_write,
- .llseek = default_llseek,
-};
-
#ifdef CONFIG_AUDITSYSCALL
#define TMPBUFLEN 21
static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
@@ -2779,7 +2641,6 @@ static const struct pid_entry tgid_base_stuff[] = {
#endif
INF("oom_score", S_IRUGO, proc_oom_score),
REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
- REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
#ifdef CONFIG_AUDITSYSCALL
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
@@ -3115,7 +2976,6 @@ static const struct pid_entry tid_base_stuff[] = {
#endif
INF("oom_score", S_IRUGO, proc_oom_score),
REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
- REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
#ifdef CONFIG_AUDITSYSCALL
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUSR, proc_sessionid_operations),
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 159a076..b13fc2a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -124,8 +124,6 @@ static inline bool mem_cgroup_disabled(void)
void mem_cgroup_update_file_mapped(struct page *page, int val);
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask);
-u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
-
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

@@ -305,12 +303,6 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
return 0;
}

-static inline
-u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
-{
- return 0;
-}
-
#endif /* CONFIG_CGROUP_MEM_CONT */

#endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bb7288a..cb57d65 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -310,8 +310,6 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
- /* How many tasks sharing this mm are OOM_DISABLE */
- atomic_t oom_disable_count;
};

/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5e3aa83..40e5e3a 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -1,27 +1,14 @@
#ifndef __INCLUDE_LINUX_OOM_H
#define __INCLUDE_LINUX_OOM_H

-/*
- * /proc/<pid>/oom_adj is deprecated, see
- * Documentation/feature-removal-schedule.txt.
- *
- * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
- */
+/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
#define OOM_DISABLE (-17)
/* inclusive */
#define OOM_ADJUST_MIN (-16)
#define OOM_ADJUST_MAX 15

-/*
- * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
- * pid.
- */
-#define OOM_SCORE_ADJ_MIN (-1000)
-#define OOM_SCORE_ADJ_MAX 1000
-
#ifdef __KERNEL__

-#include <linux/sched.h>
#include <linux/types.h>
#include <linux/nodemask.h>

@@ -40,8 +27,6 @@ enum oom_constraint {
CONSTRAINT_MEMCG,
};

-extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long totalpages);
extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

@@ -66,8 +51,6 @@ static inline void oom_killer_enable(void)
extern unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
const nodemask_t *nodemask, unsigned long uptime);

-extern struct task_struct *find_lock_task_mm(struct task_struct *p);
-
/* sysctls */
extern int sysctl_oom_dump_tasks;
extern int sysctl_oom_kill_allocating_task;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d0036e5..a35acb6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -624,8 +624,7 @@ struct signal_struct {
struct tty_audit_buf *tty_audit_buf;
#endif

- int oom_adj; /* OOM kill score adjustment (bit shift) */
- int oom_score_adj; /* OOM kill score adjustment */
+ int oom_adj; /* OOM kill score adjustment (bit shift) */

struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
diff --git a/kernel/exit.c b/kernel/exit.c
index 21aa7b3..c806406 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -50,7 +50,6 @@
#include <linux/perf_event.h>
#include <trace/events/sched.h>
#include <linux/hw_breakpoint.h>
-#include <linux/oom.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -696,8 +695,6 @@ static void exit_mm(struct task_struct * tsk)
enter_lazy_tlb(mm, current);
/* We don't want this task to be frozen prematurely */
clear_freeze_flag(tsk);
- if (tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
- atomic_dec(&mm->oom_disable_count);
task_unlock(tsk);
mm_update_next_owner(mm);
mmput(mm);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..cca5e8b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -65,7 +65,6 @@
#include <linux/perf_event.h>
#include <linux/posix-timers.h>
#include <linux/user-return-notifier.h>
-#include <linux/oom.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -489,7 +488,6 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
mm->cached_hole_size = ~0UL;
mm_init_aio(mm);
mm_init_owner(mm, p);
- atomic_set(&mm->oom_disable_count, 0);

if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
@@ -743,8 +741,6 @@ good_mm:
/* Initializing for Swap token stuff */
mm->token_priority = 0;
mm->last_interval = 0;
- if (tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
- atomic_inc(&mm->oom_disable_count);

tsk->mm = mm;
tsk->active_mm = mm;
@@ -906,7 +902,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
tty_audit_fork(sig);

sig->oom_adj = current->signal->oom_adj;
- sig->oom_score_adj = current->signal->oom_score_adj;

mutex_init(&sig->cred_guard_mutex);

@@ -1305,13 +1300,8 @@ bad_fork_cleanup_io:
bad_fork_cleanup_namespaces:
exit_task_namespaces(p);
bad_fork_cleanup_mm:
- if (p->mm) {
- task_lock(p);
- if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
- atomic_dec(&p->mm->oom_disable_count);
- task_unlock(p);
+ if (p->mm)
mmput(p->mm);
- }
bad_fork_cleanup_signal:
if (!(clone_flags & CLONE_THREAD))
free_signal_struct(p->signal);
@@ -1704,10 +1694,6 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
active_mm = current->active_mm;
current->mm = new_mm;
current->active_mm = new_mm;
- if (current->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
- atomic_dec(&mm->oom_disable_count);
- atomic_inc(&new_mm->oom_disable_count);
- }
activate_mm(active_mm, new_mm);
new_mm = mm;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9a99cfa..c628370 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -47,7 +47,6 @@
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
-#include <linux/oom.h>
#include "internal.h"

#include <asm/uaccess.h>
@@ -917,13 +916,10 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
{
int ret;
struct mem_cgroup *curr = NULL;
- struct task_struct *p;

- p = find_lock_task_mm(task);
- if (!p)
- return 0;
- curr = try_get_mem_cgroup_from_mm(p->mm);
- task_unlock(p);
+ task_lock(task);
+ curr = try_get_mem_cgroup_from_mm(task->mm);
+ task_unlock(task);
if (!curr)
return 0;
/*
@@ -1297,24 +1293,6 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
}

/*
- * Return the memory (and swap, if configured) limit for a memcg.
- */
-u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
-{
- u64 limit;
- u64 memsw;
-
- limit = res_counter_read_u64(&memcg->res, RES_LIMIT) +
- total_swap_pages;
- memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
- /*
- * If memsw is finite and limits the amount of swap space available
- * to this memcg, return that limit.
- */
- return min(limit, memsw);
-}
-
-/*
* Visit the first child (need not be the first child as per the ordering
* of the cgroup list, since we track last_scanned_child) of @mem and use
* that to reclaim free pages from.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7dcca55..f251ddb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -4,8 +4,6 @@
* Copyright (C) 1998,2000 Rik van Riel
* Thanks go out to Claus Fischer for some serious inspiration and
* for goading me into coding this file...
- * Copyright (C) 2010 Google, Inc.
- * Rewritten by David Rientjes
*
* The routines in this file are used to kill a process when
* we're seriously out of memory. This gets called from __alloc_pages()
@@ -36,6 +34,7 @@ int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;
static DEFINE_SPINLOCK(zone_scan_lock);
+/* #define DEBUG */

#ifdef CONFIG_NUMA
/**
@@ -106,7 +105,7 @@ static void boost_dying_task_prio(struct task_struct *p,
* pointer. Return p, or any of its subthreads with a valid ->mm, with
* task_lock() held.
*/
-struct task_struct *find_lock_task_mm(struct task_struct *p)
+static struct task_struct *find_lock_task_mm(struct task_struct *p)
{
struct task_struct *t = p;

@@ -121,8 +120,8 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
}

/* return true if the task is not adequate as candidate victim task. */
-static bool oom_unkillable_task(struct task_struct *p,
- const struct mem_cgroup *mem, const nodemask_t *nodemask)
+static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
+ const nodemask_t *nodemask)
{
if (is_global_init(p))
return true;
@@ -141,82 +140,137 @@ static bool oom_unkillable_task(struct task_struct *p,
}

/**
- * oom_badness - heuristic function to determine which candidate task to kill
+ * badness - calculate a numeric value for how bad this task has been
* @p: task struct of which task we should calculate
- * @totalpages: total present RAM allowed for page allocation
+ * @uptime: current uptime in seconds
*
- * The heuristic for determining which task to kill is made to be as simple and
- * predictable as possible. The goal is to return the highest value for the
- * task consuming the most memory to avoid subsequent oom failures.
+ * The formula used is relatively simple and documented inline in the
+ * function. The main rationale is that we want to select a good task
+ * to kill when we run out of memory.
+ *
+ * Good in this context means that:
+ * 1) we lose the minimum amount of work done
+ * 2) we recover a large amount of memory
+ * 3) we don't kill anything innocent of eating tons of memory
+ * 4) we want to kill the minimum amount of processes (one)
+ * 5) we try to kill the process the user expects us to kill, this
+ * algorithm has been meticulously tuned to meet the principle
+ * of least surprise ... (be careful when you change it)
*/
-unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long totalpages)
+unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
+ const nodemask_t *nodemask, unsigned long uptime)
{
- int points;
+ unsigned long points, cpu_time, run_time;
+ struct task_struct *child;
+ struct task_struct *c, *t;
+ int oom_adj = p->signal->oom_adj;
+ struct task_cputime task_time;
+ unsigned long utime;
+ unsigned long stime;

if (oom_unkillable_task(p, mem, nodemask))
return 0;
+ if (oom_adj == OOM_DISABLE)
+ return 0;

p = find_lock_task_mm(p);
if (!p)
return 0;

/*
- * Shortcut check for a thread sharing p->mm that is OOM_SCORE_ADJ_MIN
- * so the entire heuristic doesn't need to be executed for something
- * that cannot be killed.
+ * The memory size of the process is the basis for the badness.
*/
- if (atomic_read(&p->mm->oom_disable_count)) {
- task_unlock(p);
- return 0;
- }
+ points = p->mm->total_vm;
+ task_unlock(p);

/*
- * When the PF_OOM_ORIGIN bit is set, it indicates the task should have
- * priority for oom killing.
+ * swapoff can easily use up all memory, so kill those first.
*/
- if (p->flags & PF_OOM_ORIGIN) {
- task_unlock(p);
- return 1000;
- }
+ if (p->flags & PF_OOM_ORIGIN)
+ return ULONG_MAX;

/*
- * The memory controller may have a limit of 0 bytes, so avoid a divide
- * by zero, if necessary.
+ * Processes which fork a lot of child processes are likely
+ * a good choice. We add half the vmsize of the children if they
+ * have an own mm. This prevents forking servers to flood the
+ * machine with an endless amount of children. In case a single
+ * child is eating the vast majority of memory, adding only half
+ * to the parents will make the child our kill candidate of choice.
*/
- if (!totalpages)
- totalpages = 1;
+ t = p;
+ do {
+ list_for_each_entry(c, &t->children, sibling) {
+ child = find_lock_task_mm(c);
+ if (child) {
+ if (child->mm != p->mm)
+ points += child->mm->total_vm/2 + 1;
+ task_unlock(child);
+ }
+ }
+ } while_each_thread(p, t);

/*
- * The baseline for the badness score is the proportion of RAM that each
- * task's rss and swap space use.
+ * CPU time is in tens of seconds and run time is in thousands
+ * of seconds. There is no particular reason for this other than
+ * that it turned out to work very well in practice.
*/
- points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 /
- totalpages;
- task_unlock(p);
+ thread_group_cputime(p, &task_time);
+ utime = cputime_to_jiffies(task_time.utime);
+ stime = cputime_to_jiffies(task_time.stime);
+ cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
+
+
+ if (uptime >= p->start_time.tv_sec)
+ run_time = (uptime - p->start_time.tv_sec) >> 10;
+ else
+ run_time = 0;
+
+ if (cpu_time)
+ points /= int_sqrt(cpu_time);
+ if (run_time)
+ points /= int_sqrt(int_sqrt(run_time));

/*
- * Root processes get 3% bonus, just like the __vm_enough_memory()
- * implementation used by LSMs.
+ * Niced processes are most likely less important, so double
+ * their badness points.
*/
- if (has_capability_noaudit(p, CAP_SYS_ADMIN))
- points -= 30;
+ if (task_nice(p) > 0)
+ points *= 2;

/*
- * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
- * either completely disable oom killing or always prefer a certain
- * task.
+ * Superuser processes are usually more important, so we make it
+ * less likely that we kill those.
*/
- points += p->signal->oom_score_adj;
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+ has_capability_noaudit(p, CAP_SYS_RESOURCE))
+ points /= 4;

/*
- * Never return 0 for an eligible task that may be killed since it's
- * possible that no single user task uses more than 0.1% of memory and
- * no single admin tasks uses more than 3.0%.
+ * We don't want to kill a process with direct hardware access.
+ * Not only could that mess up the hardware, but usually users
+ * tend to only have this flag set on applications they think
+ * of as important.
*/
- if (points <= 0)
- return 1;
- return (points < 1000) ? points : 1000;
+ if (has_capability_noaudit(p, CAP_SYS_RAWIO))
+ points /= 4;
+
+ /*
+ * Adjust the score by oom_adj.
+ */
+ if (oom_adj) {
+ if (oom_adj > 0) {
+ if (!points)
+ points = 1;
+ points <<= oom_adj;
+ } else
+ points >>= -(oom_adj);
+ }
+
+#ifdef DEBUG
+ printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
+ p->pid, p->comm, points);
+#endif
+ return points;
}

/*
@@ -224,20 +278,12 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
*/
#ifdef CONFIG_NUMA
static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask, nodemask_t *nodemask,
- unsigned long *totalpages)
+ gfp_t gfp_mask, nodemask_t *nodemask)
{
struct zone *zone;
struct zoneref *z;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
- bool cpuset_limited = false;
- int nid;
-
- /* Default to all available memory */
- *totalpages = totalram_pages + total_swap_pages;

- if (!zonelist)
- return CONSTRAINT_NONE;
/*
* Reach here only when __GFP_NOFAIL is used. So, we should avoid
* to kill current.We have to random task kill in this case.
@@ -247,37 +293,26 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
return CONSTRAINT_NONE;

/*
- * This is not a __GFP_THISNODE allocation, so a truncated nodemask in
- * the page allocator means a mempolicy is in effect. Cpuset policy
- * is enforced in get_page_from_freelist().
+ * The nodemask here is a nodemask passed to alloc_pages(). Now,
+ * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
+ * feature. mempolicy is an only user of nodemask here.
+ * check mempolicy's nodemask contains all N_HIGH_MEMORY
*/
- if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
- *totalpages = total_swap_pages;
- for_each_node_mask(nid, *nodemask)
- *totalpages += node_spanned_pages(nid);
+ if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
return CONSTRAINT_MEMORY_POLICY;
- }

/* Check this allocation failure is caused by cpuset's wall function */
for_each_zone_zonelist_nodemask(zone, z, zonelist,
high_zoneidx, nodemask)
if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
- cpuset_limited = true;
+ return CONSTRAINT_CPUSET;

- if (cpuset_limited) {
- *totalpages = total_swap_pages;
- for_each_node_mask(nid, cpuset_current_mems_allowed)
- *totalpages += node_spanned_pages(nid);
- return CONSTRAINT_CPUSET;
- }
return CONSTRAINT_NONE;
}
#else
static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask, nodemask_t *nodemask,
- unsigned long *totalpages)
+ gfp_t gfp_mask, nodemask_t *nodemask)
{
- *totalpages = totalram_pages + total_swap_pages;
return CONSTRAINT_NONE;
}
#endif
@@ -288,16 +323,17 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
*
* (not docbooked, we don't want this one cluttering up the manual)
*/
-static struct task_struct *select_bad_process(unsigned int *ppoints,
- unsigned long totalpages, struct mem_cgroup *mem,
- const nodemask_t *nodemask)
+static struct task_struct *select_bad_process(unsigned long *ppoints,
+ struct mem_cgroup *mem, const nodemask_t *nodemask)
{
struct task_struct *p;
struct task_struct *chosen = NULL;
+ struct timespec uptime;
*ppoints = 0;

+ do_posix_clock_monotonic_gettime(&uptime);
for_each_process(p) {
- unsigned int points;
+ unsigned long points;

if (oom_unkillable_task(p, mem, nodemask))
continue;
@@ -329,11 +365,11 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
return ERR_PTR(-1UL);

chosen = p;
- *ppoints = 1000;
+ *ppoints = ULONG_MAX;
}

- points = oom_badness(p, mem, nodemask, totalpages);
- if (points > *ppoints) {
+ points = badness(p, mem, nodemask, uptime.tv_sec);
+ if (points > *ppoints || !chosen) {
chosen = p;
*ppoints = points;
}
@@ -345,24 +381,27 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
/**
* dump_tasks - dump current memory state of all system tasks
* @mem: current's memory controller, if constrained
- * @nodemask: nodemask passed to page allocator for mempolicy ooms
*
- * Dumps the current memory state of all eligible tasks. Tasks not in the same
- * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes
- * are not shown.
+ * Dumps the current memory state of all system tasks, excluding kernel threads.
* State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
- * value, oom_score_adj value, and name.
+ * score, and name.
+ *
+ * If the actual is non-NULL, only tasks that are a member of the mem_cgroup are
+ * shown.
*
* Call with tasklist_lock read-locked.
*/
-static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
+static void dump_tasks(const struct mem_cgroup *mem)
{
struct task_struct *p;
struct task_struct *task;

- pr_info("[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name\n");
+ printk(KERN_INFO "[ pid ] uid tgid total_vm rss cpu oom_adj "
+ "name\n");
for_each_process(p) {
- if (oom_unkillable_task(p, mem, nodemask))
+ if (p->flags & PF_KTHREAD)
+ continue;
+ if (mem && !task_in_mem_cgroup(p, mem))
continue;

task = find_lock_task_mm(p);
@@ -375,69 +414,43 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
continue;
}

- pr_info("[%5d] %5d %5d %8lu %8lu %3u %3d %5d %s\n",
+ pr_info("[%5d] %5d %5d %8lu %8lu %3u %3d %s\n",
task->pid, task_uid(task), task->tgid,
task->mm->total_vm, get_mm_rss(task->mm),
- task_cpu(task), task->signal->oom_adj,
- task->signal->oom_score_adj, task->comm);
+ task_cpu(task), task->signal->oom_adj, task->comm);
task_unlock(task);
}
}

static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
- struct mem_cgroup *mem, const nodemask_t *nodemask)
+ struct mem_cgroup *mem)
{
task_lock(current);
pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
- "oom_adj=%d, oom_score_adj=%d\n",
- current->comm, gfp_mask, order, current->signal->oom_adj,
- current->signal->oom_score_adj);
+ "oom_adj=%d\n",
+ current->comm, gfp_mask, order, current->signal->oom_adj);
cpuset_print_task_mems_allowed(current);
task_unlock(current);
dump_stack();
mem_cgroup_print_oom_info(mem, p);
show_mem();
if (sysctl_oom_dump_tasks)
- dump_tasks(mem, nodemask);
+ dump_tasks(mem);
}

#define K(x) ((x) << (PAGE_SHIFT-10))
static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
{
- struct task_struct *q;
- struct mm_struct *mm;
-
p = find_lock_task_mm(p);
if (!p)
return 1;

- /* mm cannot be safely dereferenced after task_unlock(p) */
- mm = p->mm;
-
pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
task_pid_nr(p), p->comm, K(p->mm->total_vm),
K(get_mm_counter(p->mm, MM_ANONPAGES)),
K(get_mm_counter(p->mm, MM_FILEPAGES)));
task_unlock(p);

- /*
- * Kill all processes sharing p->mm in other thread groups, if any.
- * They don't get access to memory reserves or a higher scheduler
- * priority, though, to avoid depletion of all memory or task
- * starvation. This prevents mm->mmap_sem livelock when an oom killed
- * task cannot exit because it requires the semaphore and its contended
- * by another thread trying to allocate memory itself. That thread will
- * now get access to memory reserves since it has a pending fatal
- * signal.
- */
- for_each_process(q)
- if (q->mm == mm && !same_thread_group(q, p)) {
- task_lock(q); /* Protect ->comm from prctl() */
- pr_err("Kill process %d (%s) sharing same memory\n",
- task_pid_nr(q), q->comm);
- task_unlock(q);
- force_sig(SIGKILL, q);
- }

set_tsk_thread_flag(p, TIF_MEMDIE);
force_sig(SIGKILL, p);
@@ -454,17 +467,17 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
#undef K

static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
- unsigned int points, unsigned long totalpages,
- struct mem_cgroup *mem, nodemask_t *nodemask,
- const char *message)
+ unsigned long points, struct mem_cgroup *mem,
+ nodemask_t *nodemask, const char *message)
{
struct task_struct *victim = p;
struct task_struct *child;
struct task_struct *t = p;
- unsigned int victim_points = 0;
+ unsigned long victim_points = 0;
+ struct timespec uptime;

if (printk_ratelimit())
- dump_header(p, gfp_mask, order, mem, nodemask);
+ dump_header(p, gfp_mask, order, mem);

/*
* If the task is already exiting, don't alarm the sysadmin or kill
@@ -477,7 +490,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
}

task_lock(p);
- pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+ pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
message, task_pid_nr(p), p->comm, points);
task_unlock(p);

@@ -487,15 +500,14 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
* parent. This attempts to lose the minimal amount of work done while
* still freeing memory.
*/
+ do_posix_clock_monotonic_gettime(&uptime);
do {
list_for_each_entry(child, &t->children, sibling) {
- unsigned int child_points;
+ unsigned long child_points;

- /*
- * oom_badness() returns 0 if the thread is unkillable
- */
- child_points = oom_badness(child, mem, nodemask,
- totalpages);
+ /* badness() returns 0 if the thread is unkillable */
+ child_points = badness(child, mem, nodemask,
+ uptime.tv_sec);
if (child_points > victim_points) {
victim = child;
victim_points = child_points;
@@ -510,7 +522,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
- int order, const nodemask_t *nodemask)
+ int order)
{
if (likely(!sysctl_panic_on_oom))
return;
@@ -524,7 +536,7 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
return;
}
read_lock(&tasklist_lock);
- dump_header(NULL, gfp_mask, order, NULL, nodemask);
+ dump_header(NULL, gfp_mask, order, NULL);
read_unlock(&tasklist_lock);
panic("Out of memory: %s panic_on_oom is enabled\n",
sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
@@ -533,19 +545,17 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
{
- unsigned long limit;
- unsigned int points = 0;
+ unsigned long points = 0;
struct task_struct *p;

- check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0, NULL);
- limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT;
+ check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0);
read_lock(&tasklist_lock);
retry:
- p = select_bad_process(&points, limit, mem, NULL);
+ p = select_bad_process(&points, mem, NULL);
if (!p || PTR_ERR(p) == -1UL)
goto out;

- if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, NULL,
+ if (oom_kill_process(p, gfp_mask, 0, points, mem, NULL,
"Memory cgroup out of memory"))
goto retry;
out:
@@ -669,11 +679,9 @@ static void clear_system_oom(void)
void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
int order, nodemask_t *nodemask)
{
- const nodemask_t *mpol_mask;
struct task_struct *p;
- unsigned long totalpages;
unsigned long freed = 0;
- unsigned int points;
+ unsigned long points;
enum oom_constraint constraint = CONSTRAINT_NONE;
int killed = 0;

@@ -697,40 +705,41 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
- constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
- &totalpages);
- mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
- check_panic_on_oom(constraint, gfp_mask, order, mpol_mask);
+ if (zonelist)
+ constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+ check_panic_on_oom(constraint, gfp_mask, order);

read_lock(&tasklist_lock);
if (sysctl_oom_kill_allocating_task &&
!oom_unkillable_task(current, NULL, nodemask) &&
- current->mm && !atomic_read(&current->mm->oom_disable_count)) {
+ (current->signal->oom_adj != OOM_DISABLE)) {
/*
* oom_kill_process() needs tasklist_lock held. If it returns
* non-zero, current could not be killed so we must fallback to
* the tasklist scan.
*/
- if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
- NULL, nodemask,
+ if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
+ nodemask,
"Out of memory (oom_kill_allocating_task)"))
goto out;
}

retry:
- p = select_bad_process(&points, totalpages, NULL, mpol_mask);
+ p = select_bad_process(&points, NULL,
+ constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
+ NULL);
if (PTR_ERR(p) == -1UL)
goto out;

/* Found nothing?!?! Either we hang forever, or we panic. */
if (!p) {
- dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
+ dump_header(NULL, gfp_mask, order, NULL);
read_unlock(&tasklist_lock);
panic("Out of memory and no killable processes...\n");
}

- if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
- nodemask, "Out of memory"))
+ if (oom_kill_process(p, gfp_mask, order, points, NULL, nodemask,
+ "Out of memory"))
goto retry;
killed = 1;
out:
--
1.6.5.2
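
For concreteness, here is a minimal userspace sketch of the legacy
/proc/<pid>/oom_adj interface that this revert restores. The values and the
CAP_SYS_RESOURCE requirement come from the hunks above; the program itself is
only illustrative and keeps error handling to a bare minimum:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        /*
         * Lowering oom_adj below its current value requires
         * CAP_SYS_RESOURCE, so this typically has to run as root.
         */
        FILE *f = fopen("/proc/self/oom_adj", "w");

        if (!f) {
                perror("fopen /proc/self/oom_adj");
                return EXIT_FAILURE;
        }

        /* -17 is OOM_DISABLE; -16..+15 merely bias the badness score. */
        fprintf(f, "%d\n", -17);
        fclose(f);
        return EXIT_SUCCESS;
}

Writing 0 back restores the default behaviour; under the restored heuristic a
positive value shifts the badness score left (points <<= oom_adj) and a
negative value shifts it right, as described in the proc.txt hunk above.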






2010-11-28 01:45:42

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

On Tue, 23 Nov 2010, KOSAKI Motohiro wrote:

> > You may remember that the initial version of my rewrite replaced oom_adj
> > entirely with the new oom_score_adj semantics. Others suggested that it
> > be seperated into a new tunable and the old tunable deprecated for a
> > lengthy period of time. I accepted that criticism and understood the
> > drawbacks of replacing the tunable immediately and followed those
> > suggestions. I disagree with you that the deprecation of oom_adj for a
> > period of two years is as dramatic as you imply and I disagree that users
> > are experiencing problems with the linear scale that it now operates on
> > versus the old exponential scale.
>
> Yes and no. People wanted them separated AND the old one left working.
>

You're arguing on behalf of applications that don't exist.

> > > 1) About two months ago, Dave Hansen observed a strange OOM issue: he
> > > has a big machine and none of its processes are particularly big, so
> > > eventually every process got oom-score=0 and the oom-killer didn't work.
> > >
> > > https://kerneltrap.org/mailarchive/linux-driver-devel/2010/9/9/6886383
> > >
> > > DavidR changed the oom-score to +1 in that situation.
> > >
> > > http://kerneltrap.org/mailarchive/linux-kernel/2010/9/9/4617455
> > >
> > > But that is completely bogus. If all processes have score=1, the oom-killer
> > > falls back to being a purely random killer. I predicted and explained this
> > > problem with his patch half a year ago, but he still hasn't fixed it.
> > >
> >
> > The resolution with which the oom killer considers memory is at 0.1% of
> > system RAM at its highest (smaller when you have a memory controller,
> > cpuset, or mempolicy-constrained oom). It considers a task within 0.1% of
> > memory of another task to have equal "badness" to kill; we don't break
> > ties within that resolution -- it all depends on which one shows up in
> > the tasklist first. If you disagree with that resolution, which I support
> > as being high enough, then you may certainly propose a patch to make it
> > even finer at 0.01%, 0.001%, etc. It would only change oom_badness() to
> > range between [0,10000], [0,100000], etc.
>
> No.
> Think of Moore's Law. A ratio-based (proportional) value will not be able to
> work in the future anyway. 10 years ago I used a desktop machine with 20MB of
> memory, and now I'm using 2GB. The amount of memory keeps growing and growing,
> and the size of bash isn't growing nearly as fast.
>

If you'd like to suggest an increase to the upper-bound of the badness
score, please do so, although I don't think we need to break ties amongst
tasks that differ by less than 0.1% of the system's capacity.
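
For concreteness: the reverted oom_badness() computes its baseline as
(rss + swap) * 1000 / totalpages, per the '-' lines in mm/oom_kill.c above.
A rough userspace sketch -- the helper name and the 16 GiB machine size are
made-up assumptions -- shows how small tasks collapse onto the same score:

#include <stdio.h>

/* Illustrative helper, not a kernel function. */
static unsigned int proportional_points(unsigned long rss_pages,
                                        unsigned long swap_pages,
                                        unsigned long totalpages)
{
        return (rss_pages + swap_pages) * 1000 / totalpages;
}

int main(void)
{
        /* Assume 16 GiB of RAM+swap in 4 KiB pages; 0.1% is about 16 MiB. */
        unsigned long totalpages = 4UL * 1024 * 1024;

        /* A 2 MiB task and a 16 MiB task both truncate to 0 points. */
        printf("2 MiB task:  %u\n", proportional_points(512, 0, totalpages));
        printf("16 MiB task: %u\n", proportional_points(4096, 0, totalpages));
        return 0;
}

This is the behaviour behind the Dave Hansen report cited earlier in the
thread: on a large machine every ordinary task rounds down to the same score,
and selection then depends only on tasklist order.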

2010-11-30 13:05:03

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

> On Tue, 23 Nov 2010, KOSAKI Motohiro wrote:
>
> > > You may remember that the initial version of my rewrite replaced oom_adj
> > > entirely with the new oom_score_adj semantics. Others suggested that it
> > > be separated into a new tunable and the old tunable deprecated for a
> > > lengthy period of time. I accepted that criticism and understood the
> > > drawbacks of replacing the tunable immediately and followed those
> > > suggestions. I disagree with you that the deprecation of oom_adj for a
> > > period of two years is as dramatic as you imply and I disagree that users
> > > are experiencing problems with the linear scale that it now operates on
> > > versus the old exponential scale.
> >
> > Yes and no. People wanted them separated AND the old one left working.
> >
>
> You're arguing on behalf of applications that don't exist.

Why?
You actually got the bug report.


>
> > > > 1) About two months ago, Dave Hansen observed a strange OOM issue: he
> > > > has a big machine and none of its processes are particularly big, so
> > > > eventually every process got oom-score=0 and the oom-killer didn't work.
> > > >
> > > > https://kerneltrap.org/mailarchive/linux-driver-devel/2010/9/9/6886383
> > > >
> > > > DavidR changed the oom-score to +1 in that situation.
> > > >
> > > > http://kerneltrap.org/mailarchive/linux-kernel/2010/9/9/4617455
> > > >
> > > > But that is completely bogus. If all processes have score=1, the oom-killer
> > > > falls back to being a purely random killer. I predicted and explained this
> > > > problem with his patch half a year ago, but he still hasn't fixed it.
> > > >
> > >
> > > The resolution with which the oom killer considers memory is at 0.1% of
> > > system RAM at its highest (smaller when you have a memory controller,
> > > cpuset, or mempolicy-constrained oom). It considers a task within 0.1% of
> > > memory of another task to have equal "badness" to kill; we don't break
> > > ties within that resolution -- it all depends on which one shows up in
> > > the tasklist first. If you disagree with that resolution, which I support
> > > as being high enough, then you may certainly propose a patch to make it
> > > even finer at 0.01%, 0.001%, etc. It would only change oom_badness() to
> > > range between [0,10000], [0,100000], etc.
> >
> > No.
> > Think of Moore's Law. A ratio-based (proportional) value will not be able to
> > work in the future anyway. 10 years ago I used a desktop machine with 20MB of
> > memory, and now I'm using 2GB. The amount of memory keeps growing and growing,
> > and the size of bash isn't growing nearly as fast.
> >
>
> If you'd like to suggest an increase to the upper-bound of the badness
> score, please do so, although I don't think we need to break ties amongst
> tasks that differ by less than 0.1% of the system's capacity.

No. I dislike it. I dislike the proportional score.

2010-11-30 20:02:47

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] Revert oom rewrite series

On Tue, 30 Nov 2010, KOSAKI Motohiro wrote:

> > > > You may remember that the initial version of my rewrite replaced oom_adj
> > > > entirely with the new oom_score_adj semantics. Others suggested that it
> > > > be separated into a new tunable and the old tunable deprecated for a
> > > > lengthy period of time. I accepted that criticism and understood the
> > > > drawbacks of replacing the tunable immediately and followed those
> > > > suggestions. I disagree with you that the deprecation of oom_adj for a
> > > > period of two years is as dramatic as you imply and I disagree that users
> > > > are experiencing problems with the linear scale that it now operates on
> > > > versus the old exponential scale.
> > >
> > > Yes and no. People wanted them separated AND the old one left working.
> > >
> >
> > You're arguing on behalf of applications that don't exist.
>
> Why?
> You actually got the bug report.
>

There have never been any bug reports related to applications using
oom_score_adj being impacted by its linear mapping onto oom_adj's
exponential scale. That's because no users prior to the rewrite were
using oom_adj scores based on either the expected memory usage of the
application or the capacity of the machine.
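
To make the linear-versus-exponential point concrete, the scaling that the
reverted fs/proc/base.c performed can be tabulated with a few lines of
userspace code. The macros are copied from the include/linux/oom.h hunks
above; the program is only a sketch:

#include <stdio.h>

#define OOM_DISABLE             (-17)
#define OOM_ADJUST_MIN          (-16)
#define OOM_ADJUST_MAX          15
#define OOM_SCORE_ADJ_MAX       1000

int main(void)
{
        int oom_adj;

        for (oom_adj = OOM_ADJUST_MIN; oom_adj <= OOM_ADJUST_MAX; oom_adj++) {
                /* OOM_ADJUST_MAX is special-cased so the maximum stays reachable. */
                int score_adj = (oom_adj == OOM_ADJUST_MAX) ?
                        OOM_SCORE_ADJ_MAX :
                        oom_adj * OOM_SCORE_ADJ_MAX / -OOM_DISABLE;

                printf("oom_adj=%3d  ->  oom_score_adj=%5d\n", oom_adj, score_adj);
        }
        return 0;
}

Under the restored heuristic a one-step change in oom_adj doubles or halves
the score (it is a bit shift), whereas the mapping above moves oom_score_adj
by roughly 59 points on a linear scale -- which is exactly the compatibility
question the thread is arguing about.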