LinuxLists.cc - [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

2011-05-20 08:00:36

by KOSAKI Motohiro

[permalink] [raw]

Subject: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

CAI Qian reported current oom logic doesn't work at all on his 16GB RAM
machine. oom killer killed all system daemon at first and his system
stopped responding.

The brief log is below.

> Out of memory: Kill process 1175 (dhclient) score 1 or sacrifice child
> Out of memory: Kill process 1247 (rsyslogd) score 1 or sacrifice child
> Out of memory: Kill process 1284 (irqbalance) score 1 or sacrifice child
> Out of memory: Kill process 1303 (rpcbind) score 1 or sacrifice child
> Out of memory: Kill process 1321 (rpc.statd) score 1 or sacrifice child
> Out of memory: Kill process 1333 (mdadm) score 1 or sacrifice child
> Out of memory: Kill process 1365 (rpc.idmapd) score 1 or sacrifice child
> Out of memory: Kill process 1403 (dbus-daemon) score 1 or sacrifice child
> Out of memory: Kill process 1438 (acpid) score 1 or sacrifice child
> Out of memory: Kill process 1447 (hald) score 1 or sacrifice child
> Out of memory: Kill process 1447 (hald) score 1 or sacrifice child
> Out of memory: Kill process 1487 (hald-addon-inpu) score 1 or sacrifice child
> Out of memory: Kill process 1488 (hald-addon-acpi) score 1 or sacrifice child
> Out of memory: Kill process 1507 (automount) score 1 or sacrifice child

The problems are three.

1) if two processes have the same oom score, we should kill younger process.
but current logic kill older. Typically oldest processes are system daemons.
2) Current logic use 'unsigned int' for internal score calculation. (exactly says,
it only use 0-1000 value). its very low precision calculation makes a lot of
same oom score and kill an ineligible process.
3) Current logic give 3% of SystemRAM to root processes. It obviously too big
if you have plenty memory. Now, your fork-bomb processes have 500MB OOM immune
bonus. then your fork-bomb never ever be killed.

KOSAKI Motohiro (5):
oom: improve dump_tasks() show items
oom: kill younger process first
oom: oom-killer don't use proportion of system-ram internally
oom: don't kill random process
oom: merge oom_kill_process() with oom_kill_task()

fs/proc/base.c | 13 ++-
include/linux/oom.h | 10 +--
include/linux/sched.h | 11 +++
mm/oom_kill.c | 201 +++++++++++++++++++++++++++----------------------
4 files changed, 135 insertions(+), 100 deletions(-)

--
1.7.3.1

2011-05-20 08:01:53

by KOSAKI Motohiro

[permalink] [raw]

Subject: [PATCH 1/5] oom: improve dump_tasks() show items

Recently, oom internal logic was dramatically changed. Thus
dump_tasks() doesn't show enough information for bug report
analysis. it has some meaningless items and don't have some
oom score related items.

This patch adapt displaying fields to new oom logic.

details
--------
removed: pid (we always kill process. don't need thread id),
signal->oom_adj (we no longer uses it internally)
cpu (we no longer uses it)
added: ppid (we often kill sacrifice child process)
swap (it's accounted)
modify: RSS (account mm->nr_ptes too)

<old>
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 3886] 0 3886 2893 441 1 0 0 bash
[ 3905] 0 3905 29361 25833 0 0 0 memtoy

<new>
[ pid] ppid uid total_vm rss swap score_adj name
[ 417] 1 0 3298 12 184 -1000 udevd
[ 830] 1 0 1776 11 16 0 system-setup-ke
[ 973] 1 0 61179 35 116 0 rsyslogd
[ 1733] 1732 0 1052337 958582 0 0 memtoy

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 15 +++++++++------
1 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f52e85c..43d32ae 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -355,7 +355,7 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
struct task_struct *p;
struct task_struct *task;

- pr_info("[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name\n");
+ pr_info("[ pid] ppid uid total_vm rss swap score_adj name\n");
for_each_process(p) {
if (oom_unkillable_task(p, mem, nodemask))
continue;
@@ -370,11 +370,14 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
continue;
}

- pr_info("[%5d] %5d %5d %8lu %8lu %3u %3d %5d %s\n",
- task->pid, task_uid(task), task->tgid,
- task->mm->total_vm, get_mm_rss(task->mm),
- task_cpu(task), task->signal->oom_adj,
- task->signal->oom_score_adj, task->comm);
+ pr_info("[%6d] %6d %5d %8lu %8lu %8lu %9d %s\n",
+ task_tgid_nr(task), task_tgid_nr(task->real_parent),
+ task_uid(task),
+ task->mm->total_vm,
+ get_mm_rss(task->mm) + p->mm->nr_ptes,
+ get_mm_counter(p->mm, MM_SWAPENTS),
+ task->signal->oom_score_adj,
+ task->comm);
task_unlock(task);
}
}
--
1.7.3.1

2011-05-20 08:02:30

by KOSAKI Motohiro

[permalink] [raw]

Subject: [PATCH 2/5] oom: kill younger process first

This patch introduces do_each_thread_reverse() and select_bad_process()
uses it. The benefits are two, 1) oom-killer can kill younger process
than older if they have a same oom score. Usually younger process is
less important. 2) younger task often have PF_EXITING because shell
script makes a lot of short lived processes. Reverse order search can
detect it faster.

Reported-by: CAI Qian <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/sched.h | 11 +++++++++++
mm/oom_kill.c | 2 +-
2 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 013314a..3698379 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2194,6 +2194,9 @@ static inline unsigned long wait_task_inactive(struct task_struct *p,
#define next_task(p) \
list_entry_rcu((p)->tasks.next, struct task_struct, tasks)

+#define prev_task(p) \
+ list_entry((p)->tasks.prev, struct task_struct, tasks)
+
#define for_each_process(p) \
for (p = &init_task ; (p = next_task(p)) != &init_task ; )

@@ -2206,6 +2209,14 @@ extern bool current_is_single_threaded(void);
#define do_each_thread(g, t) \
for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do

+/*
+ * Similar to do_each_thread(). but two difference are there.
+ * - traverse tasks reverse order (i.e. younger to older)
+ * - caller must hold tasklist_lock. rcu_read_lock isn't enough
+*/
+#define do_each_thread_reverse(g, t) \
+ for (g = t = &init_task ; (g = t = prev_task(g)) != &init_task ; ) do
+
#define while_each_thread(g, t) \
while ((t = next_thread(t)) != g)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 43d32ae..e6a6c6f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -282,7 +282,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
struct task_struct *chosen = NULL;
*ppoints = 0;

- do_each_thread(g, p) {
+ do_each_thread_reverse(g, p) {
unsigned int points;

if (!p->mm)
--
1.7.3.1

2011-05-20 08:03:42

by KOSAKI Motohiro

[permalink] [raw]

Subject: [PATCH 3/5] oom: oom-killer don't use proportion of system-ram internally

CAI Qian reported his kernel did hang-up if he ran fork intensive
workload and then invoke oom-killer.

The problem is, current oom calculation uses 0-1000 normalized value
(The unit is a permillage of system-ram). Its low precision make
a lot of same oom score. IOW, in his case, all processes have smaller
oom score than 1 and internal calculation round it to 1.

Thus oom-killer kill ineligible process. This regression is caused by
commit a63d83f427 (oom: badness heuristic rewrite).

The solution is, the internal calculation just use number of pages
instead of permillage of system-ram. And convert it to permillage
value at displaying time.

This patch doesn't change any ABI (included /proc/<pid>/oom_score_adj)
even though current logic has a lot of my dislike thing.

Reported-by: CAI Qian <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
fs/proc/base.c | 13 ++++++----
include/linux/oom.h | 7 +----
mm/oom_kill.c | 60 +++++++++++++++++++++++++++++++++-----------------
3 files changed, 49 insertions(+), 31 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index dfa5327..d6b0424 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -476,14 +476,17 @@ static const struct file_operations proc_lstats_operations = {

static int proc_oom_score(struct task_struct *task, char *buffer)
{
- unsigned long points = 0;
+ unsigned long points;
+ unsigned long ratio = 0;
+ unsigned long totalpages = totalram_pages + total_swap_pages + 1;

read_lock(&tasklist_lock);
- if (pid_alive(task))
- points = oom_badness(task, NULL, NULL,
- totalram_pages + total_swap_pages);
+ if (pid_alive(task)) {
+ points = oom_badness(task, NULL, NULL, totalpages);
+ ratio = points * 1000 / totalpages;
+ }
read_unlock(&tasklist_lock);
- return sprintf(buffer, "%lu\n", points);
+ return sprintf(buffer, "%lu\n", ratio);
}

struct limit_names {
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5e3aa83..0f5b588 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -40,7 +40,8 @@ enum oom_constraint {
CONSTRAINT_MEMCG,
};

-extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
+/* The badness from the OOM killer */
+extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
const nodemask_t *nodemask, unsigned long totalpages);
extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
@@ -62,10 +63,6 @@ static inline void oom_killer_enable(void)
oom_killer_disabled = false;
}

-/* The badness from the OOM killer */
-extern unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long uptime);
-
extern struct task_struct *find_lock_task_mm(struct task_struct *p);

/* sysctls */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e6a6c6f..8bbc3df 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -132,10 +132,12 @@ static bool oom_unkillable_task(struct task_struct *p,
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom failures.
*/
-unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
+unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
const nodemask_t *nodemask, unsigned long totalpages)
{
- int points;
+ unsigned long points;
+ unsigned long score_adj = 0;
+

if (oom_unkillable_task(p, mem, nodemask))
return 0;
@@ -160,7 +162,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
*/
if (p->flags & PF_OOM_ORIGIN) {
task_unlock(p);
- return 1000;
+ return ULONG_MAX;
}

/*
@@ -176,33 +178,49 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
*/
points = get_mm_rss(p->mm) + p->mm->nr_ptes;
points += get_mm_counter(p->mm, MM_SWAPENTS);
-
- points *= 1000;
- points /= totalpages;
task_unlock(p);

/*
* Root processes get 3% bonus, just like the __vm_enough_memory()
* implementation used by LSMs.
+ *
+ * XXX: Too large bonus, example, if the system have tera-bytes memory..
*/
- if (has_capability_noaudit(p, CAP_SYS_ADMIN))
- points -= 30;
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN)) {
+ if (points >= totalpages / 32)
+ points -= totalpages / 32;
+ else
+ points = 0;
+ }

/*
* /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
* either completely disable oom killing or always prefer a certain
* task.
*/
- points += p->signal->oom_score_adj;
+ if (p->signal->oom_score_adj >= 0) {
+ score_adj = p->signal->oom_score_adj * (totalpages / 1000);
+ if (ULONG_MAX - points >= score_adj)
+ points += score_adj;
+ else
+ points = ULONG_MAX;
+ } else {
+ score_adj = -p->signal->oom_score_adj * (totalpages / 1000);
+ if (points >= score_adj)
+ points -= score_adj;
+ else
+ points = 0;
+ }

/*
* Never return 0 for an eligible task that may be killed since it's
* possible that no single user task uses more than 0.1% of memory and
* no single admin tasks uses more than 3.0%.
*/
- if (points <= 0)
- return 1;
- return (points < 1000) ? points : 1000;
+ if (!points)
+ points = 1;
+
+ return points;
}

/*
@@ -274,7 +292,7 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
*
* (not docbooked, we don't want this one cluttering up the manual)
*/
-static struct task_struct *select_bad_process(unsigned int *ppoints,
+static struct task_struct *select_bad_process(unsigned long *ppoints,
unsigned long totalpages, struct mem_cgroup *mem,
const nodemask_t *nodemask)
{
@@ -283,7 +301,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
*ppoints = 0;

do_each_thread_reverse(g, p) {
- unsigned int points;
+ unsigned long points;

if (!p->mm)
continue;
@@ -314,7 +332,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
*/
if (p == current) {
chosen = p;
- *ppoints = 1000;
+ *ppoints = ULONG_MAX;
} else {
/*
* If this task is not being ptraced on exit,
@@ -445,14 +463,14 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
#undef K

static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
- unsigned int points, unsigned long totalpages,
+ unsigned long points, unsigned long totalpages,
struct mem_cgroup *mem, nodemask_t *nodemask,
const char *message)
{
struct task_struct *victim = p;
struct task_struct *child;
struct task_struct *t = p;
- unsigned int victim_points = 0;
+ unsigned long victim_points = 0;

if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem, nodemask);
@@ -467,7 +485,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
}

task_lock(p);
- pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+ pr_err("%s: Kill process %d (%s) points %lu or sacrifice child\n",
message, task_pid_nr(p), p->comm, points);
task_unlock(p);

@@ -479,7 +497,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
*/
do {
list_for_each_entry(child, &t->children, sibling) {
- unsigned int child_points;
+ unsigned long child_points;

if (child->mm == p->mm)
continue;
@@ -526,7 +544,7 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
{
unsigned long limit;
- unsigned int points = 0;
+ unsigned long points = 0;
struct task_struct *p;

/*
@@ -675,7 +693,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
struct task_struct *p;
unsigned long totalpages;
unsigned long freed = 0;
- unsigned int points;
+ unsigned long points;
enum oom_constraint constraint = CONSTRAINT_NONE;
int killed = 0;

--
1.7.3.1

2011-05-20 08:04:30

by KOSAKI Motohiro

[permalink] [raw]

Subject: [PATCH 4/5] oom: don't kill random process

CAI Qian reported oom-killer killed all system daemons in his
system at first if he ran fork bomb as root. The problem is,
current logic give them bonus of 3% of system ram. Example,
he has 16GB machine, then root processes have ~500MB oom
immune. It bring us crazy bad result. _all_ processes have
oom-score=1 and then, oom killer ignore process memory usage
and kill random process. This regression is caused by commit
a63d83f427 (oom: badness heuristic rewrite).

This patch changes select_bad_process() slightly. If oom points == 1,
it's a sign that the system have only root privileged processes or
similar. Thus, select_bad_process() calculate oom badness without
root bonus and select eligible process.

Also, this patch move finding sacrifice child logic into
select_bad_process(). It's necessary to implement adequate
no root bonus recalculation. and it makes good side effect,
current logic doesn't behave as the doc.

Documentation/sysctl/vm.txt says

oom_kill_allocating_task

If this is set to non-zero, the OOM killer simply kills the task that
triggered the out-of-memory condition. This avoids the expensive
tasklist scan.

IOW, oom_kill_allocating_task shouldn't search sacrifice child.
This patch also fixes this issue.

Reported-by: CAI Qian <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
fs/proc/base.c | 2 +-
include/linux/oom.h | 3 +-
mm/oom_kill.c | 89 ++++++++++++++++++++++++++++----------------------
3 files changed, 53 insertions(+), 41 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index d6b0424..b608b69 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -482,7 +482,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer)

read_lock(&tasklist_lock);
if (pid_alive(task)) {
- points = oom_badness(task, NULL, NULL, totalpages);
+ points = oom_badness(task, NULL, NULL, totalpages, 1);
ratio = points * 1000 / totalpages;
}
read_unlock(&tasklist_lock);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 0f5b588..3dd3669 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -42,7 +42,8 @@ enum oom_constraint {

/* The badness from the OOM killer */
extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long totalpages);
+ const nodemask_t *nodemask, unsigned long totalpages,
+ int protect_root);
extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8bbc3df..7d280d4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -133,7 +133,8 @@ static bool oom_unkillable_task(struct task_struct *p,
* task consuming the most memory to avoid subsequent oom failures.
*/
unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long totalpages)
+ const nodemask_t *nodemask, unsigned long totalpages,
+ int protect_root)
{
unsigned long points;
unsigned long score_adj = 0;
@@ -186,7 +187,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
*
* XXX: Too large bonus, example, if the system have tera-bytes memory..
*/
- if (has_capability_noaudit(p, CAP_SYS_ADMIN)) {
+ if (protect_root && has_capability_noaudit(p, CAP_SYS_ADMIN)) {
if (points >= totalpages / 32)
points -= totalpages / 32;
else
@@ -298,8 +299,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
{
struct task_struct *g, *p;
struct task_struct *chosen = NULL;
- *ppoints = 0;
+ int protect_root = 1;
+ unsigned long chosen_points = 0;
+ struct task_struct *child;

+ retry:
do_each_thread_reverse(g, p) {
unsigned long points;

@@ -332,7 +336,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
*/
if (p == current) {
chosen = p;
- *ppoints = ULONG_MAX;
+ chosen_points = ULONG_MAX;
} else {
/*
* If this task is not being ptraced on exit,
@@ -345,13 +349,49 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
}
}

- points = oom_badness(p, mem, nodemask, totalpages);
- if (points > *ppoints) {
+ points = oom_badness(p, mem, nodemask, totalpages, protect_root);
+ if (points > chosen_points) {
chosen = p;
- *ppoints = points;
+ chosen_points = points;
}
} while_each_thread(g, p);

+ /*
+ * chosen_point==1 may be a sign that root privilege bonus is too large
+ * and we choose wrong task. Let's recalculate oom score without the
+ * dubious bonus.
+ */
+ if (protect_root && (chosen_points == 1)) {
+ protect_root = 0;
+ goto retry;
+ }
+
+ /*
+ * If any of p's children has a different mm and is eligible for kill,
+ * the one with the highest badness() score is sacrificed for its
+ * parent. This attempts to lose the minimal amount of work done while
+ * still freeing memory.
+ */
+ g = p = chosen;
+ do {
+ list_for_each_entry(child, &p->children, sibling) {
+ unsigned long child_points;
+
+ if (child->mm == p->mm)
+ continue;
+ /*
+ * oom_badness() returns 0 if the thread is unkillable
+ */
+ child_points = oom_badness(child, mem, nodemask,
+ totalpages, protect_root);
+ if (child_points > chosen_points) {
+ chosen = child;
+ chosen_points = child_points;
+ }
+ }
+ } while_each_thread(g, p);
+
+ *ppoints = chosen_points;
return chosen;
}

@@ -467,11 +507,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
struct mem_cgroup *mem, nodemask_t *nodemask,
const char *message)
{
- struct task_struct *victim = p;
- struct task_struct *child;
- struct task_struct *t = p;
- unsigned long victim_points = 0;
-
if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem, nodemask);

@@ -485,35 +520,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
}

task_lock(p);
- pr_err("%s: Kill process %d (%s) points %lu or sacrifice child\n",
- message, task_pid_nr(p), p->comm, points);
+ pr_err("%s: Kill process %d (%s) points %lu\n",
+ message, task_pid_nr(p), p->comm, points);
task_unlock(p);

- /*
- * If any of p's children has a different mm and is eligible for kill,
- * the one with the highest badness() score is sacrificed for its
- * parent. This attempts to lose the minimal amount of work done while
- * still freeing memory.
- */
- do {
- list_for_each_entry(child, &t->children, sibling) {
- unsigned long child_points;
-
- if (child->mm == p->mm)
- continue;
- /*
- * oom_badness() returns 0 if the thread is unkillable
- */
- child_points = oom_badness(child, mem, nodemask,
- totalpages);
- if (child_points > victim_points) {
- victim = child;
- victim_points = child_points;
- }
- }
- } while_each_thread(p, t);
-
- return oom_kill_task(victim, mem);
+ return oom_kill_task(p, mem);
}

/*
--
1.7.3.1

2011-05-20 08:05:19

by KOSAKI Motohiro

[permalink] [raw]

Subject: [PATCH 5/5] oom: merge oom_kill_process() with oom_kill_task()

Now, oom_kill_process() become almost empty function. Let's
merge it with oom_kill_task().

Also, this patch replace task_pid_nr() with task_tgid_nr().
Because 1) oom killer kill a process, not thread. 2) a userland
don't care thread id.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 53 ++++++++++++++++++++++-------------------------------
1 files changed, 22 insertions(+), 31 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7d280d4..ec075cc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -458,11 +458,26 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
}

#define K(x) ((x) << (PAGE_SHIFT-10))
-static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
+static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
+ unsigned long points, unsigned long totalpages,
+ struct mem_cgroup *mem, nodemask_t *nodemask,
+ const char *message)
{
struct task_struct *q;
struct mm_struct *mm;

+ if (printk_ratelimit())
+ dump_header(p, gfp_mask, order, mem, nodemask);
+
+ /*
+ * If the task is already exiting, don't alarm the sysadmin or kill
+ * its children or threads, just set TIF_MEMDIE so it can die quickly
+ */
+ if (p->flags & PF_EXITING) {
+ set_tsk_thread_flag(p, TIF_MEMDIE);
+ return 0;
+ }
+
p = find_lock_task_mm(p);
if (!p)
return 1;
@@ -470,10 +485,11 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
/* mm cannot be safely dereferenced after task_unlock(p) */
mm = p->mm;

- pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
- task_pid_nr(p), p->comm, K(p->mm->total_vm),
- K(get_mm_counter(p->mm, MM_ANONPAGES)),
- K(get_mm_counter(p->mm, MM_FILEPAGES)));
+ pr_err("%s: Kill process %d (%s) points:%lu total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
+ message, task_tgid_nr(p), p->comm, points,
+ K(p->mm->total_vm),
+ K(get_mm_counter(p->mm, MM_ANONPAGES)),
+ K(get_mm_counter(p->mm, MM_FILEPAGES)));
task_unlock(p);

/*
@@ -490,7 +506,7 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
if (q->mm == mm && !same_thread_group(q, p)) {
task_lock(q); /* Protect ->comm from prctl() */
pr_err("Kill process %d (%s) sharing same memory\n",
- task_pid_nr(q), q->comm);
+ task_tgid_nr(q), q->comm);
task_unlock(q);
force_sig(SIGKILL, q);
}
@@ -502,31 +518,6 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
}
#undef K

-static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
- unsigned long points, unsigned long totalpages,
- struct mem_cgroup *mem, nodemask_t *nodemask,
- const char *message)
-{
- if (printk_ratelimit())
- dump_header(p, gfp_mask, order, mem, nodemask);
-
- /*
- * If the task is already exiting, don't alarm the sysadmin or kill
- * its children or threads, just set TIF_MEMDIE so it can die quickly
- */
- if (p->flags & PF_EXITING) {
- set_tsk_thread_flag(p, TIF_MEMDIE);
- return 0;
- }
-
- task_lock(p);
- pr_err("%s: Kill process %d (%s) points %lu\n",
- message, task_pid_nr(p), p->comm, points);
- task_unlock(p);
-
- return oom_kill_task(p, mem);
-}
-
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
--
1.7.3.1

2011-05-23 02:37:03

[permalink] [raw]

Subject: Re: [PATCH 2/5] oom: kill younger process first

2011/5/20 KOSAKI Motohiro <[email protected]>:
> This patch introduces do_each_thread_reverse() and select_bad_process()
> uses it. The benefits are two, 1) oom-killer can kill younger process
> than older if they have a same oom score. Usually younger process is
> less important. 2) younger task often have PF_EXITING because shell
> script makes a lot of short lived processes. Reverse order search can
> detect it faster.
>
> Reported-by: CAI Qian <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2011-05-23 03:59:39

[permalink] [raw]

Subject: Re: [PATCH 3/5] oom: oom-killer don't use proportion of system-ram internally

2011/5/20 KOSAKI Motohiro <[email protected]>:
> CAI Qian reported his kernel did hang-up if he ran fork intensive
> workload and then invoke oom-killer.
>
> The problem is, current oom calculation uses 0-1000 normalized value
> (The unit is a permillage of system-ram). Its low precision make
> a lot of same oom score. IOW, in his case, all processes have smaller
> oom score than 1 and internal calculation round it to 1.
>
> Thus oom-killer kill ineligible process. This regression is caused by
> commit a63d83f427 (oom: badness heuristic rewrite).
>
> The solution is, the internal calculation just use number of pages
> instead of permillage of system-ram. And convert it to permillage
> value at displaying time.
>
> This patch doesn't change any ABI (included /proc/<pid>/oom_score_adj)
> even though current logic has a lot of my dislike thing.
>
> Reported-by: CAI Qian <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> fs/proc/base.c | 13 ++++++----
> include/linux/oom.h | 7 +----
> mm/oom_kill.c | 60 +++++++++++++++++++++++++++++++++-----------------
> 3 files changed, 49 insertions(+), 31 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index dfa5327..d6b0424 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -476,14 +476,17 @@ static const struct file_operations proc_lstats_operations = {
>
> static int proc_oom_score(struct task_struct *task, char *buffer)
> {
> - unsigned long points = 0;
> + unsigned long points;
> + unsigned long ratio = 0;
> + unsigned long totalpages = totalram_pages + total_swap_pages + 1;

Does we need +1?
oom_badness does have the check.

>
> read_lock(&tasklist_lock);
> - if (pid_alive(task))
> - points = oom_badness(task, NULL, NULL,
> - totalram_pages + total_swap_pages);
> + if (pid_alive(task)) {
> + points = oom_badness(task, NULL, NULL, totalpages);
> + ratio = points * 1000 / totalpages;
> + }
> read_unlock(&tasklist_lock);
> - return sprintf(buffer, "%lu\n", points);
> + return sprintf(buffer, "%lu\n", ratio);
> }
>
> struct limit_names {
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 5e3aa83..0f5b588 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -40,7 +40,8 @@ enum oom_constraint {
> CONSTRAINT_MEMCG,
> };
>
> -extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> +/* The badness from the OOM killer */
> +extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> const nodemask_t *nodemask, unsigned long totalpages);
> extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
> extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
> @@ -62,10 +63,6 @@ static inline void oom_killer_enable(void)
> oom_killer_disabled = false;
> }
>
> -/* The badness from the OOM killer */
> -extern unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
> - const nodemask_t *nodemask, unsigned long uptime);
> -
> extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>
> /* sysctls */
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index e6a6c6f..8bbc3df 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -132,10 +132,12 @@ static bool oom_unkillable_task(struct task_struct *p,
> * predictable as possible. The goal is to return the highest value for the
> * task consuming the most memory to avoid subsequent oom failures.
> */
> -unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> +unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> const nodemask_t *nodemask, unsigned long totalpages)
> {
> - int points;
> + unsigned long points;
> + unsigned long score_adj = 0;
> +
>
> if (oom_unkillable_task(p, mem, nodemask))
> return 0;
> @@ -160,7 +162,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> */
> if (p->flags & PF_OOM_ORIGIN) {
> task_unlock(p);
> - return 1000;
> + return ULONG_MAX;
> }
>
> /*
> @@ -176,33 +178,49 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> */
> points = get_mm_rss(p->mm) + p->mm->nr_ptes;
> points += get_mm_counter(p->mm, MM_SWAPENTS);
> -
> - points *= 1000;
> - points /= totalpages;
> task_unlock(p);
>
> /*
> * Root processes get 3% bonus, just like the __vm_enough_memory()
> * implementation used by LSMs.
> + *
> + * XXX: Too large bonus, example, if the system have tera-bytes memory..
> */
> - if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> - points -= 30;
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN)) {
> + if (points >= totalpages / 32)
> + points -= totalpages / 32;
> + else
> + points = 0;

Odd. Why do we initialize points with 0?

I think the idea is good.

--
Kind regards,
Minchan Kim

2011-05-23 04:02:35

[permalink] [raw]

Subject: Re: [PATCH 3/5] oom: oom-killer don't use proportion of system-ram internally

2011/5/20 KOSAKI Motohiro <[email protected]>:
> CAI Qian reported his kernel did hang-up if he ran fork intensive
> workload and then invoke oom-killer.
>
> The problem is, current oom calculation uses 0-1000 normalized value
> (The unit is a permillage of system-ram). Its low precision make
> a lot of same oom score. IOW, in his case, all processes have smaller
> oom score than 1 and internal calculation round it to 1.
>
> Thus oom-killer kill ineligible process. This regression is caused by
> commit a63d83f427 (oom: badness heuristic rewrite).
>
> The solution is, the internal calculation just use number of pages
> instead of permillage of system-ram. And convert it to permillage
> value at displaying time.
>
> This patch doesn't change any ABI (included /proc/<pid>/oom_score_adj)
> even though current logic has a lot of my dislike thing.
>
> Reported-by: CAI Qian <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> fs/proc/base.c | 13 ++++++----
> include/linux/oom.h | 7 +----
> mm/oom_kill.c | 60 +++++++++++++++++++++++++++++++++-----------------
> 3 files changed, 49 insertions(+), 31 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index dfa5327..d6b0424 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -476,14 +476,17 @@ static const struct file_operations proc_lstats_operations = {
>
> static int proc_oom_score(struct task_struct *task, char *buffer)
> {
> - unsigned long points = 0;
> + unsigned long points;
> + unsigned long ratio = 0;
> + unsigned long totalpages = totalram_pages + total_swap_pages + 1;
>
> read_lock(&tasklist_lock);
> - if (pid_alive(task))
> - points = oom_badness(task, NULL, NULL,
> - totalram_pages + total_swap_pages);
> + if (pid_alive(task)) {
> + points = oom_badness(task, NULL, NULL, totalpages);
> + ratio = points * 1000 / totalpages;
> + }
> read_unlock(&tasklist_lock);
> - return sprintf(buffer, "%lu\n", points);
> + return sprintf(buffer, "%lu\n", ratio);
> }
>
> struct limit_names {
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 5e3aa83..0f5b588 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -40,7 +40,8 @@ enum oom_constraint {
> CONSTRAINT_MEMCG,
> };
>
> -extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> +/* The badness from the OOM killer */
> +extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> const nodemask_t *nodemask, unsigned long totalpages);
> extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
> extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
> @@ -62,10 +63,6 @@ static inline void oom_killer_enable(void)
> oom_killer_disabled = false;
> }
>
> -/* The badness from the OOM killer */
> -extern unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
> - const nodemask_t *nodemask, unsigned long uptime);
> -
> extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>
> /* sysctls */
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index e6a6c6f..8bbc3df 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -132,10 +132,12 @@ static bool oom_unkillable_task(struct task_struct *p,
> * predictable as possible. The goal is to return the highest value for the
> * task consuming the most memory to avoid subsequent oom failures.
> */
> -unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> +unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> const nodemask_t *nodemask, unsigned long totalpages)
> {
> - int points;
> + unsigned long points;
> + unsigned long score_adj = 0;
> +
>
> if (oom_unkillable_task(p, mem, nodemask))
> return 0;
> @@ -160,7 +162,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> */
> if (p->flags & PF_OOM_ORIGIN) {
> task_unlock(p);
> - return 1000;
> + return ULONG_MAX;
> }
>
> /*
> @@ -176,33 +178,49 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> */
> points = get_mm_rss(p->mm) + p->mm->nr_ptes;
> points += get_mm_counter(p->mm, MM_SWAPENTS);
> -
> - points *= 1000;
> - points /= totalpages;
> task_unlock(p);
>
> /*
> * Root processes get 3% bonus, just like the __vm_enough_memory()
> * implementation used by LSMs.
> + *
> + * XXX: Too large bonus, example, if the system have tera-bytes memory..
> */

Nitpick. I have no opposition about adding this comment.
But strictly speaking, the comment isn't related to this patch.
No biggie and it's up to you. :)

--
Kind regards,
Minchan Kim

2011-05-23 04:32:01

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

2011/5/20 KOSAKI Motohiro <[email protected]>:
> CAI Qian reported oom-killer killed all system daemons in his
> system at first if he ran fork bomb as root. The problem is,
> current logic give them bonus of 3% of system ram. Example,
> he has 16GB machine, then root processes have ~500MB oom
> immune. It bring us crazy bad result. _all_ processes have
> oom-score=1 and then, oom killer ignore process memory usage
> and kill random process. This regression is caused by commit
> a63d83f427 (oom: badness heuristic rewrite).
>
> This patch changes select_bad_process() slightly. If oom points == 1,
> it's a sign that the system have only root privileged processes or
> similar. Thus, select_bad_process() calculate oom badness without
> root bonus and select eligible process.
>
> Also, this patch move finding sacrifice child logic into
> select_bad_process(). It's necessary to implement adequate
> no root bonus recalculation. and it makes good side effect,
> current logic doesn't behave as the doc.
>
> Documentation/sysctl/vm.txt says
>
> oom_kill_allocating_task
>
> If this is set to non-zero, the OOM killer simply kills the task that
> triggered the out-of-memory condition. This avoids the expensive
> tasklist scan.
>
> IOW, oom_kill_allocating_task shouldn't search sacrifice child.
> This patch also fixes this issue.
>
> Reported-by: CAI Qian <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> fs/proc/base.c | 2 +-
> include/linux/oom.h | 3 +-
> mm/oom_kill.c | 89 ++++++++++++++++++++++++++++----------------------
> 3 files changed, 53 insertions(+), 41 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index d6b0424..b608b69 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -482,7 +482,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer)
>
> read_lock(&tasklist_lock);
> if (pid_alive(task)) {
> - points = oom_badness(task, NULL, NULL, totalpages);
> + points = oom_badness(task, NULL, NULL, totalpages, 1);
> ratio = points * 1000 / totalpages;
> }
> read_unlock(&tasklist_lock);
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 0f5b588..3dd3669 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -42,7 +42,8 @@ enum oom_constraint {
>
> /* The badness from the OOM killer */
> extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> - const nodemask_t *nodemask, unsigned long totalpages);
> + const nodemask_t *nodemask, unsigned long totalpages,
> + int protect_root);
> extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
> extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8bbc3df..7d280d4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -133,7 +133,8 @@ static bool oom_unkillable_task(struct task_struct *p,
> * task consuming the most memory to avoid subsequent oom failures.
> */
> unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> - const nodemask_t *nodemask, unsigned long totalpages)
> + const nodemask_t *nodemask, unsigned long totalpages,
> + int protect_root)
> {
> unsigned long points;
> unsigned long score_adj = 0;
> @@ -186,7 +187,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> *
> * XXX: Too large bonus, example, if the system have tera-bytes memory..
> */
> - if (has_capability_noaudit(p, CAP_SYS_ADMIN)) {
> + if (protect_root && has_capability_noaudit(p, CAP_SYS_ADMIN)) {
> if (points >= totalpages / 32)
> points -= totalpages / 32;
> else
> @@ -298,8 +299,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> {
> struct task_struct *g, *p;
> struct task_struct *chosen = NULL;
> - *ppoints = 0;
> + int protect_root = 1;
> + unsigned long chosen_points = 0;
> + struct task_struct *child;
>
> + retry:
> do_each_thread_reverse(g, p) {
> unsigned long points;
>
> @@ -332,7 +336,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> */
> if (p == current) {
> chosen = p;
> - *ppoints = ULONG_MAX;
> + chosen_points = ULONG_MAX;
> } else {
> /*
> * If this task is not being ptraced on exit,
> @@ -345,13 +349,49 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> }
> }
>
> - points = oom_badness(p, mem, nodemask, totalpages);
> - if (points > *ppoints) {
> + points = oom_badness(p, mem, nodemask, totalpages, protect_root);
> + if (points > chosen_points) {
> chosen = p;
> - *ppoints = points;
> + chosen_points = points;
> }
> } while_each_thread(g, p);
>
> + /*
> + * chosen_point==1 may be a sign that root privilege bonus is too large
> + * and we choose wrong task. Let's recalculate oom score without the
> + * dubious bonus.
> + */
> + if (protect_root && (chosen_points == 1)) {
> + protect_root = 0;
> + goto retry;
> + }

The idea is good to me.
But once we meet it, should we give up protecting root privileged processes?
How about decaying bonus point?

--
Kind regards,
Minchan Kim

2011-05-23 22:16:25

by David Rientjes

[permalink] [raw]

Subject: Re: [PATCH 1/5] oom: improve dump_tasks() show items

On Fri, 20 May 2011, KOSAKI Motohiro wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f52e85c..43d32ae 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -355,7 +355,7 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
> struct task_struct *p;
> struct task_struct *task;
>
> - pr_info("[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name\n");
> + pr_info("[ pid] ppid uid total_vm rss swap score_adj name\n");
> for_each_process(p) {
> if (oom_unkillable_task(p, mem, nodemask))
> continue;
> @@ -370,11 +370,14 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
> continue;
> }
>
> - pr_info("[%5d] %5d %5d %8lu %8lu %3u %3d %5d %s\n",
> - task->pid, task_uid(task), task->tgid,
> - task->mm->total_vm, get_mm_rss(task->mm),
> - task_cpu(task), task->signal->oom_adj,
> - task->signal->oom_score_adj, task->comm);
> + pr_info("[%6d] %6d %5d %8lu %8lu %8lu %9d %s\n",
> + task_tgid_nr(task), task_tgid_nr(task->real_parent),
> + task_uid(task),
> + task->mm->total_vm,
> + get_mm_rss(task->mm) + p->mm->nr_ptes,
> + get_mm_counter(p->mm, MM_SWAPENTS),
> + task->signal->oom_score_adj,
> + task->comm);
> task_unlock(task);
> }
> }

Looks good, with the exception that the "score_adj" header should remain
"oom_score_adj" since that is its name within procfs that changes the
tunable.

After that's fixed:

Acked-by: David Rientjes <[email protected]>

2011-05-23 22:21:14

by David Rientjes

[permalink] [raw]

Subject: Re: [PATCH 2/5] oom: kill younger process first

On Fri, 20 May 2011, KOSAKI Motohiro wrote:

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 013314a..3698379 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2194,6 +2194,9 @@ static inline unsigned long wait_task_inactive(struct task_struct *p,
> #define next_task(p) \
> list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
>
> +#define prev_task(p) \
> + list_entry((p)->tasks.prev, struct task_struct, tasks)
> +
> #define for_each_process(p) \
> for (p = &init_task ; (p = next_task(p)) != &init_task ; )
>
> @@ -2206,6 +2209,14 @@ extern bool current_is_single_threaded(void);
> #define do_each_thread(g, t) \
> for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do
>
> +/*
> + * Similar to do_each_thread(). but two difference are there.
> + * - traverse tasks reverse order (i.e. younger to older)
> + * - caller must hold tasklist_lock. rcu_read_lock isn't enough
> +*/
> +#define do_each_thread_reverse(g, t) \
> + for (g = t = &init_task ; (g = t = prev_task(g)) != &init_task ; ) do
> +
> #define while_each_thread(g, t) \
> while ((t = next_thread(t)) != g)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 43d32ae..e6a6c6f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -282,7 +282,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> struct task_struct *chosen = NULL;
> *ppoints = 0;
>
> - do_each_thread(g, p) {
> + do_each_thread_reverse(g, p) {
> unsigned int points;
>
> if (!p->mm)

Same response as when you initially proposed this patch: the comment needs
to explicitly state that it is not break-safe just like do_each_thread().
See http://marc.info/?l=linux-mm&m=130507027312785

A comment such as

/*
* Reverse of do_each_thread(); still not break-safe.
* Must hold tasklist_lock.
*/

would suffice. There are no "callers" to a macro.

After that:

Acked-by: David Rientjes <[email protected]>

2011-05-23 22:28:39

by David Rientjes

[permalink] [raw]

Subject: Re: [PATCH 3/5] oom: oom-killer don't use proportion of system-ram internally

On Fri, 20 May 2011, KOSAKI Motohiro wrote:

> CAI Qian reported his kernel did hang-up if he ran fork intensive
> workload and then invoke oom-killer.
>
> The problem is, current oom calculation uses 0-1000 normalized value
> (The unit is a permillage of system-ram). Its low precision make
> a lot of same oom score. IOW, in his case, all processes have smaller
> oom score than 1 and internal calculation round it to 1.
>
> Thus oom-killer kill ineligible process. This regression is caused by
> commit a63d83f427 (oom: badness heuristic rewrite).
>
> The solution is, the internal calculation just use number of pages
> instead of permillage of system-ram. And convert it to permillage
> value at displaying time.
>
> This patch doesn't change any ABI (included /proc/<pid>/oom_score_adj)
> even though current logic has a lot of my dislike thing.
>

Same response as when you initially proposed this patch:
http://marc.info/?l=linux-kernel&m=130507086613317 -- you never replied to
that.

The changelog doesn't accurately represent CAI Qian's problem; the issue
is that root processes are given too large of a bonus in comparison to
other threads that are using at most 1.9% of available memory. That can
be fixed, as I suggested by giving 1% bonus per 10% of memory used so that
the process would have to be using 10% before it even receives a bonus.

I already suggested an alternative patch to CAI Qian to greatly increase
the granularity of the oom score from a range of 0-1000 to 0-10000 to
differentiate between tasks within 0.01% of available memory (16MB on CAI
Qian's 16GB system). I'll propose this officially in a separate email.

This patch also includes undocumented changes such as changing the bonus
given to root processes.

2011-05-23 22:32:55

by David Rientjes

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

On Fri, 20 May 2011, KOSAKI Motohiro wrote:

> CAI Qian reported oom-killer killed all system daemons in his
> system at first if he ran fork bomb as root. The problem is,
> current logic give them bonus of 3% of system ram. Example,
> he has 16GB machine, then root processes have ~500MB oom
> immune. It bring us crazy bad result. _all_ processes have
> oom-score=1 and then, oom killer ignore process memory usage
> and kill random process. This regression is caused by commit
> a63d83f427 (oom: badness heuristic rewrite).
>
> This patch changes select_bad_process() slightly. If oom points == 1,
> it's a sign that the system have only root privileged processes or
> similar. Thus, select_bad_process() calculate oom badness without
> root bonus and select eligible process.
>

You said earlier that you thought it was a good idea to do a proportional
based bonus for root processes. Do you have a specific objection to
giving root processes a 1% bonus for every 10% of used memory instead?

> Also, this patch move finding sacrifice child logic into
> select_bad_process(). It's necessary to implement adequate
> no root bonus recalculation. and it makes good side effect,
> current logic doesn't behave as the doc.
>

This is unnecessary and just makes the oom killer egregiously long. We
are already diagnosing problems here at Google where the oom killer holds
tasklist_lock on the readside for far too long, causing other cpus waiting
for a write_lock_irq(&tasklist_lock) to encounter issues when irqs are
disabled and it is spinning. A second tasklist scan is simply a
non-starter.

[ This is also one of the reasons why we needed to introduce
mm->oom_disable_count to prevent a second, expensive tasklist scan. ]

> Documentation/sysctl/vm.txt says
>
> oom_kill_allocating_task
>
> If this is set to non-zero, the OOM killer simply kills the task that
> triggered the out-of-memory condition. This avoids the expensive
> tasklist scan.
>
> IOW, oom_kill_allocating_task shouldn't search sacrifice child.
> This patch also fixes this issue.
>

oom_kill_allocating_task was introduced for SGI to prevent the expensive
tasklist scan, the task that is actually allocating the memory isn't
actually interesting and is usually random. This should be turned into a
documentation fix rather than changing the implementation.

Thanks.

2011-05-23 22:48:51

by David Rientjes

[permalink] [raw]

Subject: Re: [PATCH 3/5] oom: oom-killer don't use proportion of system-ram internally

On Mon, 23 May 2011, David Rientjes wrote:

> I already suggested an alternative patch to CAI Qian to greatly increase
> the granularity of the oom score from a range of 0-1000 to 0-10000 to
> differentiate between tasks within 0.01% of available memory (16MB on CAI
> Qian's 16GB system). I'll propose this officially in a separate email.
>

This is an alternative patch as earlier proposed with suggested
improvements from Minchan. CAI, would it be possible to test this out on
your usecase?

I'm indifferent to the actual scale of OOM_SCORE_MAX_FACTOR; it could be
10 as proposed in this patch or even increased higher for higher
resolution.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -38,6 +38,9 @@ int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;
static DEFINE_SPINLOCK(zone_scan_lock);

+#define OOM_SCORE_MAX_FACTOR 10
+#define OOM_SCORE_MAX (OOM_SCORE_ADJ_MAX * OOM_SCORE_MAX_FACTOR)
+
#ifdef CONFIG_NUMA
/**
* has_intersects_mems_allowed() - check task eligiblity for kill
@@ -160,7 +163,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
*/
if (p->flags & PF_OOM_ORIGIN) {
task_unlock(p);
- return 1000;
+ return OOM_SCORE_MAX;
}

/*
@@ -177,32 +180,38 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
points = get_mm_rss(p->mm) + p->mm->nr_ptes;
points += get_mm_counter(p->mm, MM_SWAPENTS);

- points *= 1000;
+ points *= OOM_SCORE_MAX;
points /= totalpages;
task_unlock(p);

/*
- * Root processes get 3% bonus, just like the __vm_enough_memory()
- * implementation used by LSMs.
+ * Root processes get a bonus of 1% per 10% of memory used.
*/
- if (has_capability_noaudit(p, CAP_SYS_ADMIN))
- points -= 30;
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN)) {
+ int bonus;
+ int granularity;
+
+ bonus = OOM_SCORE_MAX / 100; /* bonus is 1% */
+ granularity = OOM_SCORE_MAX / 10; /* granularity is 10% */
+
+ points -= bonus * (points / granularity);
+ }

/*
* /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
* either completely disable oom killing or always prefer a certain
* task.
*/
- points += p->signal->oom_score_adj;
+ points += p->signal->oom_score_adj * OOM_SCORE_MAX_FACTOR;

/*
* Never return 0 for an eligible task that may be killed since it's
- * possible that no single user task uses more than 0.1% of memory and
+ * possible that no single user task uses more than 0.01% of memory and
* no single admin tasks uses more than 3.0%.
*/
if (points <= 0)
return 1;
- return (points < 1000) ? points : 1000;
+ return (points < OOM_SCORE_MAX) ? points : OOM_SCORE_MAX;
}

/*
@@ -314,7 +323,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
*/
if (p == current) {
chosen = p;
- *ppoints = 1000;
+ *ppoints = OOM_SCORE_MAX;
} else {
/*
* If this task is not being ptraced on exit,

2011-05-24 01:14:31

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH 3/5] oom: oom-killer don't use proportion of system-ram internally

Hi

>> @@ -476,14 +476,17 @@ static const struct file_operations proc_lstats_operations = {
>>
>> static int proc_oom_score(struct task_struct *task, char *buffer)
>> {
>> - unsigned long points = 0;
>> + unsigned long points;
>> + unsigned long ratio = 0;
>> + unsigned long totalpages = totalram_pages + total_swap_pages + 1;
>
> Does we need +1?
> oom_badness does have the check.

"ratio = points * 1000 / totalpages;" need to avoid zero divide.

>> /*
>> * Root processes get 3% bonus, just like the __vm_enough_memory()
>> * implementation used by LSMs.
>> + *
>> + * XXX: Too large bonus, example, if the system have tera-bytes memory..
>> */
>> - if (has_capability_noaudit(p, CAP_SYS_ADMIN))
>> - points -= 30;
>> + if (has_capability_noaudit(p, CAP_SYS_ADMIN)) {
>> + if (points>= totalpages / 32)
>> + points -= totalpages / 32;
>> + else
>> + points = 0;
>
> Odd. Why do we initialize points with 0?
>
> I think the idea is good.

The points is unsigned. It's common technique to avoid underflow.

2011-05-24 01:21:58

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH 3/5] oom: oom-killer don't use proportion of system-ram internally

(2011/05/24 7:48), David Rientjes wrote:
> On Mon, 23 May 2011, David Rientjes wrote:
>
>> I already suggested an alternative patch to CAI Qian to greatly increase
>> the granularity of the oom score from a range of 0-1000 to 0-10000 to
>> differentiate between tasks within 0.01% of available memory (16MB on CAI
>> Qian's 16GB system). I'll propose this officially in a separate email.
>>
>
> This is an alternative patch as earlier proposed with suggested
> improvements from Minchan. CAI, would it be possible to test this out on
> your usecase?
>
> I'm indifferent to the actual scale of OOM_SCORE_MAX_FACTOR; it could be
> 10 as proposed in this patch or even increased higher for higher
> resolution.

I did explain why your proposal is unacceptable.

http://www.gossamer-threads.com/lists/linux/kernel/1378837#1378837

2011-05-24 01:32:45

[permalink] [raw]

Subject: Re: [PATCH 3/5] oom: oom-killer don't use proportion of system-ram internally

On Tue, May 24, 2011 at 10:14 AM, KOSAKI Motohiro
<[email protected]> wrote:
> Hi
>
>
>>> @@ -476,14 +476,17 @@ static const struct file_operations
>>> proc_lstats_operations = {
>>>
>>> static int proc_oom_score(struct task_struct *task, char *buffer)
>>> {
>>> - unsigned long points = 0;
>>> + unsigned long points;
>>> + unsigned long ratio = 0;
>>> + unsigned long totalpages = totalram_pages + total_swap_pages + 1;
>>
>> Does we need +1?
>> oom_badness does have the check.
>
> "ratio = points * 1000 / totalpages;" need to avoid zero divide.
>
>>> /*
>>> * Root processes get 3% bonus, just like the __vm_enough_memory()
>>> * implementation used by LSMs.
>>> + *
>>> + * XXX: Too large bonus, example, if the system have tera-bytes
>>> memory..
>>> */
>>> - if (has_capability_noaudit(p, CAP_SYS_ADMIN))
>>> - points -= 30;
>>> + if (has_capability_noaudit(p, CAP_SYS_ADMIN)) {
>>> + if (points>= totalpages / 32)
>>> + points -= totalpages / 32;
>>> + else
>>> + points = 0;
>>
>> Odd. Why do we initialize points with 0?
>>
>> I think the idea is good.
>
> The points is unsigned. It's common technique to avoid underflow.
>

Thanks for explanation, KOSAKI.
I need sleeping. :(

--
Kind regards,
Minchan Kim

2011-05-24 01:35:16

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

(2011/05/24 7:32), David Rientjes wrote:
> On Fri, 20 May 2011, KOSAKI Motohiro wrote:
>
>> CAI Qian reported oom-killer killed all system daemons in his
>> system at first if he ran fork bomb as root. The problem is,
>> current logic give them bonus of 3% of system ram. Example,
>> he has 16GB machine, then root processes have ~500MB oom
>> immune. It bring us crazy bad result. _all_ processes have
>> oom-score=1 and then, oom killer ignore process memory usage
>> and kill random process. This regression is caused by commit
>> a63d83f427 (oom: badness heuristic rewrite).
>>
>> This patch changes select_bad_process() slightly. If oom points == 1,
>> it's a sign that the system have only root privileged processes or
>> similar. Thus, select_bad_process() calculate oom badness without
>> root bonus and select eligible process.
>>
>
> You said earlier that you thought it was a good idea to do a proportional
> based bonus for root processes. Do you have a specific objection to
> giving root processes a 1% bonus for every 10% of used memory instead?

Because it's completely another topic. You have to maek another patch.

>> Also, this patch move finding sacrifice child logic into
>> select_bad_process(). It's necessary to implement adequate
>> no root bonus recalculation. and it makes good side effect,
>> current logic doesn't behave as the doc.
>>
>
> This is unnecessary and just makes the oom killer egregiously long. We
> are already diagnosing problems here at Google where the oom killer holds
> tasklist_lock on the readside for far too long, causing other cpus waiting
> for a write_lock_irq(&tasklist_lock) to encounter issues when irqs are
> disabled and it is spinning. A second tasklist scan is simply a
> non-starter.
>
> [ This is also one of the reasons why we needed to introduce
> mm->oom_disable_count to prevent a second, expensive tasklist scan. ]

You misunderstand the code. Both select_bad_process() and oom_kill_process()
are under tasklist_lock(). IOW, no change lock holding time.

>> Documentation/sysctl/vm.txt says
>>
>> oom_kill_allocating_task
>>
>> If this is set to non-zero, the OOM killer simply kills the task that
>> triggered the out-of-memory condition. This avoids the expensive
>> tasklist scan.
>>
>> IOW, oom_kill_allocating_task shouldn't search sacrifice child.
>> This patch also fixes this issue.
>>
>
> oom_kill_allocating_task was introduced for SGI to prevent the expensive
> tasklist scan, the task that is actually allocating the memory isn't
> actually interesting and is usually random. This should be turned into a
> documentation fix rather than changing the implementation.

No benefit. I don't take it.

2011-05-24 01:40:04

by David Rientjes

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

On Tue, 24 May 2011, KOSAKI Motohiro wrote:

> > > Also, this patch move finding sacrifice child logic into
> > > select_bad_process(). It's necessary to implement adequate
> > > no root bonus recalculation. and it makes good side effect,
> > > current logic doesn't behave as the doc.
> > >
> >
> > This is unnecessary and just makes the oom killer egregiously long. We
> > are already diagnosing problems here at Google where the oom killer holds
> > tasklist_lock on the readside for far too long, causing other cpus waiting
> > for a write_lock_irq(&tasklist_lock) to encounter issues when irqs are
> > disabled and it is spinning. A second tasklist scan is simply a
> > non-starter.
> >
> > [ This is also one of the reasons why we needed to introduce
> > mm->oom_disable_count to prevent a second, expensive tasklist scan. ]
>
> You misunderstand the code. Both select_bad_process() and oom_kill_process()
> are under tasklist_lock(). IOW, no change lock holding time.
>

A second iteration through the tasklist in select_bad_process() will
extend the time that tasklist_lock is held, which is what your patch does.

2011-05-24 01:45:04

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH 3/5] oom: oom-killer don't use proportion of system-ram internally

>> @@ -176,33 +178,49 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
>> */
>> points = get_mm_rss(p->mm) + p->mm->nr_ptes;
>> points += get_mm_counter(p->mm, MM_SWAPENTS);
>> -
>> - points *= 1000;
>> - points /= totalpages;
>> task_unlock(p);
>>
>> /*
>> * Root processes get 3% bonus, just like the __vm_enough_memory()
>> * implementation used by LSMs.
>> + *
>> + * XXX: Too large bonus, example, if the system have tera-bytes memory..
>> */
>
> Nitpick. I have no opposition about adding this comment.
> But strictly speaking, the comment isn't related to this patch.
> No biggie and it's up to you. :)

ok, removed.

From 3dda8863e5acdba7a714f0e7506fae931865c442 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Tue, 24 May 2011 10:43:49 +0900
Subject: [PATCH] remove unrelated comments

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec075cc..b01fa64 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -184,8 +184,6 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
/*
* Root processes get 3% bonus, just like the __vm_enough_memory()
* implementation used by LSMs.
- *
- * XXX: Too large bonus, example, if the system have tera-bytes memory..
*/
if (protect_root && has_capability_noaudit(p, CAP_SYS_ADMIN)) {
if (points >= totalpages / 32)
--
1.7.3.1

2011-05-24 01:54:08

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

>> + /*
>> + * chosen_point==1 may be a sign that root privilege bonus is too large
>> + * and we choose wrong task. Let's recalculate oom score without the
>> + * dubious bonus.
>> + */
>> + if (protect_root&& (chosen_points == 1)) {
>> + protect_root = 0;
>> + goto retry;
>> + }
>
> The idea is good to me.
> But once we meet it, should we give up protecting root privileged processes?
> How about decaying bonus point?

After applying my patch, unprivileged process never get score-1. (note, mapping
anon pages naturally makes to increase nr_ptes)

Then, decaying don't make any accuracy. Am I missing something?

2011-05-24 01:56:07

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

>>> This is unnecessary and just makes the oom killer egregiously long. We
>>> are already diagnosing problems here at Google where the oom killer holds
>>> tasklist_lock on the readside for far too long, causing other cpus waiting
>>> for a write_lock_irq(&tasklist_lock) to encounter issues when irqs are
>>> disabled and it is spinning. A second tasklist scan is simply a
>>> non-starter.
>>>
>>> [ This is also one of the reasons why we needed to introduce
>>> mm->oom_disable_count to prevent a second, expensive tasklist scan. ]
>>
>> You misunderstand the code. Both select_bad_process() and oom_kill_process()
>> are under tasklist_lock(). IOW, no change lock holding time.
>>
>
> A second iteration through the tasklist in select_bad_process() will
> extend the time that tasklist_lock is held, which is what your patch does.

It never happen usual case. Plz think when happen all process score = 1.

2011-05-24 01:58:33

by David Rientjes

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

On Tue, 24 May 2011, KOSAKI Motohiro wrote:

> > > > This is unnecessary and just makes the oom killer egregiously long. We
> > > > are already diagnosing problems here at Google where the oom killer
> > > > holds
> > > > tasklist_lock on the readside for far too long, causing other cpus
> > > > waiting
> > > > for a write_lock_irq(&tasklist_lock) to encounter issues when irqs are
> > > > disabled and it is spinning. A second tasklist scan is simply a
> > > > non-starter.
> > > >
> > > > [ This is also one of the reasons why we needed to introduce
> > > > mm->oom_disable_count to prevent a second, expensive tasklist scan.
> > > > ]
> > >
> > > You misunderstand the code. Both select_bad_process() and
> > > oom_kill_process()
> > > are under tasklist_lock(). IOW, no change lock holding time.
> > >
> >
> > A second iteration through the tasklist in select_bad_process() will
> > extend the time that tasklist_lock is held, which is what your patch does.
>
> It never happen usual case. Plz think when happen all process score = 1.
>

I don't care if it happens in the usual case or extremely rare case. It
significantly increases the amount of time that tasklist_lock is held
which causes writelock starvation on other cpus and causes issues,
especially if the cpu being starved is updating the timer because it has
irqs disabled, i.e. write_lock_irq(&tasklist_lock) usually in the clone or
exit path. We can do better than that, and that's why I proposed my patch
to CAI that increases the resolution of the scoring and makes the root
process bonus proportional to the amount of used memory.

2011-05-24 02:03:45

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

(2011/05/24 10:58), David Rientjes wrote:
> On Tue, 24 May 2011, KOSAKI Motohiro wrote:
>
>>>>> This is unnecessary and just makes the oom killer egregiously long. We
>>>>> are already diagnosing problems here at Google where the oom killer
>>>>> holds
>>>>> tasklist_lock on the readside for far too long, causing other cpus
>>>>> waiting
>>>>> for a write_lock_irq(&tasklist_lock) to encounter issues when irqs are
>>>>> disabled and it is spinning. A second tasklist scan is simply a
>>>>> non-starter.
>>>>>
>>>>> [ This is also one of the reasons why we needed to introduce
>>>>> mm->oom_disable_count to prevent a second, expensive tasklist scan.
>>>>> ]
>>>>
>>>> You misunderstand the code. Both select_bad_process() and
>>>> oom_kill_process()
>>>> are under tasklist_lock(). IOW, no change lock holding time.
>>>>
>>>
>>> A second iteration through the tasklist in select_bad_process() will
>>> extend the time that tasklist_lock is held, which is what your patch does.
>>
>> It never happen usual case. Plz think when happen all process score = 1.
>>
>
> I don't care if it happens in the usual case or extremely rare case. It
> significantly increases the amount of time that tasklist_lock is held
> which causes writelock starvation on other cpus and causes issues,
> especially if the cpu being starved is updating the timer because it has
> irqs disabled, i.e. write_lock_irq(&tasklist_lock) usually in the clone or
> exit path. We can do better than that, and that's why I proposed my patch
> to CAI that increases the resolution of the scoring and makes the root
> process bonus proportional to the amount of used memory.

Do I need to say the same word? Please read the code at first.

2011-05-24 02:08:09

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH 3/5] oom: oom-killer don't use proportion of system-ram internally

(2011/05/24 7:28), David Rientjes wrote:
> On Fri, 20 May 2011, KOSAKI Motohiro wrote:
>
>> CAI Qian reported his kernel did hang-up if he ran fork intensive
>> workload and then invoke oom-killer.
>>
>> The problem is, current oom calculation uses 0-1000 normalized value
>> (The unit is a permillage of system-ram). Its low precision make
>> a lot of same oom score. IOW, in his case, all processes have smaller
>> oom score than 1 and internal calculation round it to 1.
>>
>> Thus oom-killer kill ineligible process. This regression is caused by
>> commit a63d83f427 (oom: badness heuristic rewrite).
>>
>> The solution is, the internal calculation just use number of pages
>> instead of permillage of system-ram. And convert it to permillage
>> value at displaying time.
>>
>> This patch doesn't change any ABI (included /proc/<pid>/oom_score_adj)
>> even though current logic has a lot of my dislike thing.
>>
>
> Same response as when you initially proposed this patch:
> http://marc.info/?l=linux-kernel&m=130507086613317 -- you never replied to
> that.

I did replay. Why don't you read?
http://www.gossamer-threads.com/lists/linux/kernel/1378837#1378837

If you haven't understand the issue, you can apply following patch and
run it.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b01fa64..f35909b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -718,6 +718,9 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
*/
constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
&totalpages);
+
+ totalpages *= 10;
+
mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
check_panic_on_oom(constraint, gfp_mask, order, mpol_mask);

> The changelog doesn't accurately represent CAI Qian's problem; the issue
> is that root processes are given too large of a bonus in comparison to
> other threads that are using at most 1.9% of available memory. That can
> be fixed, as I suggested by giving 1% bonus per 10% of memory used so that
> the process would have to be using 10% before it even receives a bonus.
>
> I already suggested an alternative patch to CAI Qian to greatly increase
> the granularity of the oom score from a range of 0-1000 to 0-10000 to
> differentiate between tasks within 0.01% of available memory (16MB on CAI
> Qian's 16GB system). I'll propose this officially in a separate email.
>
> This patch also includes undocumented changes such as changing the bonus
> given to root processes.

2011-05-24 03:11:30

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH 3/5] oom: oom-killer don't use proportion of system-ram internally

> ok, removed.

I'm sorry. previous patch has white space damage.
Let's retry send it.

>From 3dda8863e5acdba7a714f0e7506fae931865c442 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Tue, 24 May 2011 10:43:49 +0900
Subject: [PATCH] remove unrelated comments

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec075cc..b01fa64 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -184,8 +184,6 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
/*
* Root processes get 3% bonus, just like the __vm_enough_memory()
* implementation used by LSMs.
- *
- * XXX: Too large bonus, example, if the system have tera-bytes memory..
*/
if (protect_root && has_capability_noaudit(p, CAP_SYS_ADMIN)) {
if (points >= totalpages / 32)
--
1.7.3.1

2011-05-24 08:32:13

[permalink] [raw]

Subject: Re: [PATCH 3/5] oom: oom-killer don't use proportion of system-ram internally

----- Original Message -----
> On Mon, 23 May 2011, David Rientjes wrote:
>
> > I already suggested an alternative patch to CAI Qian to greatly
> > increase
> > the granularity of the oom score from a range of 0-1000 to 0-10000
> > to
> > differentiate between tasks within 0.01% of available memory (16MB
> > on CAI
> > Qian's 16GB system). I'll propose this officially in a separate
> > email.
> >
>
> This is an alternative patch as earlier proposed with suggested
> improvements from Minchan. CAI, would it be possible to test this out
> on
> your usecase?
Sure, will test KOSAKI Motohiro's v2 patches plus this one.
> I'm indifferent to the actual scale of OOM_SCORE_MAX_FACTOR; it could
> be
> 10 as proposed in this patch or even increased higher for higher
> resolution.
>
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -38,6 +38,9 @@ int sysctl_oom_kill_allocating_task;
> int sysctl_oom_dump_tasks = 1;
> static DEFINE_SPINLOCK(zone_scan_lock);
>
> +#define OOM_SCORE_MAX_FACTOR 10
> +#define OOM_SCORE_MAX (OOM_SCORE_ADJ_MAX * OOM_SCORE_MAX_FACTOR)
> +
> #ifdef CONFIG_NUMA
> /**
> * has_intersects_mems_allowed() - check task eligiblity for kill
> @@ -160,7 +163,7 @@ unsigned int oom_badness(struct task_struct *p,
> struct mem_cgroup *mem,
> */
> if (p->flags & PF_OOM_ORIGIN) {
> task_unlock(p);
> - return 1000;
> + return OOM_SCORE_MAX;
> }
>
> /*
> @@ -177,32 +180,38 @@ unsigned int oom_badness(struct task_struct *p,
> struct mem_cgroup *mem,
> points = get_mm_rss(p->mm) + p->mm->nr_ptes;
> points += get_mm_counter(p->mm, MM_SWAPENTS);
>
> - points *= 1000;
> + points *= OOM_SCORE_MAX;
> points /= totalpages;
> task_unlock(p);
>
> /*
> - * Root processes get 3% bonus, just like the __vm_enough_memory()
> - * implementation used by LSMs.
> + * Root processes get a bonus of 1% per 10% of memory used.
> */
> - if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> - points -= 30;
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN)) {
> + int bonus;
> + int granularity;
> +
> + bonus = OOM_SCORE_MAX / 100; /* bonus is 1% */
> + granularity = OOM_SCORE_MAX / 10; /* granularity is 10% */
> +
> + points -= bonus * (points / granularity);
> + }
>
> /*
> * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
> * either completely disable oom killing or always prefer a certain
> * task.
> */
> - points += p->signal->oom_score_adj;
> + points += p->signal->oom_score_adj * OOM_SCORE_MAX_FACTOR;
>
> /*
> * Never return 0 for an eligible task that may be killed since it's
> - * possible that no single user task uses more than 0.1% of memory
> and
> + * possible that no single user task uses more than 0.01% of memory
> and
> * no single admin tasks uses more than 3.0%.
> */
> if (points <= 0)
> return 1;
> - return (points < 1000) ? points : 1000;
> + return (points < OOM_SCORE_MAX) ? points : OOM_SCORE_MAX;
> }
>
> /*
> @@ -314,7 +323,7 @@ static struct task_struct
> *select_bad_process(unsigned int *ppoints,
> */
> if (p == current) {
> chosen = p;
> - *ppoints = 1000;
> + *ppoints = OOM_SCORE_MAX;
> } else {
> /*
> * If this task is not being ptraced on exit,

2011-05-24 08:46:56

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

On Tue, May 24, 2011 at 10:53 AM, KOSAKI Motohiro
<[email protected]> wrote:
>>> + /*
>>> + * chosen_point==1 may be a sign that root privilege bonus is too
>>> large
>>> + * and we choose wrong task. Let's recalculate oom score without
>>> the
>>> + * dubious bonus.
>>> + */
>>> + if (protect_root&& (chosen_points == 1)) {
>>> + protect_root = 0;
>>> + goto retry;
>>> + }
>>
>> The idea is good to me.
>> But once we meet it, should we give up protecting root privileged
>> processes?
>> How about decaying bonus point?
>
> After applying my patch, unprivileged process never get score-1. (note,
> mapping
> anon pages naturally makes to increase nr_ptes)

Hmm, If I understand your code correctly, unprivileged process can get
a score 1 by 3% bonus.
So after all, we can get a chosen_point with 1.
Why I get a chosen_point with 1 is as bonus is rather big, I think.
So I would like to use small bonus than first iteration(ie, decay bonus).

>
> Then, decaying don't make any accuracy. Am I missing something?

Maybe I miss something. :(

--
Kind regards,
Minchan Kim

2011-05-24 08:49:44

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

(2011/05/24 17:46), Minchan Kim wrote:
> On Tue, May 24, 2011 at 10:53 AM, KOSAKI Motohiro
> <[email protected]> wrote:
>>>> + /*
>>>> + * chosen_point==1 may be a sign that root privilege bonus is too
>>>> large
>>>> + * and we choose wrong task. Let's recalculate oom score without
>>>> the
>>>> + * dubious bonus.
>>>> + */
>>>> + if (protect_root&& (chosen_points == 1)) {
>>>> + protect_root = 0;
>>>> + goto retry;
>>>> + }
>>>
>>> The idea is good to me.
>>> But once we meet it, should we give up protecting root privileged
>>> processes?
>>> How about decaying bonus point?
>>
>> After applying my patch, unprivileged process never get score-1. (note,
>> mapping
>> anon pages naturally makes to increase nr_ptes)
>
> Hmm, If I understand your code correctly, unprivileged process can get
> a score 1 by 3% bonus.

3% bonus is for privileged process. :)

> So after all, we can get a chosen_point with 1.
> Why I get a chosen_point with 1 is as bonus is rather big, I think.
> So I would like to use small bonus than first iteration(ie, decay bonus).
>
>>
>> Then, decaying don't make any accuracy. Am I missing something?
>
> Maybe I miss something. :(
>
>
>
>

2011-05-24 09:04:21

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

On Tue, May 24, 2011 at 5:49 PM, KOSAKI Motohiro
<[email protected]> wrote:
> (2011/05/24 17:46), Minchan Kim wrote:
>> On Tue, May 24, 2011 at 10:53 AM, KOSAKI Motohiro
>> <[email protected]> wrote:
>>>>> + /*
>>>>> + * chosen_point==1 may be a sign that root privilege bonus is too
>>>>> large
>>>>> + * and we choose wrong task. Let's recalculate oom score without
>>>>> the
>>>>> + * dubious bonus.
>>>>> + */
>>>>> + if (protect_root&& (chosen_points == 1)) {
>>>>> + protect_root = 0;
>>>>> + goto retry;
>>>>> + }
>>>>
>>>> The idea is good to me.
>>>> But once we meet it, should we give up protecting root privileged
>>>> processes?
>>>> How about decaying bonus point?
>>>
>>> After applying my patch, unprivileged process never get score-1. (note,
>>> mapping
>>> anon pages naturally makes to increase nr_ptes)
>>
>> Hmm, If I understand your code correctly, unprivileged process can get
>> a score 1 by 3% bonus.
>
> 3% bonus is for privileged process. :)

OMG. Typo.
Anyway, my point is following as.
If chose_point is 1, it means root bonus is rather big. Right?
If is is, your patch does second loop with completely ignore of bonus
for root privileged process.
My point is that let's not ignore bonus completely. Instead of it,
let's recalculate 1.5% for example.

But I don't insist on my idea.
Thanks.
--
Kind regards,
Minchan Kim

2011-05-24 09:09:53

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

>>> Hmm, If I understand your code correctly, unprivileged process can get
>>> a score 1 by 3% bonus.
>>
>> 3% bonus is for privileged process. :)
>
> OMG. Typo.
> Anyway, my point is following as.
> If chose_point is 1, it means root bonus is rather big. Right?
> If is is, your patch does second loop with completely ignore of bonus
> for root privileged process.
> My point is that let's not ignore bonus completely. Instead of it,
> let's recalculate 1.5% for example.

1) unpriviledged process can't get score 1 (because at least a process need one
anon, one file and two or more ptes).
2) then, score=1 mean all processes in the system are privileged. thus decay won't help.

IOW, never happen privileged and unprivileged score in this case.

>
> But I don't insist on my idea.
> Thanks.

2011-05-24 09:20:53

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

On Tue, May 24, 2011 at 6:09 PM, KOSAKI Motohiro
<[email protected]> wrote:
>>>> Hmm, If I understand your code correctly, unprivileged process can get
>>>> a score 1 by 3% bonus.
>>>
>>> 3% bonus is for privileged process. :)
>>
>> OMG. Typo.
>> Anyway, my point is following as.
>> If chose_point is 1, it means root bonus is rather big. Right?
>> If is is, your patch does second loop with completely ignore of bonus
>> for root privileged process.
>> My point is that let's not ignore bonus completely. Instead of it,
>> let's recalculate 1.5% for example.
>
> 1) unpriviledged process can't get score 1 (because at least a process need one
> anon, one file and two or more ptes).
> 2) then, score=1 mean all processes in the system are privileged. thus decay won't help.
>
> IOW, never happen privileged and unprivileged score in this case.

I am blind. Thanks for open my eyes, KOSAKI.

--
Kind regards,
Minchan Kim

2011-05-24 09:38:37

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

(2011/05/24 18:20), Minchan Kim wrote:
> On Tue, May 24, 2011 at 6:09 PM, KOSAKI Motohiro
> <[email protected]> wrote:
>>>>> Hmm, If I understand your code correctly, unprivileged process can get
>>>>> a score 1 by 3% bonus.
>>>>
>>>> 3% bonus is for privileged process. :)
>>>
>>> OMG. Typo.
>>> Anyway, my point is following as.
>>> If chose_point is 1, it means root bonus is rather big. Right?
>>> If is is, your patch does second loop with completely ignore of bonus
>>> for root privileged process.
>>> My point is that let's not ignore bonus completely. Instead of it,
>>> let's recalculate 1.5% for example.
>>
>> 1) unpriviledged process can't get score 1 (because at least a process need one
>> anon, one file and two or more ptes).
>> 2) then, score=1 mean all processes in the system are privileged. thus decay won't help.
>>
>> IOW, never happen privileged and unprivileged score in this case.
>
> I am blind. Thanks for open my eyes, KOSAKI.

No. Your review is very cute. Thank you for attempting this!

2011-05-25 23:50:23

by David Rientjes

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

On Tue, 24 May 2011, KOSAKI Motohiro wrote:

> > I don't care if it happens in the usual case or extremely rare case. It
> > significantly increases the amount of time that tasklist_lock is held
> > which causes writelock starvation on other cpus and causes issues,
> > especially if the cpu being starved is updating the timer because it has
> > irqs disabled, i.e. write_lock_irq(&tasklist_lock) usually in the clone or
> > exit path. We can do better than that, and that's why I proposed my patch
> > to CAI that increases the resolution of the scoring and makes the root
> > process bonus proportional to the amount of used memory.
>
> Do I need to say the same word? Please read the code at first.
>

I'm afraid that a second time through the tasklist in select_bad_process()
is simply a non-starter for _any_ case; it significantly increases the
amount of time that tasklist_lock is held and causes problems elsewhere on
large systems -- such as some of ours -- since irqs are disabled while
waiting for the writeside of the lock. I think it would be better to use
a proportional privilege for root processes based on the amount of memory
they are using (discounting 1% of memory per 10% of memory used, as
proposed earlier, seems sane) so we can always protect root when necessary
and never iterate through the list again.

Please look into the earlier review comments on the other patches, refresh
the series, and post it again. Thanks!

2011-05-26 07:09:16

[permalink] [raw]

Subject: Re: [PATCH 3/5] oom: oom-killer don't use proportion of system-ram internally

----- Original Message -----
> On Mon, 23 May 2011, David Rientjes wrote:
>
> > I already suggested an alternative patch to CAI Qian to greatly
> > increase
> > the granularity of the oom score from a range of 0-1000 to 0-10000
> > to
> > differentiate between tasks within 0.01% of available memory (16MB
> > on CAI
> > Qian's 16GB system). I'll propose this officially in a separate
> > email.
> >
>
> This is an alternative patch as earlier proposed with suggested
> improvements from Minchan. CAI, would it be possible to test this out
> on
> your usecase?
Here is the results for the testing. Running the reproducer as non-root
user, the results look good as OOM killer just killed each python process
in-turn that the reproducer forked. However, when running it as root
user, sshd and other random processes had been killed.

[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 567] 0 567 2935 365 0 -17 -1000 udevd
[ 2116] 0 2116 3099 464 8 -17 -1000 udevd
[ 2117] 0 2117 3099 503 2 -17 -1000 udevd
[ 2317] 0 2317 6404 39 8 -17 -1000 auditd
[ 3221] 0 3221 15998 153 9 0 0 sshd
[ 3223] 0 3223 24421 204 0 0 0 sshd
[ 3227] 0 3227 27093 86 4 0 0 bash
[ 3246] 0 3246 1029 18 1 0 0 agetty
[ 3251] 0 3251 243710 98227 11 0 0 python
[ 3252] 0 3252 243710 109999 9 0 0 python
[ 3253] 0 3253 243710 111538 12 0 0 python
[ 3254] 0 3254 243710 106931 1 0 0 python
[ 3255] 0 3255 243710 103367 9 0 0 python
[ 3256] 0 3256 243710 97715 1 0 0 python
[ 3257] 0 3257 243710 107443 9 0 0 python
[ 3258] 0 3258 243710 101298 4 0 0 python
[ 3259] 0 3259 243710 118707 1 0 0 python
[ 3260] 0 3260 243710 104882 9 0 0 python
[ 3261] 0 3261 243710 108979 12 0 0 python
[ 3262] 0 3262 243710 93106 1 0 0 python
[ 3263] 0 3263 243710 97714 12 0 0 python
[ 3264] 0 3264 243710 91571 12 0 0 python
[ 3265] 0 3265 243710 93107 1 0 0 python
[ 3266] 0 3266 243710 83790 9 0 0 python
[ 3267] 0 3267 243710 81330 5 0 0 python
[ 3268] 0 3268 243710 83378 5 0 0 python
[ 3269] 0 3269 243710 77235 4 0 0 python
[ 3270] 0 3270 243710 80732 1 0 0 python
[ 3271] 0 3271 243710 72626 11 0 0 python
[ 3272] 0 3272 243710 81385 7 0 0 python
[ 3273] 0 3273 243710 71749 3 0 0 python
[ 3274] 0 3274 243710 70735 1 0 0 python
[ 3275] 0 3275 243710 84403 9 0 0 python
[ 3276] 0 3276 243710 72255 13 0 0 python
[ 3277] 0 3277 243710 65971 3 0 0 python
[ 3278] 0 3278 243710 66172 15 0 0 python
[ 3279] 0 3279 243710 69555 1 0 0 python
[ 3280] 0 3280 243710 68689 9 0 0 python
[ 3281] 0 3281 243710 69553 9 0 0 python
[ 3282] 0 3282 243710 64439 6 0 0 python
[ 3283] 0 3283 243710 56753 11 0 0 python
[ 3284] 0 3284 243710 57917 6 0 0 python
[ 3285] 0 3285 243710 55730 9 0 0 python
[ 3286] 0 3286 243710 54193 9 0 0 python
[ 3287] 0 3287 243710 51123 1 0 0 python
[ 3288] 0 3288 243710 52146 15 0 0 python
[ 3289] 0 3289 243710 48220 9 0 0 python
[ 3290] 0 3290 243710 48051 3 0 0 python
[ 3291] 0 3291 243710 40371 3 0 0 python
[ 3292] 0 3292 243710 49229 13 0 0 python
[ 3293] 0 3293 243710 40549 9 0 0 python
[ 3294] 0 3294 243710 41618 5 0 0 python
[ 3295] 0 3295 243710 40429 9 0 0 python
[ 3296] 0 3296 243710 36787 1 0 0 python
[ 3297] 0 3297 243710 39346 11 0 0 python
[ 3298] 0 3298 243710 35251 3 0 0 python
[ 3299] 0 3299 243710 32872 3 0 0 python
[ 3300] 0 3300 243710 29781 1 0 0 python
[ 3301] 0 3301 243710 27570 11 0 0 python
[ 3302] 0 3302 243710 28081 9 0 0 python
[ 3303] 0 3303 243710 24499 1 0 0 python
[ 3304] 0 3304 243710 21427 1 0 0 python
[ 3305] 0 3305 243710 25522 9 0 0 python
[ 3306] 0 3306 243710 28081 9 0 0 python
[ 3307] 0 3307 243710 21939 9 0 0 python
[ 3308] 0 3308 243710 19890 9 0 0 python
[ 3309] 0 3309 243710 18354 3 0 0 python
[ 3310] 0 3310 243710 16590 14 0 0 python
[ 3311] 0 3311 243710 18718 11 0 0 python
[ 3312] 0 3312 243710 17841 1 0 0 python
[ 3313] 0 3313 243710 14258 11 0 0 python
[ 3314] 0 3314 243710 14426 4 0 0 python
[ 3315] 0 3315 243710 15282 6 0 0 python
[ 3316] 0 3316 243710 9650 6 0 0 python
[ 3317] 0 3317 243710 11699 1 0 0 python
[ 3318] 0 3318 243710 11372 3 0 0 python
[ 3319] 0 3319 243710 9650 9 0 0 python
[ 3320] 0 3320 243710 8426 11 0 0 python
[ 3321] 0 3321 243710 4531 3 0 0 python
[ 3322] 0 3322 243710 8627 9 0 0 python
[ 3323] 0 3323 243710 6578 1 0 0 python
[ 3324] 0 3324 243710 5553 7 0 0 python
[ 3325] 0 3325 243710 10673 3 0 0 python
[ 3326] 0 3326 243710 6578 11 0 0 python
[ 3327] 0 3327 243710 3505 1 0 0 python
[ 3328] 0 3328 243710 3530 1 0 0 python
[ 3329] 0 3329 243710 5205 11 0 0 python
[ 3330] 0 3330 243710 1970 9 0 0 python
[ 3331] 0 3331 243710 4021 11 0 0 python
[ 3332] 0 3332 243710 5043 1 0 0 python
[ 3333] 0 3333 243710 2481 1 0 0 python
[ 3334] 0 3334 243710 4530 1 0 0 python
[ 3343] 0 3343 41835 773 9 0 0 python
[ 3344] 0 3344 41835 773 4 0 0 python
[ 3345] 0 3345 41835 773 1 0 0 python
[ 3346] 0 3346 41835 773 1 0 0 python
[ 3347] 0 3347 41835 773 9 0 0 python
[ 3348] 0 3348 41835 773 3 0 0 python
[ 3349] 0 3349 41835 773 1 0 0 python
[ 3350] 0 3350 41835 773 11 0 0 python
Out of memory: Kill process 3221 (sshd) score 1 or sacrifice child
Killed process 3223 (sshd) total-vm:97684kB, anon-rss:816kB, file-rss:0kB
sshd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
sshd cpuset=/ mems_allowed=0-1

[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 567] 0 567 2935 0 0 -17 -1000 udevd
[ 2103] 0 2103 1025 0 9 0 0 mingetty
[ 2105] 0 2105 1025 0 2 0 0 mingetty
[ 2109] 0 2109 19263 100 0 0 0 login
[ 2116] 0 2116 3099 0 8 -17 -1000 udevd
[ 2117] 0 2117 3099 0 2 -17 -1000 udevd
[ 2317] 0 2317 6404 20 0 -17 -1000 auditd
[ 2338] 0 2338 27093 11 10 0 0 bash
[ 2358] 0 2358 245248 8337 6 0 0 python
[ 2359] 0 2359 245248 11151 6 0 0 python
[ 2360] 0 2360 245248 12487 10 0 0 python
[ 2361] 0 2361 245248 11702 9 0 0 python
[ 2362] 0 2362 245248 6751 1 0 0 python
[ 2363] 0 2363 245248 10952 2 0 0 python
[ 2364] 0 2364 245248 12113 1 0 0 python
[ 2365] 0 2365 245248 11258 9 0 0 python
[ 2366] 0 2366 245248 9697 10 0 0 python
[ 2367] 0 2367 245248 12453 2 0 0 python
[ 2368] 0 2368 245248 14357 10 0 0 python
[ 2369] 0 2369 245248 11282 10 0 0 python
[ 2370] 0 2370 245248 11138 0 0 0 python
[ 2371] 0 2371 245248 10615 13 0 0 python
[ 2372] 0 2372 245248 10742 2 0 0 python
[ 2373] 0 2373 245248 9024 7 0 0 python
[ 2374] 0 2374 245248 12176 12 0 0 python
[ 2375] 0 2375 245248 13886 10 0 0 python
[ 2376] 0 2376 245248 10974 5 0 0 python
[ 2377] 0 2377 245248 8416 11 0 0 python
[ 2378] 0 2378 245248 9469 11 0 0 python
[ 2379] 0 2379 245248 11312 13 0 0 python
[ 2380] 0 2380 245248 9317 1 0 0 python
[ 2381] 0 2381 245248 10424 0 0 0 python
[ 2382] 0 2382 245248 15806 1 0 0 python
[ 2383] 0 2383 245248 15340 7 0 0 python
[ 2384] 0 2384 245248 7932 9 0 0 python
[ 2385] 0 2385 245248 10420 0 0 0 python
[ 2386] 0 2386 245248 14376 9 0 0 python
[ 2387] 0 2387 245248 12410 2 0 0 python
[ 2388] 0 2388 245248 14596 9 0 0 python
[ 2389] 0 2389 245248 7898 9 0 0 python
[ 2390] 0 2390 245248 10943 10 0 0 python
[ 2391] 0 2391 245248 8787 2 0 0 python
[ 2392] 0 2392 245248 7252 10 0 0 python
[ 2393] 0 2393 245248 12978 15 0 0 python
[ 2394] 0 2394 245248 7034 11 0 0 python
[ 2395] 0 2395 245248 10903 2 0 0 python
[ 2396] 0 2396 245248 10280 10 0 0 python
[ 2397] 0 2397 245248 10793 9 0 0 python
[ 2398] 0 2398 245248 8205 9 0 0 python
[ 2399] 0 2399 245248 9675 0 0 0 python
[ 2400] 0 2400 245248 11304 5 0 0 python
[ 2401] 0 2401 245248 15053 5 0 0 python
[ 2402] 0 2402 245248 14449 10 0 0 python
[ 2403] 0 2403 245248 8466 1 0 0 python
[ 2404] 0 2404 245248 14250 10 0 0 python
[ 2405] 0 2405 245248 11630 9 0 0 python
[ 2406] 0 2406 245248 9562 9 0 0 python
[ 2407] 0 2407 245248 8802 1 0 0 python
[ 2408] 0 2408 245248 9521 1 0 0 python
[ 2409] 0 2409 245248 4827 13 0 0 python
[ 2410] 0 2410 245248 10364 1 0 0 python
[ 2411] 0 2411 245248 8749 0 0 0 python
[ 2412] 0 2412 245248 15082 0 0 0 python
[ 2413] 0 2413 245248 11023 10 0 0 python
[ 2414] 0 2414 245248 9087 1 0 0 python
[ 2415] 0 2415 245248 9906 2 0 0 python
[ 2416] 0 2416 245248 13862 5 0 0 python
[ 2417] 0 2417 245248 9553 2 0 0 python
[ 2418] 0 2418 245248 8556 13 0 0 python
[ 2419] 0 2419 245248 9246 9 0 0 python
[ 2420] 0 2420 245248 11084 2 0 0 python
[ 2421] 0 2421 245248 16256 2 0 0 python
[ 2422] 0 2422 245248 13057 12 0 0 python
[ 2423] 0 2423 245248 10578 7 0 0 python
[ 2424] 0 2424 245248 10407 3 0 0 python
[ 2425] 0 2425 245248 10329 3 0 0 python
[ 2426] 0 2426 245248 9489 9 0 0 python
[ 2427] 0 2427 245248 10004 3 0 0 python
[ 2428] 0 2428 245248 7411 0 0 0 python
[ 2429] 0 2429 245248 13647 1 0 0 python
[ 2430] 0 2430 245248 10134 2 0 0 python
[ 2431] 0 2431 245248 12157 10 0 0 python
[ 2432] 0 2432 245248 11158 1 0 0 python
[ 2433] 0 2433 245248 9829 14 0 0 python
[ 2434] 0 2434 245248 5859 3 0 0 python
[ 2435] 0 2435 245248 11456 9 0 0 python
[ 2436] 0 2436 245248 12754 3 0 0 python
[ 2437] 0 2437 245248 11098 0 0 0 python
[ 2438] 0 2438 245248 10676 0 0 0 python
[ 2439] 0 2439 245248 9105 2 0 0 python
[ 2440] 0 2440 245248 10539 10 0 0 python
[ 2441] 0 2441 245248 11514 10 0 0 python
[ 2442] 0 2442 245248 10019 4 0 0 python
[ 2443] 0 2443 245248 7545 14 0 0 python
[ 2444] 0 2444 245248 11830 10 0 0 python
[ 2445] 0 2445 245248 4708 10 0 0 python
[ 2446] 0 2446 245248 8227 10 0 0 python
[ 2447] 0 2447 245248 6306 10 0 0 python
[ 2448] 0 2448 245248 8888 0 0 0 python
[ 2449] 0 2449 245248 11337 3 0 0 python
[ 2450] 0 2450 245248 4856 0 0 0 python
[ 2451] 0 2451 245248 12369 0 0 0 python
[ 2452] 0 2452 245248 11077 10 0 0 python
[ 2453] 0 2453 245248 6757 0 0 0 python
[ 2454] 0 2454 245248 6785 10 0 0 python
[ 2455] 0 2455 245248 6532 3 0 0 python
[ 2456] 0 2456 245248 6265 9 0 0 python
[ 2457] 0 2457 245248 8126 3 0 0 python
[ 2458] 0 2458 245248 9573 10 0 0 python
[ 2459] 0 2459 245248 6954 10 0 0 python
[ 2460] 0 2460 245248 7539 3 0 0 python
[ 2461] 0 2461 245248 7623 0 0 0 python
[ 2462] 0 2462 245248 4853 2 0 0 python
[ 2463] 0 2463 245248 9488 10 0 0 python
[ 2464] 0 2464 245248 6415 0 0 0 python
[ 2465] 0 2465 245248 9745 1 0 0 python
[ 2466] 0 2466 245248 7332 3 0 0 python
[ 2467] 0 2467 245248 7408 11 0 0 python
[ 2468] 0 2468 245248 8311 0 0 0 python
[ 2469] 0 2469 245248 6963 0 0 0 python
[ 2470] 0 2470 245248 8620 10 0 0 python
[ 2471] 0 2471 245248 5799 10 0 0 python
[ 2472] 0 2472 245248 12855 10 0 0 python
[ 2473] 0 2473 245248 8718 9 0 0 python
[ 2474] 0 2474 245248 6782 2 0 0 python
[ 2475] 0 2475 245248 9566 0 0 0 python
[ 2476] 0 2476 245248 8083 9 0 0 python
[ 2477] 0 2477 245248 8657 10 0 0 python
[ 2478] 0 2478 245248 8997 9 0 0 python
[ 2479] 0 2479 245248 6539 11 0 0 python
[ 2480] 0 2480 245248 8906 9 0 0 python
[ 2481] 0 2481 245248 8916 11 0 0 python
[ 2482] 0 2482 245248 8083 0 0 0 python
[ 2483] 0 2483 245248 9490 7 0 0 python
[ 2484] 0 2484 245248 8123 0 0 0 python
[ 2485] 0 2485 245248 7315 11 0 0 python
[ 2486] 0 2486 245248 9084 4 0 0 python
[ 2487] 0 2487 245248 8036 15 0 0 python
[ 2488] 0 2488 245248 6839 2 0 0 python
[ 2489] 0 2489 245248 9478 11 0 0 python
[ 2490] 0 2490 245248 11535 11 0 0 python
[ 2491] 0 2491 245248 7895 2 0 0 python
[ 2492] 0 2492 245248 8831 0 0 0 python
[ 2493] 0 2493 245248 9219 0 0 0 python
[ 2494] 0 2494 245248 8472 11 0 0 python
[ 2495] 0 2495 245248 6666 1 0 0 python
[ 2496] 0 2496 245248 4875 11 0 0 python
[ 2497] 0 2497 245248 6802 11 0 0 python
[ 2498] 0 2498 245248 4901 9 0 0 python
[ 2499] 0 2499 245248 8510 11 0 0 python
[ 2500] 0 2500 245248 8620 15 0 0 python
[ 2501] 0 2501 245248 7169 10 0 0 python
[ 2502] 0 2502 245248 6283 0 0 0 python
[ 2503] 0 2503 245248 9497 0 0 0 python
[ 2504] 0 2504 245248 10091 2 0 0 python
[ 2505] 0 2505 245248 11700 0 0 0 python
[ 2506] 0 2506 245248 8353 3 0 0 python
[ 2507] 0 2507 245248 8505 2 0 0 python
[ 2508] 0 2508 245248 10486 0 0 0 python
[ 2509] 0 2509 245248 6641 3 0 0 python
[ 2510] 0 2510 245248 7175 10 0 0 python
[ 2511] 0 2511 245248 10100 9 0 0 python
[ 2512] 0 2512 245248 6984 13 0 0 python
[ 2513] 0 2513 245248 7677 13 0 0 python
[ 2514] 0 2514 245248 7645 11 0 0 python
[ 2515] 0 2515 245248 8854 4 0 0 python
[ 2516] 0 2516 245248 6888 0 0 0 python
[ 2517] 0 2517 245248 6297 11 0 0 python
[ 2518] 0 2518 245248 8011 11 0 0 python
[ 2519] 0 2519 245248 6353 10 0 0 python
[ 2520] 0 2520 245248 5168 9 0 0 python
[ 2521] 0 2521 245248 7274 11 0 0 python
[ 2522] 0 2522 245248 6374 11 0 0 python
[ 2523] 0 2523 245248 9404 1 0 0 python
[ 2524] 0 2524 245248 7486 0 0 0 python
[ 2525] 0 2525 245248 7290 10 0 0 python
[ 2526] 0 2526 245248 5940 0 0 0 python
[ 2527] 0 2527 245248 7999 10 0 0 python
[ 2528] 0 2528 245248 8201 0 0 0 python
[ 2529] 0 2529 245248 8065 0 0 0 python
[ 2530] 0 2530 245248 6452 9 0 0 python
[ 2531] 0 2531 245248 6162 11 0 0 python
[ 2532] 0 2532 245248 6808 0 0 0 python
[ 2533] 0 2533 245248 4331 2 0 0 python
[ 2534] 0 2534 245248 6458 0 0 0 python
[ 2535] 0 2535 245248 3250 0 0 0 python
[ 2536] 0 2536 245248 5289 9 0 0 python
[ 2537] 0 2537 245248 9369 13 0 0 python
[ 2538] 0 2538 245248 9187 15 0 0 python
[ 2539] 0 2539 245248 8274 0 0 0 python
[ 2540] 0 2540 245248 8051 2 0 0 python
[ 2541] 0 2541 245248 4732 4 0 0 python
[ 2542] 0 2542 245248 4662 0 0 0 python
[ 2543] 0 2543 245248 12070 0 0 0 python
[ 2546] 0 2546 245248 6923 4 0 0 python
[ 2547] 0 2547 245248 4550 0 0 0 python
[ 2548] 0 2548 245248 4700 12 0 0 python
[ 2549] 0 2549 245248 5822 11 0 0 python
[ 2550] 0 2550 245248 6179 10 0 0 python
[ 2551] 0 2551 245248 7794 0 0 0 python
[ 2552] 0 2552 245248 6456 10 0 0 python
[ 2553] 0 2553 245248 4932 4 0 0 python
[ 2554] 0 2554 245248 7680 11 0 0 python
[ 2555] 0 2555 245248 1642 10 0 0 python
[ 2556] 0 2556 245248 7480 10 0 0 python
[ 2557] 0 2557 245248 3598 0 0 0 python
[ 2558] 0 2558 245248 7949 0 0 0 python
[ 2559] 0 2559 245248 4294 0 0 0 python
[ 2560] 0 2560 245248 5138 0 0 0 python
[ 2561] 0 2561 245248 11045 9 0 0 python
[ 2562] 0 2562 245248 4290 9 0 0 python
[ 2563] 0 2563 245248 7603 0 0 0 python
[ 2564] 0 2564 245248 8683 12 0 0 python
[ 2565] 0 2565 245248 6409 12 0 0 python
[ 2566] 0 2566 245248 8321 9 0 0 python
[ 2567] 0 2567 245248 7416 0 0 0 python
[ 2568] 0 2568 245248 5272 2 0 0 python
[ 2569] 0 2569 245248 7359 10 0 0 python
[ 2570] 0 2570 245248 4641 9 0 0 python
[ 2571] 0 2571 245248 7698 2 0 0 python
[ 2572] 0 2572 245248 6118 11 0 0 python
[ 2573] 0 2573 245248 4822 0 0 0 python
[ 2574] 0 2574 245248 4745 0 0 0 python
[ 2575] 0 2575 245248 8029 0 0 0 python
[ 2576] 0 2576 245248 6350 9 0 0 python
[ 2577] 0 2577 245248 5537 9 0 0 python
[ 2578] 0 2578 245248 6861 3 0 0 python
[ 2579] 0 2579 245248 5632 4 0 0 python
[ 2580] 0 2580 245248 6023 0 0 0 python
[ 2581] 0 2581 245248 7947 11 0 0 python
[ 2582] 0 2582 245248 6752 9 0 0 python
[ 2583] 0 2583 245248 4282 12 0 0 python
[ 2584] 0 2584 245248 6069 4 0 0 python
[ 2585] 0 2585 245248 5472 11 0 0 python
[ 2586] 0 2586 245248 4729 0 0 0 python
[ 2587] 0 2587 245248 8205 0 0 0 python
[ 2588] 0 2588 245248 6234 10 0 0 python
[ 2589] 0 2589 245248 7687 11 0 0 python
[ 2590] 0 2590 245248 8817 11 0 0 python
[ 2591] 0 2591 245248 5784 11 0 0 python
[ 2592] 0 2592 245248 7518 10 0 0 python
[ 2593] 0 2593 245248 7213 12 0 0 python
[ 2594] 0 2594 245248 9752 3 0 0 python
[ 2595] 0 2595 245248 7039 0 0 0 python
[ 2596] 0 2596 245248 8164 0 0 0 python
[ 2597] 0 2597 245248 4113 11 0 0 python
[ 2598] 0 2598 245248 4153 0 0 0 python
[ 2599] 0 2599 245248 6651 11 0 0 python
[ 2600] 0 2600 245248 3933 9 0 0 python
[ 2601] 0 2601 245248 7722 14 0 0 python
[ 2602] 0 2602 245248 7535 4 0 0 python
[ 2603] 0 2603 245248 4903 2 0 0 python
[ 2604] 0 2604 245248 5542 0 0 0 python
[ 2605] 0 2605 245248 4589 10 0 0 python
[ 2606] 0 2606 245248 7672 2 0 0 python
[ 2607] 0 2607 245248 6656 2 0 0 python
[ 2608] 0 2608 245248 6467 2 0 0 python
[ 2609] 0 2609 245248 8780 0 0 0 python
[ 2610] 0 2610 245248 11257 0 0 0 python
[ 2611] 0 2611 245248 6748 0 0 0 python
[ 2612] 0 2612 245248 8885 11 0 0 python
[ 2613] 0 2613 245248 4232 0 0 0 python
[ 2614] 0 2614 245248 5724 11 0 0 python
[ 2615] 0 2615 245248 2842 11 0 0 python
[ 2616] 0 2616 245248 4994 15 0 0 python
[ 2617] 0 2617 245248 5417 11 0 0 python
[ 2618] 0 2618 245248 4660 0 0 0 python
[ 2619] 0 2619 245248 5655 11 0 0 python
[ 2620] 0 2620 245248 5952 0 0 0 python
[ 2621] 0 2621 245248 6983 11 0 0 python
[ 2622] 0 2622 245248 6066 12 0 0 python
[ 2623] 0 2623 245248 7743 11 0 0 python
[ 2624] 0 2624 245248 3138 11 0 0 python
[ 2625] 0 2625 245248 6144 0 0 0 python
[ 2626] 0 2626 245248 5238 9 0 0 python
[ 2627] 0 2627 245248 9371 11 0 0 python
[ 2628] 0 2628 245248 13048 10 0 0 python
[ 2629] 0 2629 245248 6702 3 0 0 python
[ 2630] 0 2630 245248 5319 10 0 0 python
[ 2631] 0 2631 245248 7964 0 0 0 python
[ 2632] 0 2632 245248 5787 14 0 0 python
[ 2633] 0 2633 245248 9816 0 0 0 python
[ 2634] 0 2634 245248 5415 6 0 0 python
[ 2635] 0 2635 245248 6740 3 0 0 python
[ 2636] 0 2636 245248 10180 3 0 0 python
[ 2637] 0 2637 245248 5007 11 0 0 python
[ 2638] 0 2638 245248 5801 9 0 0 python
[ 2639] 0 2639 245248 7823 3 0 0 python
[ 2640] 0 2640 245248 9127 0 0 0 python
[ 2641] 0 2641 245248 5614 0 0 0 python
[ 2642] 0 2642 245248 4686 10 0 0 python
[ 2643] 0 2643 245248 4305 11 0 0 python
[ 2644] 0 2644 245248 4714 2 0 0 python
[ 2645] 0 2645 245248 5964 11 0 0 python
[ 2646] 0 2646 245248 7440 10 0 0 python
[ 2647] 0 2647 245248 6062 4 0 0 python
[ 2648] 0 2648 245248 5733 6 0 0 python
[ 2649] 0 2649 245248 5063 0 0 0 python
[ 2650] 0 2650 245248 4793 2 0 0 python
[ 2651] 0 2651 245248 5806 4 0 0 python
[ 2652] 0 2652 245248 8126 10 0 0 python
[ 2653] 0 2653 245248 5794 3 0 0 python
[ 2654] 0 2654 245248 4370 12 0 0 python
[ 2655] 0 2655 245248 5621 0 0 0 python
[ 2656] 0 2656 245248 6514 11 0 0 python
[ 2657] 0 2657 245248 6560 3 0 0 python
[ 2658] 0 2658 245248 7352 2 0 0 python
[ 2659] 0 2659 245248 4456 0 0 0 python
[ 2660] 0 2660 245248 6508 3 0 0 python
[ 2661] 0 2661 245248 4231 4 0 0 python
[ 2662] 0 2662 245248 5967 0 0 0 python
[ 2663] 0 2663 245248 5007 3 0 0 python
[ 2664] 0 2664 245248 5878 3 0 0 python
[ 2665] 0 2665 245248 7469 11 0 0 python
[ 2666] 0 2666 245248 4697 4 0 0 python
[ 2667] 0 2667 245248 3484 11 0 0 python
[ 2668] 0 2668 245248 4223 3 0 0 python
[ 2669] 0 2669 245248 10490 10 0 0 python
[ 2670] 0 2670 245248 3395 3 0 0 python
[ 2671] 0 2671 245248 7004 12 0 0 python
[ 2672] 0 2672 245248 6340 0 0 0 python
[ 2673] 0 2673 245248 3384 0 0 0 python
[ 2674] 0 2674 245248 5563 0 0 0 python
[ 2675] 0 2675 245248 4799 14 0 0 python
[ 2676] 0 2676 245248 10170 15 0 0 python
[ 2677] 0 2677 245248 4793 10 0 0 python
[ 2678] 0 2678 245248 6221 0 0 0 python
[ 2679] 0 2679 245248 4710 10 0 0 python
[ 2680] 0 2680 245248 6231 0 0 0 python
[ 2681] 0 2681 245248 3573 3 0 0 python
[ 2682] 0 2682 245248 3332 0 0 0 python
[ 2683] 0 2683 245248 6929 2 0 0 python
[ 2684] 0 2684 245248 6015 11 0 0 python
[ 2685] 0 2685 245248 5167 14 0 0 python
[ 2688] 0 2688 245248 5195 2 0 0 python
[ 2689] 0 2689 245248 5293 2 0 0 python
[ 2690] 0 2690 245248 4398 10 0 0 python
[ 2691] 0 2691 245248 4672 11 0 0 python
[ 2692] 0 2692 245248 5772 6 0 0 python
[ 2693] 0 2693 245248 4550 2 0 0 python
[ 2694] 0 2694 245248 6926 0 0 0 python
[ 2695] 0 2695 245248 3137 2 0 0 python
[ 2696] 0 2696 245248 4804 10 0 0 python
[ 2697] 0 2697 245248 7152 0 0 0 python
[ 2698] 0 2698 245248 3031 3 0 0 python
[ 2699] 0 2699 245248 6700 0 0 0 python
[ 2700] 0 2700 245248 4299 6 0 0 python
[ 2701] 0 2701 245248 3678 0 0 0 python
[ 2702] 0 2702 245248 4665 0 0 0 python
[ 2703] 0 2703 245248 5555 5 0 0 python
[ 2704] 0 2704 245248 5672 0 0 0 python
[ 2705] 0 2705 245248 3480 0 0 0 python
[ 2706] 0 2706 245248 4387 10 0 0 python
[ 2707] 0 2707 245248 4539 0 0 0 python
[ 2708] 0 2708 245248 3206 11 0 0 python
[ 2711] 0 2711 245248 6383 10 0 0 python
[ 2712] 0 2712 245248 6077 2 0 0 python
[ 2713] 0 2713 245248 4819 0 0 0 python
[ 2714] 0 2714 245248 6774 0 0 0 python
[ 2715] 0 2715 245248 4395 0 0 0 python
[ 2716] 0 2716 245248 9053 11 0 0 python
[ 2717] 0 2717 245248 8341 7 0 0 python
[ 2718] 0 2718 245248 4305 0 0 0 python
[ 2723] 0 2723 1027964 156 8 0 0 console-kit-dae
[ 2790] 0 2790 27092 54 4 0 0 bash
[ 2808] 0 2808 245248 4255 11 0 0 python
[ 2809] 0 2809 245248 7280 2 0 0 python
[ 2810] 0 2810 245248 5922 11 0 0 python
[ 2811] 0 2811 245248 4383 0 0 0 python
[ 2812] 0 2812 245248 4755 15 0 0 python
[ 2813] 0 2813 245248 6075 10 0 0 python
[ 2814] 0 2814 245248 4818 2 0 0 python
[ 2815] 0 2815 245248 4671 3 0 0 python
[ 2816] 0 2816 245248 5975 0 0 0 python
[ 2817] 0 2817 245248 4209 0 0 0 python
[ 2818] 0 2818 245248 5534 12 0 0 python
[ 2819] 0 2819 245248 2562 0 0 0 python
[ 2820] 0 2820 245248 4585 7 0 0 python
[ 2821] 0 2821 245248 6823 10 0 0 python
[ 2822] 0 2822 245248 5243 11 0 0 python
[ 2823] 0 2823 245248 7690 0 0 0 python
[ 2824] 0 2824 245248 5813 11 0 0 python
[ 2825] 0 2825 245248 3626 7 0 0 python
[ 2826] 0 2826 245248 4024 3 0 0 python
[ 2827] 0 2827 245248 6512 0 0 0 python
[ 2828] 0 2828 245248 4419 7 0 0 python
[ 2829] 0 2829 245248 13229 0 0 0 python
[ 2830] 0 2830 245248 2401 0 0 0 python
[ 2831] 0 2831 245248 2651 10 0 0 python
[ 2832] 0 2832 245248 4976 0 0 0 python
[ 2833] 0 2833 245248 6267 10 0 0 python
[ 2834] 0 2834 245248 3703 11 0 0 python
[ 2835] 0 2835 245248 4086 2 0 0 python
[ 2836] 0 2836 245248 6895 14 0 0 python
[ 2837] 0 2837 245248 3800 10 0 0 python
[ 2838] 0 2838 245248 8418 10 0 0 python
[ 2839] 0 2839 245248 3809 10 0 0 python
[ 2840] 0 2840 245248 2784 11 0 0 python
[ 2841] 0 2841 245248 3494 6 0 0 python
[ 2842] 0 2842 245248 4246 2 0 0 python
[ 2843] 0 2843 245248 5831 0 0 0 python
[ 2844] 0 2844 245248 7335 3 0 0 python
[ 2845] 0 2845 245248 5514 0 0 0 python
[ 2846] 0 2846 245248 6125 0 0 0 python
[ 2847] 0 2847 245248 5592 14 0 0 python
[ 2848] 0 2848 245248 5769 0 0 0 python
[ 2849] 0 2849 245248 4548 2 0 0 python
[ 2850] 0 2850 245248 7435 7 0 0 python
[ 2851] 0 2851 245248 6527 3 0 0 python
[ 2852] 0 2852 245248 3152 0 0 0 python
[ 2853] 0 2853 245248 5106 0 0 0 python
[ 2854] 0 2854 245248 5215 10 0 0 python
[ 2855] 0 2855 245248 4286 2 0 0 python
[ 2856] 0 2856 245248 6282 0 0 0 python
[ 2857] 0 2857 245248 3207 15 0 0 python
[ 2858] 0 2858 245248 5448 11 0 0 python
[ 2859] 0 2859 245248 3807 10 0 0 python
[ 2860] 0 2860 245248 3279 14 0 0 python
[ 2861] 0 2861 245248 4322 3 0 0 python
[ 2862] 0 2862 245248 4324 0 0 0 python
[ 2863] 0 2863 245248 3590 11 0 0 python
[ 2864] 0 2864 245248 7398 2 0 0 python
[ 2865] 0 2865 245248 5345 3 0 0 python
[ 2866] 0 2866 245248 5494 0 0 0 python
[ 2867] 0 2867 245248 5302 0 0 0 python
[ 2868] 0 2868 245248 6553 4 0 0 python
[ 2869] 0 2869 245248 4227 0 0 0 python
[ 2870] 0 2870 245248 4746 15 0 0 python
[ 2871] 0 2871 245248 5238 2 0 0 python
[ 2872] 0 2872 245248 4250 14 0 0 python
[ 2873] 0 2873 245248 7820 2 0 0 python
[ 2874] 0 2874 245248 3762 0 0 0 python
[ 2875] 0 2875 245248 4310 3 0 0 python
[ 2876] 0 2876 245248 3243 2 0 0 python
[ 2877] 0 2877 245248 3813 11 0 0 python
[ 2878] 0 2878 245248 5350 11 0 0 python
[ 2879] 0 2879 245248 5832 11 0 0 python
[ 2880] 0 2880 245248 4321 3 0 0 python
[ 2881] 0 2881 245248 4831 3 0 0 python
[ 2882] 0 2882 245248 3215 0 0 0 python
[ 2883] 0 2883 245248 2718 0 0 0 python
[ 2884] 0 2884 245248 5707 3 0 0 python
[ 2885] 0 2885 245248 4566 3 0 0 python
[ 2886] 0 2886 245248 5540 3 0 0 python
[ 2887] 0 2887 245248 6340 3 0 0 python
[ 2888] 0 2888 245248 4824 3 0 0 python
[ 2889] 0 2889 245248 4877 10 0 0 python
[ 2890] 0 2890 245248 3616 3 0 0 python
[ 2891] 0 2891 245248 3814 2 0 0 python
[ 2892] 0 2892 245248 4341 9 0 0 python
[ 2893] 0 2893 245248 5771 9 0 0 python
[ 2894] 0 2894 245248 3303 2 0 0 python
[ 2895] 0 2895 245248 4327 10 0 0 python
[ 2896] 0 2896 245248 2791 2 0 0 python
[ 2897] 0 2897 245248 4728 3 0 0 python
[ 2898] 0 2898 245248 4823 3 0 0 python
[ 2899] 0 2899 245248 4221 2 0 0 python
[ 2900] 0 2900 245248 3692 13 0 0 python
[ 2901] 0 2901 245248 7446 9 0 0 python
[ 2902] 0 2902 245248 3719 10 0 0 python
[ 2903] 0 2903 245248 6232 3 0 0 python
[ 2904] 0 2904 245248 4791 2 0 0 python
[ 2905] 0 2905 245248 6689 2 0 0 python
[ 2906] 0 2906 245248 6370 6 0 0 python
[ 2909] 0 2909 245248 3934 6 0 0 python
[ 2910] 0 2910 245248 2908 10 0 0 python
[ 2911] 0 2911 245248 2299 11 0 0 python
[ 2912] 0 2912 245248 5449 7 0 0 python
[ 2913] 0 2913 245248 3814 3 0 0 python
[ 2914] 0 2914 245248 3302 10 0 0 python
[ 2915] 0 2915 245248 4840 3 0 0 python
[ 2916] 0 2916 245248 3236 6 0 0 python
[ 2917] 0 2917 245248 4037 11 0 0 python
[ 2918] 0 2918 245248 2266 11 0 0 python
[ 2919] 0 2919 245248 2786 3 0 0 python
[ 2920] 0 2920 245248 8194 11 0 0 python
[ 2921] 0 2921 245248 2247 10 0 0 python
[ 2922] 0 2922 245248 4847 1 0 0 python
[ 2923] 0 2923 245248 3302 1 0 0 python
[ 2924] 0 2924 245248 3940 1 0 0 python
[ 2925] 0 2925 245248 4866 2 0 0 python
[ 2926] 0 2926 245248 3301 1 0 0 python
[ 2927] 0 2927 245248 1462 10 0 0 python
[ 2928] 0 2928 245248 1829 2 0 0 python
[ 2929] 0 2929 245248 4283 1 0 0 python
[ 2930] 0 2930 245248 3398 2 0 0 python
[ 2931] 0 2931 245248 7905 1 0 0 python
[ 2932] 0 2932 245248 4302 2 0 0 python
[ 2933] 0 2933 245248 2885 2 0 0 python
[ 2934] 0 2934 245248 6637 2 0 0 python
[ 2935] 0 2935 245248 2876 11 0 0 python
[ 2936] 0 2936 245248 3719 3 0 0 python
[ 2937] 0 2937 245248 2768 1 0 0 python
[ 2938] 0 2938 245248 1984 11 0 0 python
[ 2939] 0 2939 245248 2280 15 0 0 python
[ 2940] 0 2940 245248 1767 1 0 0 python
[ 2941] 0 2941 245248 3816 10 0 0 python
[ 2942] 0 2942 245248 2790 3 0 0 python
[ 2943] 0 2943 245248 3831 3 0 0 python
[ 2944] 0 2944 245248 3813 9 0 0 python
[ 2945] 0 2945 245248 4326 14 0 0 python
[ 2946] 0 2946 245248 2793 6 0 0 python
[ 2947] 0 2947 245248 4247 9 0 0 python
[ 2948] 0 2948 245248 3304 2 0 0 python
[ 2949] 0 2949 245248 4391 3 0 0 python
[ 2950] 0 2950 245248 3810 15 0 0 python
[ 2951] 0 2951 245248 2293 10 0 0 python
[ 2952] 0 2952 245248 4311 3 0 0 python
[ 2953] 0 2953 245248 4378 2 0 0 python
[ 2954] 0 2954 245248 4086 2 0 0 python
[ 2955] 0 2955 245248 2982 3 0 0 python
[ 2956] 0 2956 245248 2287 9 0 0 python
[ 2957] 0 2957 245248 5347 10 0 0 python
[ 2958] 0 2958 245248 5331 11 0 0 python
[ 2959] 0 2959 245248 1307 3 0 0 python
[ 2960] 0 2960 245248 4327 10 0 0 python
[ 2961] 0 2961 245248 3236 9 0 0 python
[ 2962] 0 2962 245248 3681 9 0 0 python
[ 2963] 0 2963 245248 3304 1 0 0 python
[ 2964] 0 2964 245248 3298 11 0 0 python
[ 2965] 0 2965 245248 5123 14 0 0 python
[ 2966] 0 2966 245248 4327 3 0 0 python
[ 2967] 0 2967 245248 4278 3 0 0 python
[ 2968] 0 2968 245248 2778 1 0 0 python
[ 2969] 0 2969 245248 3963 2 0 0 python
[ 2970] 0 2970 245248 3994 1 0 0 python
[ 2971] 0 2971 245248 3292 2 0 0 python
[ 2972] 0 2972 245248 3815 3 0 0 python
[ 2973] 0 2973 245248 5351 3 0 0 python
[ 2974] 0 2974 245248 6424 10 0 0 python
[ 2975] 0 2975 245248 2794 1 0 0 python
[ 2976] 0 2976 245248 4327 1 0 0 python
[ 2977] 0 2977 245248 3029 1 0 0 python
[ 2978] 0 2978 245248 4914 1 0 0 python
[ 2979] 0 2979 245248 6850 1 0 0 python
[ 2980] 0 2980 245248 3301 1 0 0 python
[ 2981] 0 2981 245248 3454 2 0 0 python
[ 2982] 0 2982 245248 2856 1 0 0 python
[ 2983] 0 2983 245248 2295 7 0 0 python
[ 2984] 0 2984 245248 4732 10 0 0 python
[ 2985] 0 2985 245248 3815 9 0 0 python
[ 2986] 0 2986 245248 1705 13 0 0 python
[ 2987] 0 2987 245248 2282 9 0 0 python
[ 2988] 0 2988 245248 3817 9 0 0 python
[ 2989] 0 2989 245248 2783 9 0 0 python
[ 2990] 0 2990 245248 4835 2 0 0 python
[ 2991] 0 2991 245248 4838 3 0 0 python
[ 2992] 0 2992 245248 229 12 0 0 python
[ 2993] 0 2993 245248 1768 3 0 0 python
[ 2994] 0 2994 245248 4802 3 0 0 python
[ 2995] 0 2995 245248 7995 9 0 0 python
[ 2996] 0 2996 245248 2141 12 0 0 python
[ 2997] 0 2997 245248 1741 2 0 0 python
[ 2998] 0 2998 245248 4905 14 0 0 python
[ 2999] 0 2999 245248 2789 3 0 0 python
[ 3000] 0 3000 245248 4321 2 0 0 python
[ 3001] 0 3001 245248 3816 11 0 0 python
[ 3002] 0 3002 245248 2790 2 0 0 python
[ 3003] 0 3003 245248 1760 2 0 0 python
[ 3004] 0 3004 245248 3290 9 0 0 python
[ 3005] 0 3005 245248 2793 3 0 0 python
[ 3006] 0 3006 245248 3811 3 0 0 python
[ 3007] 0 3007 245248 3302 9 0 0 python
[ 3008] 0 3008 245248 2304 12 0 0 python
[ 3009] 0 3009 245248 2797 9 0 0 python
[ 3010] 0 3010 245248 2723 9 0 0 python
[ 3011] 0 3011 245248 1769 9 0 0 python
[ 3017] 0 3017 245248 1823 11 0 0 python
[ 3018] 0 3018 245248 2794 11 0 0 python
[ 3019] 0 3019 245248 3817 3 0 0 python
[ 3020] 0 3020 245248 1769 14 0 0 python
[ 3022] 0 3022 245248 1837 15 0 0 python
[ 3023] 0 3023 245248 2282 10 0 0 python
[ 3024] 0 3024 245248 2282 10 0 0 python
[ 3025] 0 3025 245248 2278 3 0 0 python
[ 3026] 0 3026 245248 2282 14 0 0 python
[ 3027] 0 3027 245248 2791 2 0 0 python
[ 3028] 0 3028 245248 1461 9 0 0 python
[ 3029] 0 3029 245248 1773 3 0 0 python
[ 3030] 0 3030 245248 2280 9 0 0 python
[ 3031] 0 3031 245248 3862 9 0 0 python
[ 3032] 0 3032 245248 2381 11 0 0 python
[ 3033] 0 3033 245248 2437 9 0 0 python
[ 3034] 0 3034 245248 1769 9 0 0 python
[ 3035] 0 3035 245248 3144 10 0 0 python
[ 3036] 0 3036 245248 2676 11 0 0 python
[ 3037] 0 3037 245248 214 11 0 0 python
[ 3038] 0 3038 245248 2389 9 0 0 python
[ 3039] 0 3039 245248 2386 9 0 0 python
[ 3040] 0 3040 245248 2334 2 0 0 python
[ 3041] 0 3041 245248 3819 0 0 0 python
[ 3042] 0 3042 245248 2373 3 0 0 python
[ 3043] 0 3043 245248 1259 9 0 0 python
[ 3044] 0 3044 245248 2183 3 0 0 python
[ 3045] 0 3045 245248 5869 14 0 0 python
[ 3046] 0 3046 245248 2281 10 0 0 python
[ 3047] 0 3047 245248 2791 9 0 0 python
[ 3048] 0 3048 245248 3820 12 0 0 python
[ 3049] 0 3049 245248 2792 10 0 0 python
[ 3050] 0 3050 245248 1449 3 0 0 python
[ 3051] 0 3051 245248 1769 9 0 0 python
[ 3052] 0 3052 245248 4330 10 0 0 python
[ 3053] 0 3053 245248 1731 9 0 0 python
[ 3054] 0 3054 245248 1257 9 0 0 python
[ 3055] 0 3055 245248 1207 14 0 0 python
[ 3056] 0 3056 245248 184 9 0 0 python
[ 3057] 0 3057 245248 1255 4 0 0 python
[ 3058] 0 3058 245248 1769 2 0 0 python
[ 3059] 0 3059 245248 2234 9 0 0 python
[ 3060] 0 3060 245248 2795 4 0 0 python
[ 3061] 0 3061 245248 1768 4 0 0 python
[ 3062] 0 3062 245248 748 10 0 0 python
[ 3063] 0 3063 245248 1955 15 0 0 python
[ 3064] 0 3064 245248 1260 9 0 0 python
[ 3065] 0 3065 245248 1350 6 0 0 python
[ 3066] 0 3066 245248 1769 9 0 0 python
[ 3067] 0 3067 245248 3307 2 0 0 python
[ 3068] 0 3068 245248 2276 6 0 0 python
[ 3069] 0 3069 245248 1877 10 0 0 python
[ 3070] 0 3070 245248 2702 0 0 0 python
[ 3071] 0 3071 245248 1805 10 0 0 python
[ 3072] 0 3072 245248 1283 9 0 0 python
[ 3073] 0 3073 245248 2282 6 0 0 python
[ 3074] 0 3074 245248 3306 2 0 0 python
[ 3075] 0 3075 245248 2283 2 0 0 python
[ 3076] 0 3076 245248 216 3 0 0 python
[ 3077] 0 3077 245248 2282 11 0 0 python
[ 3078] 0 3078 245248 2045 2 0 0 python
[ 3079] 0 3079 245248 2794 7 0 0 python
[ 3080] 0 3080 245248 1764 10 0 0 python
[ 3081] 0 3081 245248 1769 13 0 0 python
[ 3082] 0 3082 245248 1258 3 0 0 python
[ 3083] 0 3083 245248 2283 9 0 0 python
[ 3084] 0 3084 245248 1351 9 0 0 python
[ 3085] 0 3085 245248 1256 9 0 0 python
[ 3086] 0 3086 245248 2282 9 0 0 python
[ 3087] 0 3087 245248 2771 4 0 0 python
[ 3088] 0 3088 245248 3839 3 0 0 python
[ 3089] 0 3089 245248 2271 11 0 0 python
[ 3090] 0 3090 245248 2082 10 0 0 python
[ 3091] 0 3091 245248 3285 2 0 0 python
[ 3092] 0 3092 245248 722 9 0 0 python
[ 3093] 0 3093 245248 1768 2 0 0 python
[ 3094] 0 3094 245248 1259 9 0 0 python
[ 3095] 0 3095 245248 2283 9 0 0 python
[ 3096] 0 3096 245248 1314 10 0 0 python
[ 3097] 0 3097 245248 2441 9 0 0 python
[ 3098] 0 3098 245248 1770 2 0 0 python
[ 3099] 0 3099 245248 1261 10 0 0 python
[ 3100] 0 3100 245248 2338 9 0 0 python
[ 3101] 0 3101 245248 1770 2 0 0 python
[ 3102] 0 3102 245248 1752 9 0 0 python
[ 3103] 0 3103 245248 1937 10 0 0 python
[ 3104] 0 3104 245248 1768 10 0 0 python
[ 3108] 0 3108 245248 1773 9 0 0 python
[ 3109] 0 3109 245248 746 2 0 0 python
[ 3110] 0 3110 245248 2794 11 0 0 python
[ 3111] 0 3111 245248 3546 9 0 0 python
[ 3112] 0 3112 245248 3307 10 0 0 python
[ 3113] 0 3113 245248 2665 11 0 0 python
[ 3114] 0 3114 245248 214 9 0 0 python
[ 3115] 0 3115 245248 2268 9 0 0 python
[ 3116] 0 3116 245248 1772 9 0 0 python
[ 3117] 0 3117 245248 216 11 0 0 python
[ 3118] 0 3118 245248 2791 10 0 0 python
[ 3119] 0 3119 245248 746 3 0 0 python
[ 3120] 0 3120 245248 1257 10 0 0 python
[ 3121] 0 3121 245248 1418 10 0 0 python
[ 3122] 0 3122 245248 1262 9 0 0 python
[ 3123] 0 3123 245248 1260 9 0 0 python
[ 3124] 0 3124 245248 1771 15 0 0 python
[ 3125] 0 3125 245248 216 11 0 0 python
[ 3126] 0 3126 245248 1305 9 0 0 python
[ 3127] 0 3127 245248 1247 12 0 0 python
[ 3128] 0 3128 245248 2221 4 0 0 python
[ 3129] 0 3129 245248 746 2 0 0 python
[ 3130] 0 3130 245248 746 11 0 0 python
[ 3131] 0 3131 245248 743 11 0 0 python
[ 3132] 0 3132 245248 218 4 0 0 python
[ 3133] 0 3133 245248 1770 2 0 0 python
[ 3134] 0 3134 245248 232 10 0 0 python
[ 3135] 0 3135 41834 474 2 0 0 python
[ 3136] 0 3136 245248 217 1 0 0 python
[ 3139] 0 3139 245248 215 11 0 0 python
[ 3140] 0 3140 245248 214 1 0 0 python
[ 3141] 0 3141 245248 215 3 0 0 python
[ 3142] 0 3142 245248 216 1 0 0 python
[ 3143] 0 3143 245248 215 7 0 0 python
[ 3144] 0 3144 245248 217 10 0 0 python
[ 3145] 0 3145 245248 216 12 0 0 python
[ 3146] 0 3146 41834 140 2 0 0 python
[ 3157] 0 3157 41834 140 0 0 0 python
[ 3158] 0 3158 41834 127 3 0 0 python
[ 3159] 0 3159 41834 133 2 0 0 python
[ 3160] 0 3160 41834 123 3 0 0 python
[ 3161] 0 3161 41834 117 3 0 0 python
[ 3162] 0 3162 41834 113 3 0 0 python
[ 3164] 0 3164 41834 107 1 0 0 python
[ 3166] 0 3166 41834 98 3 0 0 python
Out of memory: Kill process 2103 (mingetty) score 1 or sacrifice child
Killed process 2103 (mingetty) total-vm:4100kB, anon-rss:0kB, file-rss:0kB
python invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
python cpuset=/ mems_allowed=0-1

Out of memory: Kill process 3246 (agetty) score 1 or sacrifice child
Killed process 3246 (agetty) total-vm:4116kB, anon-rss:72kB, file-rss:0kB
init: tty (init: tty (/devinit: tty (/dev/tty4) main process (3169) killed by KILL signal
init: tty (/dev/tty4) main process ended, respawning
init: tty (/dev/tty5) main process (3170) killed by KILL signal
init: tty (/dev/tty5) main process ended, respawning
init: tty (/dev/tty6) main process (3171) killed by KILL signal
init: tty (/dev/tty6) main process ended, respawning
init: serial (ttyS0) main process (3246) killed by KILL signal
init: serial (ttyS0) main process ended, respawning

> I'm indifferent to the actual scale of OOM_SCORE_MAX_FACTOR; it could
> be
> 10 as proposed in this patch or even increased higher for higher
> resolution.
>
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -38,6 +38,9 @@ int sysctl_oom_kill_allocating_task;
> int sysctl_oom_dump_tasks = 1;
> static DEFINE_SPINLOCK(zone_scan_lock);
>
> +#define OOM_SCORE_MAX_FACTOR 10
> +#define OOM_SCORE_MAX (OOM_SCORE_ADJ_MAX * OOM_SCORE_MAX_FACTOR)
> +
> #ifdef CONFIG_NUMA
> /**
> * has_intersects_mems_allowed() - check task eligiblity for kill
> @@ -160,7 +163,7 @@ unsigned int oom_badness(struct task_struct *p,
> struct mem_cgroup *mem,
> */
> if (p->flags & PF_OOM_ORIGIN) {
> task_unlock(p);
> - return 1000;
> + return OOM_SCORE_MAX;
> }
>
> /*
> @@ -177,32 +180,38 @@ unsigned int oom_badness(struct task_struct *p,
> struct mem_cgroup *mem,
> points = get_mm_rss(p->mm) + p->mm->nr_ptes;
> points += get_mm_counter(p->mm, MM_SWAPENTS);
>
> - points *= 1000;
> + points *= OOM_SCORE_MAX;
> points /= totalpages;
> task_unlock(p);
>
> /*
> - * Root processes get 3% bonus, just like the __vm_enough_memory()
> - * implementation used by LSMs.
> + * Root processes get a bonus of 1% per 10% of memory used.
> */
> - if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> - points -= 30;
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN)) {
> + int bonus;
> + int granularity;
> +
> + bonus = OOM_SCORE_MAX / 100; /* bonus is 1% */
> + granularity = OOM_SCORE_MAX / 10; /* granularity is 10% */
> +
> + points -= bonus * (points / granularity);
> + }
>
> /*
> * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
> * either completely disable oom killing or always prefer a certain
> * task.
> */
> - points += p->signal->oom_score_adj;
> + points += p->signal->oom_score_adj * OOM_SCORE_MAX_FACTOR;
>
> /*
> * Never return 0 for an eligible task that may be killed since it's
> - * possible that no single user task uses more than 0.1% of memory
> and
> + * possible that no single user task uses more than 0.01% of memory
> and
> * no single admin tasks uses more than 3.0%.
> */
> if (points <= 0)
> return 1;
> - return (points < 1000) ? points : 1000;
> + return (points < OOM_SCORE_MAX) ? points : OOM_SCORE_MAX;
> }
>
> /*
> @@ -314,7 +323,7 @@ static struct task_struct
> *select_bad_process(unsigned int *ppoints,
> */
> if (p == current) {
> chosen = p;
> - *ppoints = 1000;
> + *ppoints = OOM_SCORE_MAX;
> } else {
> /*
> * If this task is not being ptraced on exit,

2011-05-26 09:34:51

[permalink] [raw]

Subject: Re: [PATCH 3/5] oom: oom-killer don't use proportion of system-ram internally

Hello KOSAKI,

----- Original Message -----
> CAI Qian reported his kernel did hang-up if he ran fork intensive
> workload and then invoke oom-killer.
>
> The problem is, current oom calculation uses 0-1000 normalized value
> (The unit is a permillage of system-ram). Its low precision make
> a lot of same oom score. IOW, in his case, all processes have smaller
> oom score than 1 and internal calculation round it to 1.
>
> Thus oom-killer kill ineligible process. This regression is caused by
> commit a63d83f427 (oom: badness heuristic rewrite).
>
> The solution is, the internal calculation just use number of pages
> instead of permillage of system-ram. And convert it to permillage
> value at displaying time.
>
> This patch doesn't change any ABI (included /proc/<pid>/oom_score_adj)
> even though current logic has a lot of my dislike thing.
>
> Reported-by: CAI Qian <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> fs/proc/base.c | 13 ++++++----
> include/linux/oom.h | 7 +----
> mm/oom_kill.c | 60 +++++++++++++++++++++++++++++++++-----------------
> 3 files changed, 49 insertions(+), 31 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index dfa5327..d6b0424 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -476,14 +476,17 @@ static const struct file_operations
> proc_lstats_operations = {
>
> static int proc_oom_score(struct task_struct *task, char *buffer)
> {
> - unsigned long points = 0;
> + unsigned long points;
> + unsigned long ratio = 0;
> + unsigned long totalpages = totalram_pages + total_swap_pages + 1;
>
> read_lock(&tasklist_lock);
> - if (pid_alive(task))
> - points = oom_badness(task, NULL, NULL,
> - totalram_pages + total_swap_pages);
> + if (pid_alive(task)) {
> + points = oom_badness(task, NULL, NULL, totalpages);
> + ratio = points * 1000 / totalpages;
> + }
> read_unlock(&tasklist_lock);
> - return sprintf(buffer, "%lu\n", points);
> + return sprintf(buffer, "%lu\n", ratio);
> }
>
> struct limit_names {
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 5e3aa83..0f5b588 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -40,7 +40,8 @@ enum oom_constraint {
> CONSTRAINT_MEMCG,
> };
>
> -extern unsigned int oom_badness(struct task_struct *p, struct
> mem_cgroup *mem,
> +/* The badness from the OOM killer */
> +extern unsigned long oom_badness(struct task_struct *p, struct
> mem_cgroup *mem,
> const nodemask_t *nodemask, unsigned long totalpages);
> extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t
> gfp_flags);
> extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t
> gfp_flags);
> @@ -62,10 +63,6 @@ static inline void oom_killer_enable(void)
> oom_killer_disabled = false;
> }
>
> -/* The badness from the OOM killer */
> -extern unsigned long badness(struct task_struct *p, struct mem_cgroup
> *mem,
> - const nodemask_t *nodemask, unsigned long uptime);
> -
> extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>
> /* sysctls */
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index e6a6c6f..8bbc3df 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -132,10 +132,12 @@ static bool oom_unkillable_task(struct
> task_struct *p,
> * predictable as possible. The goal is to return the highest value for
> the
> * task consuming the most memory to avoid subsequent oom failures.
> */
> -unsigned int oom_badness(struct task_struct *p, struct mem_cgroup
> *mem,
> +unsigned long oom_badness(struct task_struct *p, struct mem_cgroup
> *mem,
> const nodemask_t *nodemask, unsigned long totalpages)
> {
> - int points;
> + unsigned long points;
> + unsigned long score_adj = 0;
> +
>
> if (oom_unkillable_task(p, mem, nodemask))
> return 0;
> @@ -160,7 +162,7 @@ unsigned int oom_badness(struct task_struct *p,
> struct mem_cgroup *mem,
> */
> if (p->flags & PF_OOM_ORIGIN) {
> task_unlock(p);
> - return 1000;
> + return ULONG_MAX;
> }
This part failed to apply to the latest git tree so unable to test those
patches this time. Can you fix that?

Thanks,
CAI Qian
> /*
> @@ -176,33 +178,49 @@ unsigned int oom_badness(struct task_struct *p,
> struct mem_cgroup *mem,
> */
> points = get_mm_rss(p->mm) + p->mm->nr_ptes;
> points += get_mm_counter(p->mm, MM_SWAPENTS);
> -
> - points *= 1000;
> - points /= totalpages;
> task_unlock(p);
>
> /*
> * Root processes get 3% bonus, just like the __vm_enough_memory()
> * implementation used by LSMs.
> + *
> + * XXX: Too large bonus, example, if the system have tera-bytes
> memory..
> */
> - if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> - points -= 30;
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN)) {
> + if (points >= totalpages / 32)
> + points -= totalpages / 32;
> + else
> + points = 0;
> + }
>
> /*
> * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
> * either completely disable oom killing or always prefer a certain
> * task.
> */
> - points += p->signal->oom_score_adj;
> + if (p->signal->oom_score_adj >= 0) {
> + score_adj = p->signal->oom_score_adj * (totalpages / 1000);
> + if (ULONG_MAX - points >= score_adj)
> + points += score_adj;
> + else
> + points = ULONG_MAX;
> + } else {
> + score_adj = -p->signal->oom_score_adj * (totalpages / 1000);
> + if (points >= score_adj)
> + points -= score_adj;
> + else
> + points = 0;
> + }
>
> /*
> * Never return 0 for an eligible task that may be killed since it's
> * possible that no single user task uses more than 0.1% of memory and
> * no single admin tasks uses more than 3.0%.
> */
> - if (points <= 0)
> - return 1;
> - return (points < 1000) ? points : 1000;
> + if (!points)
> + points = 1;
> +
> + return points;
> }
>
> /*
> @@ -274,7 +292,7 @@ static enum oom_constraint
> constrained_alloc(struct zonelist *zonelist,
> *
> * (not docbooked, we don't want this one cluttering up the manual)
> */
> -static struct task_struct *select_bad_process(unsigned int *ppoints,
> +static struct task_struct *select_bad_process(unsigned long *ppoints,
> unsigned long totalpages, struct mem_cgroup *mem,
> const nodemask_t *nodemask)
> {
> @@ -283,7 +301,7 @@ static struct task_struct
> *select_bad_process(unsigned int *ppoints,
> *ppoints = 0;
>
> do_each_thread_reverse(g, p) {
> - unsigned int points;
> + unsigned long points;
>
> if (!p->mm)
> continue;
> @@ -314,7 +332,7 @@ static struct task_struct
> *select_bad_process(unsigned int *ppoints,
> */
> if (p == current) {
> chosen = p;
> - *ppoints = 1000;
> + *ppoints = ULONG_MAX;
> } else {
> /*
> * If this task is not being ptraced on exit,
> @@ -445,14 +463,14 @@ static int oom_kill_task(struct task_struct *p,
> struct mem_cgroup *mem)
> #undef K
>
> static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int
> order,
> - unsigned int points, unsigned long totalpages,
> + unsigned long points, unsigned long totalpages,
> struct mem_cgroup *mem, nodemask_t *nodemask,
> const char *message)
> {
> struct task_struct *victim = p;
> struct task_struct *child;
> struct task_struct *t = p;
> - unsigned int victim_points = 0;
> + unsigned long victim_points = 0;
>
> if (printk_ratelimit())
> dump_header(p, gfp_mask, order, mem, nodemask);
> @@ -467,7 +485,7 @@ static int oom_kill_process(struct task_struct *p,
> gfp_t gfp_mask, int order,
> }
>
> task_lock(p);
> - pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
> + pr_err("%s: Kill process %d (%s) points %lu or sacrifice child\n",
> message, task_pid_nr(p), p->comm, points);
> task_unlock(p);
>
> @@ -479,7 +497,7 @@ static int oom_kill_process(struct task_struct *p,
> gfp_t gfp_mask, int order,
> */
> do {
> list_for_each_entry(child, &t->children, sibling) {
> - unsigned int child_points;
> + unsigned long child_points;
>
> if (child->mm == p->mm)
> continue;
> @@ -526,7 +544,7 @@ static void check_panic_on_oom(enum oom_constraint
> constraint, gfp_t gfp_mask,
> void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> {
> unsigned long limit;
> - unsigned int points = 0;
> + unsigned long points = 0;
> struct task_struct *p;
>
> /*
> @@ -675,7 +693,7 @@ void out_of_memory(struct zonelist *zonelist,
> gfp_t gfp_mask,
> struct task_struct *p;
> unsigned long totalpages;
> unsigned long freed = 0;
> - unsigned int points;
> + unsigned long points;
> enum oom_constraint constraint = CONSTRAINT_NONE;
> int killed = 0;
>
> --
> 1.7.3.1
>
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign
> http://stopthemeter.ca/
> Don't email: href=mailto:"[email protected]"> [email protected]

2011-05-26 09:56:38

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH 3/5] oom: oom-killer don't use proportion of system-ram internally

>> @@ -160,7 +162,7 @@ unsigned int oom_badness(struct task_struct *p,
>> struct mem_cgroup *mem,
>> */
>> if (p->flags & PF_OOM_ORIGIN) {
>> task_unlock(p);
>> - return 1000;
>> + return ULONG_MAX;
>> }
> This part failed to apply to the latest git tree so unable to test those
> patches this time. Can you fix that?

Please apply ontop mmotm-0512.

Thanks.

2011-05-27 19:13:07

by David Rientjes

[permalink] [raw]

Subject: Re: [PATCH 3/5] oom: oom-killer don't use proportion of system-ram internally

On Thu, 26 May 2011, CAI Qian wrote:

> Here is the results for the testing. Running the reproducer as non-root
> user, the results look good as OOM killer just killed each python process
> in-turn that the reproducer forked. However, when running it as root
> user, sshd and other random processes had been killed.
>

Thanks for testing! The patch that I proposed for you was a little more
conservative in terms of providing a bonus to root processes that aren't
using a certain threshold of memory. My latest proposal was to give root
processes only a 1% bonus for every 10% of memory they consume, so it
would be impossible for them to have an oom score of 1 as reported in your
logs.

I believe that KOSAKI-san is refreshing his series of patches, so let's
look at how your workload behaves on the next iteration. Thanks CAI!

2011-05-30 01:17:38

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

> I'm afraid that a second time through the tasklist in select_bad_process()
> is simply a non-starter for _any_ case; it significantly increases the
> amount of time that tasklist_lock is held and causes problems elsewhere on
> large systems -- such as some of ours -- since irqs are disabled while
> waiting for the writeside of the lock. I think it would be better to use
> a proportional privilege for root processes based on the amount of memory
> they are using (discounting 1% of memory per 10% of memory used, as
> proposed earlier, seems sane) so we can always protect root when necessary
> and never iterate through the list again.
>
> Please look into the earlier review comments on the other patches, refresh
> the series, and post it again. Thanks!

Never mind.

You never see to increase tasklist_lock. You never seen all processes
have root privilege case.

2011-05-31 01:33:46

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

Hello,

Have tested those patches rebased from KOSAKI for the latest mainline.
It still killed random processes and recevied a panic at the end by
using root user. The full oom output can be found here.
http://people.redhat.com/qcai/oom

Cheers,
CAI Qian

----- Original Message -----
> CAI Qian reported current oom logic doesn't work at all on his 16GB
> RAM
> machine. oom killer killed all system daemon at first and his system
> stopped responding.
>
> The brief log is below.
>
> > Out of memory: Kill process 1175 (dhclient) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1247 (rsyslogd) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1284 (irqbalance) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1303 (rpcbind) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1321 (rpc.statd) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1333 (mdadm) score 1 or sacrifice child
> > Out of memory: Kill process 1365 (rpc.idmapd) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1403 (dbus-daemon) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1438 (acpid) score 1 or sacrifice child
> > Out of memory: Kill process 1447 (hald) score 1 or sacrifice child
> > Out of memory: Kill process 1447 (hald) score 1 or sacrifice child
> > Out of memory: Kill process 1487 (hald-addon-inpu) score 1 or
> > sacrifice child
> > Out of memory: Kill process 1488 (hald-addon-acpi) score 1 or
> > sacrifice child
> > Out of memory: Kill process 1507 (automount) score 1 or sacrifice
> > child
>
>
> The problems are three.
>
> 1) if two processes have the same oom score, we should kill younger
> process.
> but current logic kill older. Typically oldest processes are system
> daemons.
> 2) Current logic use 'unsigned int' for internal score calculation.
> (exactly says,
> it only use 0-1000 value). its very low precision calculation makes a
> lot of
> same oom score and kill an ineligible process.
> 3) Current logic give 3% of SystemRAM to root processes. It obviously
> too big
> if you have plenty memory. Now, your fork-bomb processes have 500MB
> OOM immune
> bonus. then your fork-bomb never ever be killed.
>
>
> KOSAKI Motohiro (5):
> oom: improve dump_tasks() show items
> oom: kill younger process first
> oom: oom-killer don't use proportion of system-ram internally
> oom: don't kill random process
> oom: merge oom_kill_process() with oom_kill_task()
>
> fs/proc/base.c | 13 ++-
> include/linux/oom.h | 10 +--
> include/linux/sched.h | 11 +++
> mm/oom_kill.c | 201 +++++++++++++++++++++++++++----------------------
> 4 files changed, 135 insertions(+), 100 deletions(-)
>
> --
> 1.7.3.1
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign
> http://stopthemeter.ca/
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2011-05-31 04:11:06

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

(2011/05/31 10:33), CAI Qian wrote:
> Hello,
>
> Have tested those patches rebased from KOSAKI for the latest mainline.
> It still killed random processes and recevied a panic at the end by
> using root user. The full oom output can be found here.
> http://people.redhat.com/qcai/oom

You ran fork-bomb as root. Therefore unprivileged process was killed at first.
It's no random. It's intentional and desirable. I mean

- If you run the same progream as non-root, python will be killed at first.
Because it consume a lot of memory than daemons.
- If you run the same program as root, non root process and privilege explicit
dropping processes (e.g. irqbalance) will be killed at first.

Look, your log says, highest oom score process was killed first.

Out of memory: Kill process 5462 (abrtd) points:393 total-vm:262300kB, anon-rss:1024kB, file-rss:0kB
Out of memory: Kill process 5277 (hald) points:303 total-vm:25444kB, anon-rss:1116kB, file-rss:0kB
Out of memory: Kill process 5720 (sshd) points:258 total-vm:97684kB, anon-rss:824kB, file-rss:0kB
Out of memory: Kill process 5457 (pickup) points:236 total-vm:78672kB, anon-rss:768kB, file-rss:0kB
Out of memory: Kill process 5451 (master) points:235 total-vm:78592kB, anon-rss:796kB, file-rss:0kB
Out of memory: Kill process 5458 (qmgr) points:233 total-vm:78740kB, anon-rss:764kB, file-rss:0kB
Out of memory: Kill process 5353 (sshd) points:189 total-vm:63992kB, anon-rss:620kB, file-rss:0kB
Out of memory: Kill process 1626 (dhclient) points:129 total-vm:9148kB, anon-rss:484kB, file-rss:0kB

2011-05-31 04:15:15

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

----- Original Message -----
> (2011/05/31 10:33), CAI Qian wrote:
> > Hello,
> >
> > Have tested those patches rebased from KOSAKI for the latest
> > mainline.
> > It still killed random processes and recevied a panic at the end by
> > using root user. The full oom output can be found here.
> > http://people.redhat.com/qcai/oom
>
> You ran fork-bomb as root. Therefore unprivileged process was killed
> at first.
> It's no random. It's intentional and desirable. I mean
>
> - If you run the same progream as non-root, python will be killed at
> first.
> Because it consume a lot of memory than daemons.
> - If you run the same program as root, non root process and privilege
> explicit
> dropping processes (e.g. irqbalance) will be killed at first.
>
>
> Look, your log says, highest oom score process was killed first.
>
> Out of memory: Kill process 5462 (abrtd) points:393 total-vm:262300kB,
> anon-rss:1024kB, file-rss:0kB
> Out of memory: Kill process 5277 (hald) points:303 total-vm:25444kB,
> anon-rss:1116kB, file-rss:0kB
> Out of memory: Kill process 5720 (sshd) points:258 total-vm:97684kB,
> anon-rss:824kB, file-rss:0kB
> Out of memory: Kill process 5457 (pickup) points:236 total-vm:78672kB,
> anon-rss:768kB, file-rss:0kB
> Out of memory: Kill process 5451 (master) points:235 total-vm:78592kB,
> anon-rss:796kB, file-rss:0kB
> Out of memory: Kill process 5458 (qmgr) points:233 total-vm:78740kB,
> anon-rss:764kB, file-rss:0kB
> Out of memory: Kill process 5353 (sshd) points:189 total-vm:63992kB,
> anon-rss:620kB, file-rss:0kB
> Out of memory: Kill process 1626 (dhclient) points:129
> total-vm:9148kB, anon-rss:484kB, file-rss:0kB
OK, there was also a panic at the end. Is that expected?

BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8
IP: [<ffffffff811227d4>] get_mm_counter+0x14/0x30
PGD 0
Oops: 0000 [#1] SMP
CPU 7
Modules linked in: autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 dm_mirror dm_region_hash dm_log microcode serio_raw pcspkr cdc_ether usbnet mii i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support sg shpchp ioatdma dca i7core_edac edac_core bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptsas mptscsih mptbase scsi_transport_sas dm_mod [last unloaded: scsi_wait_scan]

Pid: 5232, comm: dbus-daemon Not tainted 3.0.0-rc1+ #3 IBM System x3550 M3 -[7944I21]-/69Y4438
RIP: 0010:[<ffffffff811227d4>] [<ffffffff811227d4>] get_mm_counter+0x14/0x30
RSP: 0000:ffff88027116b828 EFLAGS: 00010286
RAX: 00000000000002a0 RBX: ffff880470cd8a80 RCX: 0000000000000003
RDX: 000000000000000e RSI: 0000000000000002 RDI: 0000000000000000
RBP: ffff88027116b828 R08: 0000000000000000 R09: 0000000000000010
R10: 0000000000000000 R11: 0000000000000007 R12: ffff88027116b880
R13: 0000000000000000 R14: 0000000000000000 R15: ffff880270df2100
FS: 00007f78a3837700(0000) GS:ffff88047fc60000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000002a8 CR3: 000000047238f000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dbus-daemon (pid: 5232, threadinfo ffff88027116a000, task ffff880270df2100)
Stack:
ffff88027116b8b8 ffffffff81104c60 0000000000000000 0000000000000000
ffff8802704c4680 0000000000000000 ffff8802705161c0 0000000000000000
0000000000000000 0000000000000000 0000000000000286 ffff880470cd8e98
Call Trace:
[<ffffffff81104c60>] dump_tasks+0xa0/0x160
[<ffffffff81104dd5>] dump_header+0xb5/0xd0
[<ffffffff81104f15>] oom_kill_process+0xa5/0x1c0
[<ffffffff811055ef>] out_of_memory+0xff/0x220
[<ffffffff8110a962>] __alloc_pages_slowpath+0x632/0x6b0
[<ffffffff8110ab84>] __alloc_pages_nodemask+0x1a4/0x1f0
[<ffffffff81147d52>] kmem_getpages+0x62/0x170
[<ffffffff8114886a>] fallback_alloc+0x1ba/0x270
[<ffffffff811482e3>] ? cache_grow+0x2c3/0x2f0
[<ffffffff811485f5>] ____cache_alloc_node+0x95/0x150
[<ffffffff8114901d>] kmem_cache_alloc+0xfd/0x190
[<ffffffff810d20ed>] taskstats_exit+0x1cd/0x240
[<ffffffff81066667>] do_exit+0x177/0x430
[<ffffffff81066971>] do_group_exit+0x51/0xc0
[<ffffffff81078583>] get_signal_to_deliver+0x203/0x470
[<ffffffff8100b939>] do_signal+0x69/0x190
[<ffffffff8100bac5>] do_notify_resume+0x65/0x80
[<ffffffff814db6d0>] int_signal+0x12/0x17
Code: 48 8b 00 c9 48 d1 e8 83 e0 01 c3 0f 1f 40 00 31 c0 c9 c3 0f 1f 40 00 55 48 89 e5 66 66 66 66 90 48 63 f6 48 8d 84 f7 90 02 00 00
8b 50 08 31 c0 c9 48 85 d2 48 0f 49 c2 c3 66 66 66 66 2e 0f
RIP [<ffffffff811227d4>] get_mm_counter+0x14/0x30
RSP <ffff88027116b828>
CR2: 00000000000002a8
---[ end trace 742b26ee0c4fab73 ]---
Fixing recursive fault but reboot is needed!
Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
Pid: 4, comm: kworker/0:0 Tainted: G D 3.0.0-rc1+ #3
Call Trace:
<NMI> [<ffffffff814d062f>] panic+0x91/0x1a8
[<ffffffff810c76e1>] watchdog_overflow_callback+0xb1/0xc0
[<ffffffff810fbbdd>] __perf_event_overflow+0x9d/0x250
[<ffffffff810fc1c4>] perf_event_overflow+0x14/0x20
[<ffffffff8101df36>] intel_pmu_handle_irq+0x326/0x530
[<ffffffff814d4ba9>] perf_event_nmi_handler+0x29/0xa0
[<ffffffff814d6f65>] notifier_call_chain+0x55/0x80
[<ffffffff814d6fca>] atomic_notifier_call_chain+0x1a/0x20
[<ffffffff814d6ffe>] notify_die+0x2e/0x30
[<ffffffff814d4199>] default_do_nmi+0x39/0x1f0
[<ffffffff814d43d0>] do_nmi+0x80/0xa0
[<ffffffff814d3b90>] nmi+0x20/0x30
[<ffffffff8123f379>] ? __write_lock_failed+0x9/0x20
<<EOE>> [<ffffffff814d32de>] ? _raw_write_lock_irq+0x1e/0x20
[<ffffffff81065cec>] forget_original_parent+0x3c/0x330
[<ffffffff81065ffb>] exit_notify+0x1b/0x190
[<ffffffff810666ed>] do_exit+0x1fd/0x430
[<ffffffff8107fae0>] ? manage_workers+0x120/0x120
[<ffffffff810846ce>] kthread+0x8e/0xa0
[<ffffffff814dc544>] kernel_thread_helper+0x4/0x10
[<ffffffff81084640>] ? kthread_worker_fn+0x1a0/0x1a0
[<ffffffff814dc540>] ? gs_change+0x13/0x13

2011-05-31 04:32:59

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

(2011/05/31 13:10), KOSAKI Motohiro wrote:
> (2011/05/31 10:33), CAI Qian wrote:
>> Hello,
>>
>> Have tested those patches rebased from KOSAKI for the latest mainline.
>> It still killed random processes and recevied a panic at the end by
>> using root user. The full oom output can be found here.
>> http://people.redhat.com/qcai/oom
>
> You ran fork-bomb as root. Therefore unprivileged process was killed at first.
> It's no random. It's intentional and desirable. I mean
>
> - If you run the same progream as non-root, python will be killed at first.
> Because it consume a lot of memory than daemons.
> - If you run the same program as root, non root process and privilege explicit
> dropping processes (e.g. irqbalance) will be killed at first.

I mean, oom-killer start to kill python after killing all unprivilege process
in this case. Please wait & see ahile after sequence.

2011-05-31 04:34:19

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

> OK, there was also a panic at the end. Is that expected?

Definitely, no.
At least, I can't reproduce it. Can you reproduce it?

>
> BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8
> IP: [<ffffffff811227d4>] get_mm_counter+0x14/0x30
> PGD 0
> Oops: 0000 [#1] SMP
> CPU 7
> Modules linked in: autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 dm_mirror dm_region_hash dm_log microcode serio_raw pcspkr cdc_ether usbnet mii i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support sg shpchp ioatdma dca i7core_edac edac_core bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptsas mptscsih mptbase scsi_transport_sas dm_mod [last unloaded: scsi_wait_scan]

2011-05-31 04:48:37

by David Rientjes

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

On Mon, 30 May 2011, KOSAKI Motohiro wrote:

> Never mind.
>
> You never see to increase tasklist_lock. You never seen all processes
> have root privilege case.
>

I don't really understand what you're trying to say, sorry.

2011-05-31 04:49:14

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

> OK, there was also a panic at the end. Is that expected?
>
> BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8
> IP: [<ffffffff811227d4>] get_mm_counter+0x14/0x30
> PGD 0
> Oops: 0000 [#1] SMP
> CPU 7
> Modules linked in: autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 dm_mirror dm_region_hash dm_log microcode serio_raw pcspkr cdc_ether usbnet mii i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support sg shpchp ioatdma dca i7core_edac edac_core bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptsas mptscsih mptbase scsi_transport_sas dm_mod [last unloaded: scsi_wait_scan]

My fault. my [1/5] has a bug. please apply following incremental patch.

index 9c7f149..f0e34d4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -448,8 +448,8 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *no
task_tgid_nr(task), task_tgid_nr(task->real_parent),
task_uid(task),
task->mm->total_vm,
- get_mm_rss(task->mm) + p->mm->nr_ptes,
- get_mm_counter(p->mm, MM_SWAPENTS),
+ get_mm_rss(task->mm) + task->mm->nr_ptes,
+ get_mm_counter(task->mm, MM_SWAPENTS),
task->signal->oom_score_adj,
task->comm);
task_unlock(task);

2011-05-31 04:52:18

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

----- Original Message -----
> (2011/05/31 10:33), CAI Qian wrote:
> > Hello,
> >
> > Have tested those patches rebased from KOSAKI for the latest
> > mainline.
> > It still killed random processes and recevied a panic at the end by
> > using root user. The full oom output can be found here.
> > http://people.redhat.com/qcai/oom
>
> You ran fork-bomb as root. Therefore unprivileged process was killed
> at first.
> It's no random. It's intentional and desirable. I mean
>
> - If you run the same progream as non-root, python will be killed at
> first.
> Because it consume a lot of memory than daemons.
> - If you run the same program as root, non root process and privilege
> explicit
> dropping processes (e.g. irqbalance) will be killed at first.
Hmm, at least there were some programs were root processes but were killed
first.
[ pid] ppid uid total_vm rss swap score_adj name
[ 5720] 5353 0 24421 257 0 0 sshd
[ 5353] 1 0 15998 189 0 0 sshd
[ 5451] 1 0 19648 235 0 0 master
[ 1626] 1 0 2287 129 0 0 dhclient
>
> Look, your log says, highest oom score process was killed first.
>
> Out of memory: Kill process 5462 (abrtd) points:393 total-vm:262300kB,
> anon-rss:1024kB, file-rss:0kB
> Out of memory: Kill process 5277 (hald) points:303 total-vm:25444kB,
> anon-rss:1116kB, file-rss:0kB
> Out of memory: Kill process 5720 (sshd) points:258 total-vm:97684kB,
> anon-rss:824kB, file-rss:0kB
> Out of memory: Kill process 5457 (pickup) points:236 total-vm:78672kB,
> anon-rss:768kB, file-rss:0kB
> Out of memory: Kill process 5451 (master) points:235 total-vm:78592kB,
> anon-rss:796kB, file-rss:0kB
> Out of memory: Kill process 5458 (qmgr) points:233 total-vm:78740kB,
> anon-rss:764kB, file-rss:0kB
> Out of memory: Kill process 5353 (sshd) points:189 total-vm:63992kB,
> anon-rss:620kB, file-rss:0kB
> Out of memory: Kill process 1626 (dhclient) points:129
> total-vm:9148kB, anon-rss:484kB, file-rss:0kB
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign
> http://stopthemeter.ca/
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2011-05-31 04:54:37

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH 4/5] oom: don't kill random process

(2011/05/31 13:48), David Rientjes wrote:
> On Mon, 30 May 2011, KOSAKI Motohiro wrote:
>
>> Never mind.
>>
>> You never see to increase tasklist_lock. You never seen all processes
>> have root privilege case.
>
> I don't really understand what you're trying to say, sorry.

It's no for job server workload. I mean.

2011-05-31 07:05:06

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

>> - If you run the same program as root, non root process and privilege
>> explicit
>> dropping processes (e.g. irqbalance) will be killed at first.
> Hmm, at least there were some programs were root processes but were killed
> first.
> [ pid] ppid uid total_vm rss swap score_adj name
> [ 5720] 5353 0 24421 257 0 0 sshd
> [ 5353] 1 0 15998 189 0 0 sshd
> [ 5451] 1 0 19648 235 0 0 master
> [ 1626] 1 0 2287 129 0 0 dhclient

Hi

I can't reproduce this too. Are you sure these processes have a full root privilege?
I've made new debugging patch. After applying following patch, do these processes show
cap=1?

index f0e34d4..fe788df 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -429,7 +429,7 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *no
struct task_struct *p;
struct task_struct *task;

- pr_info("[ pid] ppid uid total_vm rss swap score_adj name\n");
+ pr_info("[ pid] ppid uid cap total_vm rss swap score_adj name\n");
for_each_process(p) {
if (oom_unkillable_task(p, mem, nodemask))
continue;
@@ -444,9 +444,9 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *no
continue;
}

- pr_info("[%6d] %6d %5d %8lu %8lu %8lu %9d %s\n",
+ pr_info("[%6d] %6d %5d %3d %8lu %8lu %8lu %9d %s\n",
task_tgid_nr(task), task_tgid_nr(task->real_parent),
- task_uid(task),
+ task_uid(task), has_capability_noaudit(task, CAP_SYS_ADMIN),
task->mm->total_vm,
get_mm_rss(task->mm) + task->mm->nr_ptes,
get_mm_counter(task->mm, MM_SWAPENTS),

2011-05-31 07:51:25

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

----- Original Message -----
> >> - If you run the same program as root, non root process and
> >> privilege
> >> explicit
> >> dropping processes (e.g. irqbalance) will be killed at first.
> > Hmm, at least there were some programs were root processes but were
> > killed
> > first.
> > [ pid] ppid uid total_vm rss swap score_adj name
> > [ 5720] 5353 0 24421 257 0 0 sshd
> > [ 5353] 1 0 15998 189 0 0 sshd
> > [ 5451] 1 0 19648 235 0 0 master
> > [ 1626] 1 0 2287 129 0 0 dhclient
>
> Hi
>
> I can't reproduce this too. Are you sure these processes have a full
> root privilege?
> I've made new debugging patch. After applying following patch, do
> these processes show
> cap=1?
No, all of them had cap=0. Wondering why something like sshd not been
made cap=1 to avoid early oom kill.
>
>
>
> index f0e34d4..fe788df 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -429,7 +429,7 @@ static void dump_tasks(const struct mem_cgroup
> *mem, const nodemask_t *no
> struct task_struct *p;
> struct task_struct *task;
>
> - pr_info("[ pid] ppid uid total_vm rss swap score_adj name\n");
> + pr_info("[ pid] ppid uid cap total_vm rss swap score_adj name\n");
> for_each_process(p) {
> if (oom_unkillable_task(p, mem, nodemask))
> continue;
> @@ -444,9 +444,9 @@ static void dump_tasks(const struct mem_cgroup
> *mem, const nodemask_t *no
> continue;
> }
>
> - pr_info("[%6d] %6d %5d %8lu %8lu %8lu %9d %s\n",
> + pr_info("[%6d] %6d %5d %3d %8lu %8lu %8lu %9d %s\n",
> task_tgid_nr(task), task_tgid_nr(task->real_parent),
> - task_uid(task),
> + task_uid(task), has_capability_noaudit(task, CAP_SYS_ADMIN),
> task->mm->total_vm,
> get_mm_rss(task->mm) + task->mm->nr_ptes,
> get_mm_counter(task->mm, MM_SWAPENTS),
>
>
>
>
>
>
>
>
>
>
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign
> http://stopthemeter.ca/
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2011-05-31 07:57:05

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

(2011/05/31 16:50), CAI Qian wrote:
>
>
> ----- Original Message -----
>>>> - If you run the same program as root, non root process and
>>>> privilege
>>>> explicit
>>>> dropping processes (e.g. irqbalance) will be killed at first.
>>> Hmm, at least there were some programs were root processes but were
>>> killed
>>> first.
>>> [ pid] ppid uid total_vm rss swap score_adj name
>>> [ 5720] 5353 0 24421 257 0 0 sshd
>>> [ 5353] 1 0 15998 189 0 0 sshd
>>> [ 5451] 1 0 19648 235 0 0 master
>>> [ 1626] 1 0 2287 129 0 0 dhclient
>>
>> Hi
>>
>> I can't reproduce this too. Are you sure these processes have a full
>> root privilege?
>> I've made new debugging patch. After applying following patch, do
>> these processes show
>> cap=1?
> No, all of them had cap=0. Wondering why something like sshd not been
> made cap=1 to avoid early oom kill.

Then, I believe your distro applying distro specific patch to ssh.
Which distro are you using now?

2011-05-31 08:00:05

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

----- Original Message -----
> (2011/05/31 16:50), CAI Qian wrote:
> >
> >
> > ----- Original Message -----
> >>>> - If you run the same program as root, non root process and
> >>>> privilege
> >>>> explicit
> >>>> dropping processes (e.g. irqbalance) will be killed at first.
> >>> Hmm, at least there were some programs were root processes but
> >>> were
> >>> killed
> >>> first.
> >>> [ pid] ppid uid total_vm rss swap score_adj name
> >>> [ 5720] 5353 0 24421 257 0 0 sshd
> >>> [ 5353] 1 0 15998 189 0 0 sshd
> >>> [ 5451] 1 0 19648 235 0 0 master
> >>> [ 1626] 1 0 2287 129 0 0 dhclient
> >>
> >> Hi
> >>
> >> I can't reproduce this too. Are you sure these processes have a
> >> full
> >> root privilege?
> >> I've made new debugging patch. After applying following patch, do
> >> these processes show
> >> cap=1?
> > No, all of them had cap=0. Wondering why something like sshd not
> > been
> > made cap=1 to avoid early oom kill.
>
> Then, I believe your distro applying distro specific patch to ssh.
> Which distro are you using now?
It is a Fedora-like distro.

2011-05-31 08:11:28

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

>> Then, I believe your distro applying distro specific patch to ssh.
>> Which distro are you using now?
> It is a Fedora-like distro.

Ho Hm.
Actually, I'm using Fedora14 and I don't see this phenomenon.
I'll try to version up to Fedora15 in near future.

Thanks.

2011-05-31 10:01:21

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

(2011/05/31 17:11), KOSAKI Motohiro wrote:
>>> Then, I believe your distro applying distro specific patch to ssh.
>>> Which distro are you using now?
>> It is a Fedora-like distro.

So, Does this makes sense?

>From e47fedaa546499fa3d4196753194db0609cfa2e5 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Tue, 31 May 2011 18:28:30 +0900
Subject: [PATCH] oom: use euid instead of CAP_SYS_ADMIN for protection root process

Recently, many userland daemon prefer to use libcap-ng and drop
all privilege just after startup. Because of (1) Almost privilege
are necessary only when special file open, and aren't necessary
read and write. (2) In general, privilege dropping brings better
protection from exploit when bugs are found in the daemon.

But, it makes suboptimal oom-killer behavior. CAI Qian reported
oom killer killed some important daemon at first on his fedora
like distro. Because they've lost CAP_SYS_ADMIN.

Of course, we recommend to drop privileges as far as possible
instead of keeping them. Thus, oom killer don't have to check
any capability. It implicitly suggest wrong programming style.

This patch change root process check way from CAP_SYS_ADMIN to
just euid==0.

Reported-by: CAI Qian <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 59eda6e..4e1e8a5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -203,7 +203,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
* Root processes get 3% bonus, just like the __vm_enough_memory()
* implementation used by LSMs.
*/
- if (protect_root && has_capability_noaudit(p, CAP_SYS_ADMIN)) {
+ if (protect_root && (task_euid(p) == 0)) {
if (points >= totalpages / 32)
points -= totalpages / 32;
else
@@ -429,7 +429,7 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
struct task_struct *p;
struct task_struct *task;

- pr_info("[ pid] ppid uid cap total_vm rss swap score_adj name\n");
+ pr_info("[ pid] ppid uid euid total_vm rss swap score_adj name\n");
for_each_process(p) {
if (oom_unkillable_task(p, mem, nodemask))
continue;
@@ -444,9 +444,9 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
continue;
}

- pr_info("[%6d] %6d %5d %3d %8lu %8lu %8lu %9d %s\n",
+ pr_info("[%6d] %6d %5d %5d %8lu %8lu %8lu %9d %s\n",
task_tgid_nr(task), task_tgid_nr(task->real_parent),
- task_uid(task), has_capability_noaudit(task, CAP_SYS_ADMIN),
+ task_uid(task), task_euid(task),
task->mm->total_vm,
get_mm_rss(task->mm) + task->mm->nr_ptes,
get_mm_counter(task->mm, MM_SWAPENTS),
--
1.7.3.1

2011-06-01 01:17:56

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

----- Original Message -----
> (2011/05/31 17:11), KOSAKI Motohiro wrote:
> >>> Then, I believe your distro applying distro specific patch to ssh.
> >>> Which distro are you using now?
> >> It is a Fedora-like distro.
>
> So, Does this makes sense?
Looks like so, at least now sshd can survive from oom-killed.
>
>
>
> From e47fedaa546499fa3d4196753194db0609cfa2e5 Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <[email protected]>
> Date: Tue, 31 May 2011 18:28:30 +0900
> Subject: [PATCH] oom: use euid instead of CAP_SYS_ADMIN for protection
> root process
>
> Recently, many userland daemon prefer to use libcap-ng and drop
> all privilege just after startup. Because of (1) Almost privilege
> are necessary only when special file open, and aren't necessary
> read and write. (2) In general, privilege dropping brings better
> protection from exploit when bugs are found in the daemon.
>
> But, it makes suboptimal oom-killer behavior. CAI Qian reported
> oom killer killed some important daemon at first on his fedora
> like distro. Because they've lost CAP_SYS_ADMIN.
>
> Of course, we recommend to drop privileges as far as possible
> instead of keeping them. Thus, oom killer don't have to check
> any capability. It implicitly suggest wrong programming style.
>
> This patch change root process check way from CAP_SYS_ADMIN to
> just euid==0.
>
> Reported-by: CAI Qian <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/oom_kill.c | 8 ++++----
> 1 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 59eda6e..4e1e8a5 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -203,7 +203,7 @@ unsigned long oom_badness(struct task_struct *p,
> struct mem_cgroup *mem,
> * Root processes get 3% bonus, just like the __vm_enough_memory()
> * implementation used by LSMs.
> */
> - if (protect_root && has_capability_noaudit(p, CAP_SYS_ADMIN)) {
> + if (protect_root && (task_euid(p) == 0)) {
> if (points >= totalpages / 32)
> points -= totalpages / 32;
> else
> @@ -429,7 +429,7 @@ static void dump_tasks(const struct mem_cgroup
> *mem, const nodemask_t *nodemask)
> struct task_struct *p;
> struct task_struct *task;
>
> - pr_info("[ pid] ppid uid cap total_vm rss swap score_adj name\n");
> + pr_info("[ pid] ppid uid euid total_vm rss swap score_adj name\n");
> for_each_process(p) {
> if (oom_unkillable_task(p, mem, nodemask))
> continue;
> @@ -444,9 +444,9 @@ static void dump_tasks(const struct mem_cgroup
> *mem, const nodemask_t *nodemask)
> continue;
> }
>
> - pr_info("[%6d] %6d %5d %3d %8lu %8lu %8lu %9d %s\n",
> + pr_info("[%6d] %6d %5d %5d %8lu %8lu %8lu %9d %s\n",
> task_tgid_nr(task), task_tgid_nr(task->real_parent),
> - task_uid(task), has_capability_noaudit(task, CAP_SYS_ADMIN),
> + task_uid(task), task_euid(task),
> task->mm->total_vm,
> get_mm_rss(task->mm) + task->mm->nr_ptes,
> get_mm_counter(task->mm, MM_SWAPENTS),
> --
> 1.7.3.1
>
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign
> http://stopthemeter.ca/
> Don't email: href=mailto:"[email protected]"> [email protected]

2011-06-01 03:33:09

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

Hi KOSAKI,

On Tue, May 31, 2011 at 07:01:08PM +0900, KOSAKI Motohiro wrote:
> (2011/05/31 17:11), KOSAKI Motohiro wrote:
> >>> Then, I believe your distro applying distro specific patch to ssh.
> >>> Which distro are you using now?
> >> It is a Fedora-like distro.
>
> So, Does this makes sense?
>
>
>
> From e47fedaa546499fa3d4196753194db0609cfa2e5 Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <[email protected]>
> Date: Tue, 31 May 2011 18:28:30 +0900
> Subject: [PATCH] oom: use euid instead of CAP_SYS_ADMIN for protection root process
>
> Recently, many userland daemon prefer to use libcap-ng and drop
> all privilege just after startup. Because of (1) Almost privilege
> are necessary only when special file open, and aren't necessary
> read and write. (2) In general, privilege dropping brings better
> protection from exploit when bugs are found in the daemon.
>
> But, it makes suboptimal oom-killer behavior. CAI Qian reported
> oom killer killed some important daemon at first on his fedora
> like distro. Because they've lost CAP_SYS_ADMIN.
>
> Of course, we recommend to drop privileges as far as possible
> instead of keeping them. Thus, oom killer don't have to check
> any capability. It implicitly suggest wrong programming style.
>
> This patch change root process check way from CAP_SYS_ADMIN to
> just euid==0.

I like this but I have some comments.
Firstly, it's not dependent with your series so I think this could
be merged firstly.
Before that, I would like to make clear my concern.
As I look below comment, 3% bonus is dependent with __vm_enough_memory's logic?
If it isn't, we can remove the comment. It would be another patch.
If is is, could we change __vm_enough_memory for euid instead of cap?

* Root processes get 3% bonus, just like the __vm_enough_memory()
* implementation used by LSMs.

--
Kind regards
Minchan Kim

2011-06-06 03:07:21

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

>> Of course, we recommend to drop privileges as far as possible
>> instead of keeping them. Thus, oom killer don't have to check
>> any capability. It implicitly suggest wrong programming style.
>>
>> This patch change root process check way from CAP_SYS_ADMIN to
>> just euid==0.
>
> I like this but I have some comments.
> Firstly, it's not dependent with your series so I think this could
> be merged firstly.

I agree.

> Before that, I would like to make clear my concern.
> As I look below comment, 3% bonus is dependent with __vm_enough_memory's logic?

No. completely independent.

vm_enough_memory() check the task _can_ allocate more memory. IOW, the task
is subjective. And oom-killer check the task should be protected from oom-killer.
IOW, the task is objective.

> If it isn't, we can remove the comment. It would be another patch.
> If is is, could we change __vm_enough_memory for euid instead of cap?
>
> * Root processes get 3% bonus, just like the __vm_enough_memory()
> * implementation used by LSMs.

vm_enough_memory() is completely correct. I don't see any reason to change it.

2011-06-06 14:45:13

[permalink] [raw]

Subject: Re: [PATCH v2 0/5] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

On Mon, Jun 06, 2011 at 12:07:15PM +0900, KOSAKI Motohiro wrote:
> >> Of course, we recommend to drop privileges as far as possible
> >> instead of keeping them. Thus, oom killer don't have to check
> >> any capability. It implicitly suggest wrong programming style.
> >>
> >> This patch change root process check way from CAP_SYS_ADMIN to
> >> just euid==0.
> >
> > I like this but I have some comments.
> > Firstly, it's not dependent with your series so I think this could
> > be merged firstly.
>
> I agree.
>
> > Before that, I would like to make clear my concern.
> > As I look below comment, 3% bonus is dependent with __vm_enough_memory's logic?
>
> No. completely independent.
>
> vm_enough_memory() check the task _can_ allocate more memory. IOW, the task
> is subjective. And oom-killer check the task should be protected from oom-killer.
> IOW, the task is objective.
>

Hmm, maybe I can't understand your point.
My though was below.

Assumption)
1. root allocation bonus point -> 10%
2. OOM have no bonus about root process

Scenario)
1.
System has 101 free pages and 10 normal tasks.
Ideally, 10 tasks allocates free memory fairly so each task will have 10 pages.
So OOM killer can select victim fairly when new task which requires 10 pages forks.

2.
System has 101 free pages and 10 tasks. (9 normal task , 1 root task)
10 * 9 + 11 will be consumed. So each normal task will have 10 pages
but a root task have 11 pages.
So OOM killer can always selectd root process as vicim.(We assumed
OOM doesn't have a bonus on root process)

Conclusion)
For solving above problem, we have to give bonus which was given
in allocation to OOM, too. It's fair.
So I think it has a dependency.

--
Kind regards
Minchan Kim