2011-06-22 10:45:57

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH v3 0/6] Fix oom killer doesn't work at all if system have > gigabytes memory (aka CAI founded issue)

CAI Qian reported current oom logic doesn't work at all on his 16GB RAM
machine. oom killer killed all system daemon at first and his system
stopped responding.

The brief log is below.

> > Out of memory: Kill process 1175 (dhclient) score 1 or sacrifice child
> > Out of memory: Kill process 1247 (rsyslogd) score 1 or sacrifice child
> > Out of memory: Kill process 1284 (irqbalance) score 1 or sacrifice child
> > Out of memory: Kill process 1303 (rpcbind) score 1 or sacrifice child
> > Out of memory: Kill process 1321 (rpc.statd) score 1 or sacrifice child
> > Out of memory: Kill process 1333 (mdadm) score 1 or sacrifice child
> > Out of memory: Kill process 1365 (rpc.idmapd) score 1 or sacrifice child
> > Out of memory: Kill process 1403 (dbus-daemon) score 1 or sacrifice child
> > Out of memory: Kill process 1438 (acpid) score 1 or sacrifice child
> > Out of memory: Kill process 1447 (hald) score 1 or sacrifice child
> > Out of memory: Kill process 1447 (hald) score 1 or sacrifice child
> > Out of memory: Kill process 1487 (hald-addon-inpu) score 1 or sacrifice child
> > Out of memory: Kill process 1488 (hald-addon-acpi) score 1 or sacrifice child
> > Out of memory: Kill process 1507 (automount) score 1 or sacrifice child

The problems are three.

1) if two processes have the same oom score, we should kill younger process.
but current logic kill older. Typically oldest processes are system daemons.
2) Current logic use 'unsigned int' for internal score calculation. (exactly says,
it only use 0-1000 value). its very low precision calculation makes a lot of
same oom score and kill an ineligible process.
3) Current logic give 3% of SystemRAM to root processes. It obviously too big
if you have plenty memory. Now, your fork-bomb processes have 500MB OOM immune
bonus. then your fork-bomb never ever be killed.

Changes from v2
o added [patch 1/5] use euid instead of CAP_SYS_ADMIN


KOSAKI Motohiro (6):
oom: use euid instead of CAP_SYS_ADMIN for protection root process
oom: improve dump_tasks() show items
oom: kill younger process first
oom: oom-killer don't use proportion of system-ram internally
oom: don't kill random process
oom: merge oom_kill_process() with oom_kill_task()

fs/proc/base.c | 13 ++-
include/linux/oom.h | 5 +-
include/linux/sched.h | 11 +++
mm/oom_kill.c | 201 ++++++++++++++++++++++++++----------------------
4 files changed, 131 insertions(+), 99 deletions(-)

--
1.7.3.1


2011-06-22 10:46:46

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 1/6] oom: use euid instead of CAP_SYS_ADMIN for protection root process

Recently, many userland daemon prefer to use libcap-ng and drop
all privilege just after startup. Because of (1) Almost privilege
are necessary only when special file open, and aren't necessary
read and write. (2) In general, privilege dropping brings better
protection from exploit when bugs are found in the daemon.

But, it makes suboptimal oom-killer behavior. CAI Qian reported
oom killer killed some important daemon at first on his fedora
like distro. Because they've lost CAP_SYS_ADMIN.

Of course, we recommend to drop privileges as far as possible
instead of keeping them. Thus, oom killer don't have to check
any capability. It implicitly suggest wrong programming style.

This patch change root process check way from CAP_SYS_ADMIN to
just euid==0.

Reported-by: CAI Qian <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 9 +++++----
1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e4b0991..f552e39 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -203,7 +203,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
* Root processes get 3% bonus, just like the __vm_enough_memory()
* implementation used by LSMs.
*/
- if (has_capability_noaudit(p, CAP_SYS_ADMIN))
+ if (task_euid(p) == 0)
points -= 30;

/*
@@ -373,7 +373,7 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
struct task_struct *p;
struct task_struct *task;

- pr_info("[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name\n");
+ pr_info("[ pid ] uid euid tgid total_vm rss cpu oom_adj oom_score_adj name\n");
for_each_process(p) {
if (oom_unkillable_task(p, mem, nodemask))
continue;
@@ -388,8 +388,9 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
continue;
}

- pr_info("[%5d] %5d %5d %8lu %8lu %3u %3d %5d %s\n",
- task->pid, task_uid(task), task->tgid,
+ pr_info("[%5d] %5d %5d %5d %8lu %8lu %3u %3d %5d %s\n",
+ task->pid, task_uid(task), task_euid(task),
+ task->tgid,
task->mm->total_vm, get_mm_rss(task->mm),
task_cpu(task), task->signal->oom_adj,
task->signal->oom_score_adj, task->comm);
--
1.7.3.1


2011-06-22 10:47:22

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 2/6] oom: improve dump_tasks() show items

Recently, oom internal logic was dramatically changed. Thus
dump_tasks() doesn't show enough information for bug report
analysis. it has some meaningless items and don't have some
oom socre related items.

This patch adapt displaying fields to new oom logic.

details
--------
removed: pid (we always kill process. don't need thread id),
signal->oom_adj (we no longer uses it internally)
cpu (we no longer uses it)
added: ppid (we often kill sacrifice child process)
swap (it's accounted)
modify: RSS (account mm->nr_ptes too)

<old>
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 3886] 0 3886 2893 441 1 0 0 bash
[ 3905] 0 3905 29361 25833 0 0 0 memtoy

<new>
[ pid] ppid uid euid total_vm rss swap score_adj name
[ 417] 1 0 0 3298 12 184 -1000 udevd
[ 830] 1 0 0 1776 11 16 0 system-setup-ke
[ 973] 1 0 0 61179 35 116 0 rsyslogd
[ 1733] 1732 0 0 1052337 958582 0 0 memtoy

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 16 +++++++++-------
1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f552e39..9412657 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -373,7 +373,7 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
struct task_struct *p;
struct task_struct *task;

- pr_info("[ pid ] uid euid tgid total_vm rss cpu oom_adj oom_score_adj name\n");
+ pr_info("[ pid] ppid uid euid total_vm rss swap score_adj name\n");
for_each_process(p) {
if (oom_unkillable_task(p, mem, nodemask))
continue;
@@ -388,12 +388,14 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
continue;
}

- pr_info("[%5d] %5d %5d %5d %8lu %8lu %3u %3d %5d %s\n",
- task->pid, task_uid(task), task_euid(task),
- task->tgid,
- task->mm->total_vm, get_mm_rss(task->mm),
- task_cpu(task), task->signal->oom_adj,
- task->signal->oom_score_adj, task->comm);
+ pr_info("[%6d] %6d %5d %5d %8lu %8lu %8lu %9d %s\n",
+ task_tgid_nr(task), task_tgid_nr(task->real_parent),
+ task_uid(task), task_euid(task),
+ task->mm->total_vm,
+ get_mm_rss(task->mm) + task->mm->nr_ptes,
+ get_mm_counter(task->mm, MM_SWAPENTS),
+ task->signal->oom_score_adj,
+ task->comm);
task_unlock(task);
}
}
--
1.7.3.1


2011-06-22 10:47:52

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 3/6] oom: kill younger process first

This patch introduces do_each_thread_reverse() and select_bad_process()
uses it. The benefits are two, 1) oom-killer can kill younger process
than older if they have a same oom score. Usually younger process is
less important. 2) younger task often have PF_EXITING because shell
script makes a lot of short lived processes. Reverse order search can
detect it faster.

Reported-by: CAI Qian <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/sched.h | 11 +++++++++++
mm/oom_kill.c | 2 +-
2 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e4e6d7b..392ff30 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2257,6 +2257,9 @@ static inline unsigned long wait_task_inactive(struct task_struct *p,
#define next_task(p) \
list_entry_rcu((p)->tasks.next, struct task_struct, tasks)

+#define prev_task(p) \
+ list_entry((p)->tasks.prev, struct task_struct, tasks)
+
#define for_each_process(p) \
for (p = &init_task ; (p = next_task(p)) != &init_task ; )

@@ -2269,6 +2272,14 @@ extern bool current_is_single_threaded(void);
#define do_each_thread(g, t) \
for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do

+/*
+ * Similar to do_each_thread(). but two difference are there.
+ * - traverse tasks reverse order (i.e. younger to older)
+ * - caller must hold tasklist_lock. rcu_read_lock isn't enough
+*/
+#define do_each_thread_reverse(g, t) \
+ for (g = t = &init_task ; (g = t = prev_task(g)) != &init_task ; ) do
+
#define while_each_thread(g, t) \
while ((t = next_thread(t)) != g)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9412657..797308b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -300,7 +300,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
struct task_struct *chosen = NULL;
*ppoints = 0;

- do_each_thread(g, p) {
+ do_each_thread_reverse(g, p) {
unsigned int points;

if (!p->mm)
--
1.7.3.1


2011-06-22 10:48:27

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 4/6] oom: oom-killer don't use proportion of system-ram internally

CAI Qian reported his kernel did hang-up if he ran fork intensive
workload and then invoke oom-killer.

The problem is, current oom calculation uses 0-1000 normalized value
(The unit is a permillage of system-ram). Its low precision make
a lot of same oom score. IOW, in his case, all processes have smaller
oom score than 1 and internal calculation round it to 1.

Thus oom-killer kill ineligible process. This regression is caused by
commit a63d83f427 (oom: badness heuristic rewrite).

The solution is, the internal calculation just use number of pages
instead of permillage of system-ram. And convert it to permillage
value at displaying time.

This patch doesn't change any ABI (included /proc/<pid>/oom_score_adj)
even though current logic has a lot of my dislike thing.

Reported-by: CAI Qian <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
fs/proc/base.c | 13 ++++++----
include/linux/oom.h | 2 +-
mm/oom_kill.c | 60 ++++++++++++++++++++++++++++++--------------------
3 files changed, 45 insertions(+), 30 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 14def99..4a10763 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -479,14 +479,17 @@ static const struct file_operations proc_lstats_operations = {

static int proc_oom_score(struct task_struct *task, char *buffer)
{
- unsigned long points = 0;
+ unsigned long points;
+ unsigned long ratio = 0;
+ unsigned long totalpages = totalram_pages + total_swap_pages + 1;

read_lock(&tasklist_lock);
- if (pid_alive(task))
- points = oom_badness(task, NULL, NULL,
- totalram_pages + total_swap_pages);
+ if (pid_alive(task)) {
+ points = oom_badness(task, NULL, NULL, totalpages);
+ ratio = points * 1000 / totalpages;
+ }
read_unlock(&tasklist_lock);
- return sprintf(buffer, "%lu\n", points);
+ return sprintf(buffer, "%lu\n", ratio);
}

struct limit_names {
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 4952fb8..75b104c 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -42,7 +42,7 @@ enum oom_constraint {

extern int test_set_oom_score_adj(int new_val);

-extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
+extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
const nodemask_t *nodemask, unsigned long totalpages);
extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 797308b..cff8000 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -159,10 +159,11 @@ static bool oom_unkillable_task(struct task_struct *p,
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom failures.
*/
-unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
+unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
const nodemask_t *nodemask, unsigned long totalpages)
{
- int points;
+ unsigned long points;
+ unsigned long score_adj = 0;

if (oom_unkillable_task(p, mem, nodemask))
return 0;
@@ -194,33 +195,44 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
*/
points = get_mm_rss(p->mm) + p->mm->nr_ptes;
points += get_mm_counter(p->mm, MM_SWAPENTS);
-
- points *= 1000;
- points /= totalpages;
task_unlock(p);

- /*
- * Root processes get 3% bonus, just like the __vm_enough_memory()
- * implementation used by LSMs.
- */
- if (task_euid(p) == 0)
- points -= 30;
+ /* Root processes get 3% bonus. */
+ if (task_euid(p) == 0) {
+ if (points >= totalpages / 32)
+ points -= totalpages / 32;
+ else
+ points = 0;
+ }

/*
* /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
* either completely disable oom killing or always prefer a certain
* task.
*/
- points += p->signal->oom_score_adj;
+ if (p->signal->oom_score_adj >= 0) {
+ score_adj = p->signal->oom_score_adj * (totalpages / 1000);
+ if (ULONG_MAX - points >= score_adj)
+ points += score_adj;
+ else
+ points = ULONG_MAX;
+ } else {
+ score_adj = -p->signal->oom_score_adj * (totalpages / 1000);
+ if (points >= score_adj)
+ points -= score_adj;
+ else
+ points = 0;
+ }

/*
* Never return 0 for an eligible task that may be killed since it's
* possible that no single user task uses more than 0.1% of memory and
* no single admin tasks uses more than 3.0%.
*/
- if (points <= 0)
- return 1;
- return (points < 1000) ? points : 1000;
+ if (!points)
+ points = 1;
+
+ return points;
}

/*
@@ -292,7 +304,7 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
*
* (not docbooked, we don't want this one cluttering up the manual)
*/
-static struct task_struct *select_bad_process(unsigned int *ppoints,
+static struct task_struct *select_bad_process(unsigned long *ppoints,
unsigned long totalpages, struct mem_cgroup *mem,
const nodemask_t *nodemask)
{
@@ -301,7 +313,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
*ppoints = 0;

do_each_thread_reverse(g, p) {
- unsigned int points;
+ unsigned long points;

if (!p->mm)
continue;
@@ -332,7 +344,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
*/
if (p == current) {
chosen = p;
- *ppoints = 1000;
+ *ppoints = ULONG_MAX;
} else {
/*
* If this task is not being ptraced on exit,
@@ -463,14 +475,14 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
#undef K

static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
- unsigned int points, unsigned long totalpages,
+ unsigned long points, unsigned long totalpages,
struct mem_cgroup *mem, nodemask_t *nodemask,
const char *message)
{
struct task_struct *victim = p;
struct task_struct *child;
struct task_struct *t = p;
- unsigned int victim_points = 0;
+ unsigned long victim_points = 0;

if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem, nodemask);
@@ -485,7 +497,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
}

task_lock(p);
- pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+ pr_err("%s: Kill process %d (%s) points %lu or sacrifice child\n",
message, task_pid_nr(p), p->comm, points);
task_unlock(p);

@@ -497,7 +509,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
*/
do {
list_for_each_entry(child, &t->children, sibling) {
- unsigned int child_points;
+ unsigned long child_points;

if (child->mm == p->mm)
continue;
@@ -544,7 +556,7 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
{
unsigned long limit;
- unsigned int points = 0;
+ unsigned long points = 0;
struct task_struct *p;

/*
@@ -693,7 +705,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
struct task_struct *p;
unsigned long totalpages;
unsigned long freed = 0;
- unsigned int points;
+ unsigned long points;
enum oom_constraint constraint = CONSTRAINT_NONE;
int killed = 0;

--
1.7.3.1


2011-06-22 10:48:59

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 5/6] oom: don't kill random process

CAI Qian reported oom-killer killed all system daemons in his
system at first if he ran fork bomb as root. The problem is,
current logic give them bonus of 3% of system ram. Example,
he has 16GB machine, then root processes have ~500MB oom
immune. It bring us crazy bad result. _all_ processes have
oom-score=1 and then, oom killer ignore process memroy usage
and kill random process. This regression is caused by commit
a63d83f427 (oom: badness heuristic rewrite).

This patch changes select_bad_process() slightly. If oom points == 1,
it's a sign that the system have only root privileged processes or
similar. Thus, select_bad_process() calculate oom badness without
root bonus and select eligible process.

Also, this patch move finding sacrifice child logic into
select_bad_process(). It's necessary to implement adequate
no root bonus recalculation. and it makes good side effect,
current logic doesn't behave as the doc.

Documentation/sysctl/vm.txt says

oom_kill_allocating_task

If this is set to non-zero, the OOM killer simply kills the task that
triggered the out-of-memory condition. This avoids the expensive
tasklist scan.

IOW, oom_kill_allocating_task shouldn't search sacrifice child.
This patch also fixes this issue.

Reported-by: CAI Qian <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
fs/proc/base.c | 2 +-
include/linux/oom.h | 3 +-
mm/oom_kill.c | 89 ++++++++++++++++++++++++++++----------------------
3 files changed, 53 insertions(+), 41 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 4a10763..5e4a8a1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -485,7 +485,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer)

read_lock(&tasklist_lock);
if (pid_alive(task)) {
- points = oom_badness(task, NULL, NULL, totalpages);
+ points = oom_badness(task, NULL, NULL, totalpages, 1);
ratio = points * 1000 / totalpages;
}
read_unlock(&tasklist_lock);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 75b104c..272e3bb 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -43,7 +43,8 @@ enum oom_constraint {
extern int test_set_oom_score_adj(int new_val);

extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long totalpages);
+ const nodemask_t *nodemask, unsigned long totalpages,
+ int protect_root);
extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index cff8000..cf48fd5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -160,7 +160,8 @@ static bool oom_unkillable_task(struct task_struct *p,
* task consuming the most memory to avoid subsequent oom failures.
*/
unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long totalpages)
+ const nodemask_t *nodemask, unsigned long totalpages,
+ int protect_root)
{
unsigned long points;
unsigned long score_adj = 0;
@@ -198,7 +199,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
task_unlock(p);

/* Root processes get 3% bonus. */
- if (task_euid(p) == 0) {
+ if (protect_root && task_euid(p) == 0) {
if (points >= totalpages / 32)
points -= totalpages / 32;
else
@@ -310,8 +311,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
{
struct task_struct *g, *p;
struct task_struct *chosen = NULL;
- *ppoints = 0;
+ int protect_root = 1;
+ unsigned long chosen_points = 0;
+ struct task_struct *child;

+ retry:
do_each_thread_reverse(g, p) {
unsigned long points;

@@ -344,7 +348,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
*/
if (p == current) {
chosen = p;
- *ppoints = ULONG_MAX;
+ chosen_points = ULONG_MAX;
} else {
/*
* If this task is not being ptraced on exit,
@@ -357,13 +361,49 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
}
}

- points = oom_badness(p, mem, nodemask, totalpages);
- if (points > *ppoints) {
+ points = oom_badness(p, mem, nodemask, totalpages, protect_root);
+ if (points > chosen_points) {
chosen = p;
- *ppoints = points;
+ chosen_points = points;
}
} while_each_thread(g, p);

+ /*
+ * chosen_point==1 may be a sign that root privilege bonus is too large
+ * and we choose wrong task. Let's recalculate oom score without the
+ * dubious bonus.
+ */
+ if (protect_root && (chosen_points == 1)) {
+ protect_root = 0;
+ goto retry;
+ }
+
+ /*
+ * If any of p's children has a different mm and is eligible for kill,
+ * the one with the highest badness() score is sacrificed for its
+ * parent. This attempts to lose the minimal amount of work done while
+ * still freeing memory.
+ */
+ g = p = chosen;
+ do {
+ list_for_each_entry(child, &p->children, sibling) {
+ unsigned long child_points;
+
+ if (child->mm == p->mm)
+ continue;
+ /*
+ * oom_badness() returns 0 if the thread is unkillable
+ */
+ child_points = oom_badness(child, mem, nodemask,
+ totalpages, protect_root);
+ if (child_points > chosen_points) {
+ chosen = child;
+ chosen_points = child_points;
+ }
+ }
+ } while_each_thread(g, p);
+
+ *ppoints = chosen_points;
return chosen;
}

@@ -479,11 +519,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
struct mem_cgroup *mem, nodemask_t *nodemask,
const char *message)
{
- struct task_struct *victim = p;
- struct task_struct *child;
- struct task_struct *t = p;
- unsigned long victim_points = 0;
-
if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem, nodemask);

@@ -497,35 +532,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
}

task_lock(p);
- pr_err("%s: Kill process %d (%s) points %lu or sacrifice child\n",
- message, task_pid_nr(p), p->comm, points);
+ pr_err("%s: Kill process %d (%s) points %lu\n",
+ message, task_pid_nr(p), p->comm, points);
task_unlock(p);

- /*
- * If any of p's children has a different mm and is eligible for kill,
- * the one with the highest badness() score is sacrificed for its
- * parent. This attempts to lose the minimal amount of work done while
- * still freeing memory.
- */
- do {
- list_for_each_entry(child, &t->children, sibling) {
- unsigned long child_points;
-
- if (child->mm == p->mm)
- continue;
- /*
- * oom_badness() returns 0 if the thread is unkillable
- */
- child_points = oom_badness(child, mem, nodemask,
- totalpages);
- if (child_points > victim_points) {
- victim = child;
- victim_points = child_points;
- }
- }
- } while_each_thread(p, t);
-
- return oom_kill_task(victim, mem);
+ return oom_kill_task(p, mem);
}

/*
--
1.7.3.1


2011-06-22 10:49:56

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 6/6] oom: merge oom_kill_process() with oom_kill_task()

Now, oom_kill_process() become almost empty function. Let's
merge it with oom_kill_task().

Also, this patch replace task_pid_nr() with task_tgid_nr().
Because 1) oom killer kill a process, not thread. 2) a userland
don't care thread id.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 53 ++++++++++++++++++++++-------------------------------
1 files changed, 22 insertions(+), 31 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index cf48fd5..2fdbb96 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -470,11 +470,26 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
}

#define K(x) ((x) << (PAGE_SHIFT-10))
-static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
+static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
+ unsigned long points, unsigned long totalpages,
+ struct mem_cgroup *mem, nodemask_t *nodemask,
+ const char *message)
{
struct task_struct *q;
struct mm_struct *mm;

+ if (printk_ratelimit())
+ dump_header(p, gfp_mask, order, mem, nodemask);
+
+ /*
+ * If the task is already exiting, don't alarm the sysadmin or kill
+ * its children or threads, just set TIF_MEMDIE so it can die quickly
+ */
+ if (p->flags & PF_EXITING) {
+ set_tsk_thread_flag(p, TIF_MEMDIE);
+ return 0;
+ }
+
p = find_lock_task_mm(p);
if (!p)
return 1;
@@ -482,10 +497,11 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
/* mm cannot be safely dereferenced after task_unlock(p) */
mm = p->mm;

- pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
- task_pid_nr(p), p->comm, K(p->mm->total_vm),
- K(get_mm_counter(p->mm, MM_ANONPAGES)),
- K(get_mm_counter(p->mm, MM_FILEPAGES)));
+ pr_err("%s: Kill process %d (%s) points:%lu total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
+ message, task_tgid_nr(p), p->comm, points,
+ K(p->mm->total_vm),
+ K(get_mm_counter(p->mm, MM_ANONPAGES)),
+ K(get_mm_counter(p->mm, MM_FILEPAGES)));
task_unlock(p);

/*
@@ -502,7 +518,7 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
if (q->mm == mm && !same_thread_group(q, p)) {
task_lock(q); /* Protect ->comm from prctl() */
pr_err("Kill process %d (%s) sharing same memory\n",
- task_pid_nr(q), q->comm);
+ task_tgid_nr(q), q->comm);
task_unlock(q);
force_sig(SIGKILL, q);
}
@@ -514,31 +530,6 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
}
#undef K

-static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
- unsigned long points, unsigned long totalpages,
- struct mem_cgroup *mem, nodemask_t *nodemask,
- const char *message)
-{
- if (printk_ratelimit())
- dump_header(p, gfp_mask, order, mem, nodemask);
-
- /*
- * If the task is already exiting, don't alarm the sysadmin or kill
- * its children or threads, just set TIF_MEMDIE so it can die quickly
- */
- if (p->flags & PF_EXITING) {
- set_tsk_thread_flag(p, TIF_MEMDIE);
- return 0;
- }
-
- task_lock(p);
- pr_err("%s: Kill process %d (%s) points %lu\n",
- message, task_pid_nr(p), p->comm, points);
- task_unlock(p);
-
- return oom_kill_task(p, mem);
-}
-
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
--
1.7.3.1


2011-06-22 22:57:47

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 1/6] oom: use euid instead of CAP_SYS_ADMIN for protection root process

On Wed, 22 Jun 2011, KOSAKI Motohiro wrote:

> Recently, many userland daemon prefer to use libcap-ng and drop
> all privilege just after startup. Because of (1) Almost privilege
> are necessary only when special file open, and aren't necessary
> read and write. (2) In general, privilege dropping brings better
> protection from exploit when bugs are found in the daemon.
>

You could also say that dropping the capability drops the bonus it is
given in the oom killer. We've never promised any benefit in the oom
killer badness scoring without the capability.

> But, it makes suboptimal oom-killer behavior. CAI Qian reported
> oom killer killed some important daemon at first on his fedora
> like distro. Because they've lost CAP_SYS_ADMIN.
>

I disagree that we should be identifying "important daemons" by tying it
the effective uid of the process and thus making some sort of inference
because a thread was forked by root. I think it is more clear to tie that
to an actual capability that is present, such as CAP_SYS_ADMIN, or suggest
that the user give the "important daemon" it's own bonus by tuning
/proc/pid/oom_score_adj.

We already know that the kernel will not be able to identify critical
processes perfectly, that's an assumption that we can live with. We must
rely on userspace to influence that decision by using the tunable.

If this patch were merged, I could easily imagine an argument in the
reverse that would just simply revert it: it would be very easy to say
that CAP_SYS_ADMIN has always given this bonus in recent memory so
changing it would be a regression over the previous behavior and/or that
giving the capability to a thread as it runs implies that it should have
the bonus when the euid may not be 0.

2011-06-22 22:59:47

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 2/6] oom: improve dump_tasks() show items

On Wed, 22 Jun 2011, KOSAKI Motohiro wrote:

> Recently, oom internal logic was dramatically changed. Thus
> dump_tasks() doesn't show enough information for bug report
> analysis. it has some meaningless items and don't have some
> oom socre related items.
>
> This patch adapt displaying fields to new oom logic.
>
> details
> --------
> removed: pid (we always kill process. don't need thread id),
> signal->oom_adj (we no longer uses it internally)
> cpu (we no longer uses it)
> added: ppid (we often kill sacrifice child process)
> swap (it's accounted)
> modify: RSS (account mm->nr_ptes too)
>
> <old>
> [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
> [ 3886] 0 3886 2893 441 1 0 0 bash
> [ 3905] 0 3905 29361 25833 0 0 0 memtoy
>
> <new>
> [ pid] ppid uid euid total_vm rss swap score_adj name
> [ 417] 1 0 0 3298 12 184 -1000 udevd
> [ 830] 1 0 0 1776 11 16 0 system-setup-ke
> [ 973] 1 0 0 61179 35 116 0 rsyslogd
> [ 1733] 1732 0 0 1052337 958582 0 0 memtoy
>

I like this very much! I'm always supportive of providing additional
information that will allow users to investigate oom conditions more
thoroughly.

I'm not sure that we should be exporting the euid, however, since I
disagreed with using it in the badness heuristic of the first patch.
Let's talk about it there and then perhaps it can be removed from the
tasklist dump if we don't actually end up using it?

Otherwise, it looks good!

2011-06-22 23:01:52

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 3/6] oom: kill younger process first

On Wed, 22 Jun 2011, KOSAKI Motohiro wrote:

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e4e6d7b..392ff30 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2257,6 +2257,9 @@ static inline unsigned long wait_task_inactive(struct task_struct *p,
> #define next_task(p) \
> list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
>
> +#define prev_task(p) \
> + list_entry((p)->tasks.prev, struct task_struct, tasks)
> +
> #define for_each_process(p) \
> for (p = &init_task ; (p = next_task(p)) != &init_task ; )
>
> @@ -2269,6 +2272,14 @@ extern bool current_is_single_threaded(void);
> #define do_each_thread(g, t) \
> for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do
>
> +/*
> + * Similar to do_each_thread(). but two difference are there.
> + * - traverse tasks reverse order (i.e. younger to older)
> + * - caller must hold tasklist_lock. rcu_read_lock isn't enough
> +*/
> +#define do_each_thread_reverse(g, t) \
> + for (g = t = &init_task ; (g = t = prev_task(g)) != &init_task ; ) do
> +
> #define while_each_thread(g, t) \
> while ((t = next_thread(t)) != g)
>

I've already ack'd the patch, but if there is another version posted as a
result of our discussion of using euid in the heuristic, I think it would
be helpful to reiterate in this comment that, like do_each_thread(),
do_each_thread_reverse() is not break-safe either. It might end up
preventing a bug down the road.

2011-06-22 23:16:27

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 4/6] oom: oom-killer don't use proportion of system-ram internally

On Wed, 22 Jun 2011, KOSAKI Motohiro wrote:

> CAI Qian reported his kernel did hang-up if he ran fork intensive
> workload and then invoke oom-killer.
>
> The problem is, current oom calculation uses 0-1000 normalized value
> (The unit is a permillage of system-ram). Its low precision make
> a lot of same oom score. IOW, in his case, all processes have smaller
> oom score than 1 and internal calculation round it to 1.
>
> Thus oom-killer kill ineligible process. This regression is caused by
> commit a63d83f427 (oom: badness heuristic rewrite).
>
> The solution is, the internal calculation just use number of pages
> instead of permillage of system-ram. And convert it to permillage
> value at displaying time.
>

Ok, I agree this is better and I like that you've kept the userspace
interfaces compatible.

> This patch doesn't change any ABI (included /proc/<pid>/oom_score_adj)
> even though current logic has a lot of my dislike thing.
>
> Reported-by: CAI Qian <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> fs/proc/base.c | 13 ++++++----
> include/linux/oom.h | 2 +-
> mm/oom_kill.c | 60 ++++++++++++++++++++++++++++++--------------------
> 3 files changed, 45 insertions(+), 30 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 14def99..4a10763 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -479,14 +479,17 @@ static const struct file_operations proc_lstats_operations = {
>
> static int proc_oom_score(struct task_struct *task, char *buffer)
> {
> - unsigned long points = 0;
> + unsigned long points;
> + unsigned long ratio = 0;
> + unsigned long totalpages = totalram_pages + total_swap_pages + 1;
>
> read_lock(&tasklist_lock);
> - if (pid_alive(task))
> - points = oom_badness(task, NULL, NULL,
> - totalram_pages + total_swap_pages);
> + if (pid_alive(task)) {
> + points = oom_badness(task, NULL, NULL, totalpages);
> + ratio = points * 1000 / totalpages;
> + }
> read_unlock(&tasklist_lock);
> - return sprintf(buffer, "%lu\n", points);
> + return sprintf(buffer, "%lu\n", ratio);
> }
>
> struct limit_names {
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 4952fb8..75b104c 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -42,7 +42,7 @@ enum oom_constraint {
>
> extern int test_set_oom_score_adj(int new_val);
>
> -extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> +extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> const nodemask_t *nodemask, unsigned long totalpages);
> extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
> extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 797308b..cff8000 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -159,10 +159,11 @@ static bool oom_unkillable_task(struct task_struct *p,
> * predictable as possible. The goal is to return the highest value for the
> * task consuming the most memory to avoid subsequent oom failures.
> */
> -unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> +unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> const nodemask_t *nodemask, unsigned long totalpages)
> {
> - int points;
> + unsigned long points;
> + unsigned long score_adj = 0;

Does this need to be initialized to 0?

>
> if (oom_unkillable_task(p, mem, nodemask))
> return 0;

I was going to suggest changing the comment for oom_badness(), but then
realized that it never stated that it returns a proportion in the first
place! I suggest, however, that you modify the comment to specify what
the return value is: a value up to the point of totalpages that represents
the amount of rss, swap, and ptes that the process is using.

> @@ -194,33 +195,44 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> */
> points = get_mm_rss(p->mm) + p->mm->nr_ptes;
> points += get_mm_counter(p->mm, MM_SWAPENTS);
> -
> - points *= 1000;
> - points /= totalpages;
> task_unlock(p);
>
> - /*
> - * Root processes get 3% bonus, just like the __vm_enough_memory()
> - * implementation used by LSMs.
> - */
> - if (task_euid(p) == 0)
> - points -= 30;
> + /* Root processes get 3% bonus. */
> + if (task_euid(p) == 0) {
> + if (points >= totalpages / 32)
> + points -= totalpages / 32;
> + else
> + points = 0;
> + }
>
> /*
> * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
> * either completely disable oom killing or always prefer a certain
> * task.
> */
> - points += p->signal->oom_score_adj;
> + if (p->signal->oom_score_adj >= 0) {
> + score_adj = p->signal->oom_score_adj * (totalpages / 1000);
> + if (ULONG_MAX - points >= score_adj)
> + points += score_adj;
> + else
> + points = ULONG_MAX;

Does points = max(points + score_adj, ULONG_MAX) work here?

> + } else {
> + score_adj = -p->signal->oom_score_adj * (totalpages / 1000);
> + if (points >= score_adj)
> + points -= score_adj;
> + else
> + points = 0;
> + }
>

points = min(points - score_adj, 0)?

> /*
> * Never return 0 for an eligible task that may be killed since it's
> * possible that no single user task uses more than 0.1% of memory and
> * no single admin tasks uses more than 3.0%.
> */
> - if (points <= 0)
> - return 1;
> - return (points < 1000) ? points : 1000;
> + if (!points)
> + points = 1;
> +

Comment needs to be updated to say that an eligible task gets at least a
charge of 1 page instead of 0.1% of memory.

Everything else looks good, thanks for looking at this KOSAKI!

2011-06-22 23:22:24

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 5/6] oom: don't kill random process

On Wed, 22 Jun 2011, KOSAKI Motohiro wrote:

> CAI Qian reported oom-killer killed all system daemons in his
> system at first if he ran fork bomb as root. The problem is,
> current logic give them bonus of 3% of system ram. Example,
> he has 16GB machine, then root processes have ~500MB oom
> immune. It bring us crazy bad result. _all_ processes have
> oom-score=1 and then, oom killer ignore process memroy usage
> and kill random process. This regression is caused by commit
> a63d83f427 (oom: badness heuristic rewrite).
>

Isn't it better to give admin processes a proportional bonus instead of a
strict 3% bonus? I suggested 1% per 10% of memory used earlier and I
think it would work quite well as an alternative to this. The highest
bonus that would actually make any differences in which thread to kill
would be 5% when an admin process is using 50% of memory: in that case,
another non-admin thread would have to be using >45% of memory to be
killed instead.

Would you be satisfied with something like

points -= (points * 10 / totalpages);

be better?