2012-05-03 22:14:13

by David Rientjes

Subject: Re: 3.4-rc4 oom killer out of control.

On Thu, 26 Apr 2012, David Rientjes wrote:

> diff --git a/fs/proc/base.c b/fs/proc/base.c
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -410,12 +410,13 @@ static const struct file_operations proc_lstats_operations = {
>
> static int proc_oom_score(struct task_struct *task, char *buffer)
> {
> + unsigned long totalpages = totalram_pages + total_swap_pages;
> unsigned long points = 0;
>
> read_lock(&tasklist_lock);
> if (pid_alive(task))
> - points = oom_badness(task, NULL, NULL,
> - totalram_pages + total_swap_pages);
> + points = oom_badness(task, NULL, NULL, totalpages) *
> + 1000 / totalpages;
> read_unlock(&tasklist_lock);
> return sprintf(buffer, "%lu\n", points);
> }
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 3d76475..e4c29bc 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -43,8 +43,9 @@ enum oom_constraint {
> extern void compare_swap_oom_score_adj(int old_val, int new_val);
> extern int test_set_oom_score_adj(int new_val);
>
> -extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> - const nodemask_t *nodemask, unsigned long totalpages);
> +extern unsigned long oom_badness(struct task_struct *p,
> + struct mem_cgroup *memcg, const nodemask_t *nodemask,
> + unsigned long totalpages);
> extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
> extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 46bf2ed5..4bbf085 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -180,10 +180,10 @@ static bool oom_unkillable_task(struct task_struct *p,
> * predictable as possible. The goal is to return the highest value for the
> * task consuming the most memory to avoid subsequent oom failures.
> */
> -unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> - const nodemask_t *nodemask, unsigned long totalpages)
> +unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> + const nodemask_t *nodemask, unsigned long totalpages)
> {
> - long points;
> + unsigned long points;
>
> if (oom_unkillable_task(p, memcg, nodemask))
> return 0;
> @@ -198,45 +198,33 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> }
>
> /*
> - * The memory controller may have a limit of 0 bytes, so avoid a divide
> - * by zero, if necessary.
> - */
> - if (!totalpages)
> - totalpages = 1;
> -
> - /*
> * The baseline for the badness score is the proportion of RAM that each
> * task's rss, pagetable and swap space use.
> */
> - points = get_mm_rss(p->mm) + p->mm->nr_ptes;
> - points += get_mm_counter(p->mm, MM_SWAPENTS);
> -
> - points *= 1000;
> - points /= totalpages;
> + points = get_mm_rss(p->mm) + p->mm->nr_ptes +
> + get_mm_counter(p->mm, MM_SWAPENTS);
> task_unlock(p);
>
> /*
> * Root processes get 3% bonus, just like the __vm_enough_memory()
> * implementation used by LSMs.
> */
> - if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> - points -= 30;
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN) && totalpages)
> + points -= 30 * totalpages / 1000;
>
> /*
> * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
> * either completely disable oom killing or always prefer a certain
> * task.
> */
> - points += p->signal->oom_score_adj;
> + points += p->signal->oom_score_adj * totalpages / 1000;
>
> /*
> * Never return 0 for an eligible task that may be killed since it's
> * possible that no single user task uses more than 0.1% of memory and
> * no single admin tasks uses more than 3.0%.
> */
> - if (points <= 0)
> - return 1;
> - return (points < 1000) ? points : 1000;
> + return points <= 0 ? 1 : points;
> }
>
> /*
> @@ -314,7 +302,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> {
> struct task_struct *g, *p;
> struct task_struct *chosen = NULL;
> - *ppoints = 0;
> + unsigned long chosen_points = 0;
>
> do_each_thread(g, p) {
> unsigned int points;
> @@ -354,7 +342,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> */
> if (p == current) {
> chosen = p;
> - *ppoints = 1000;
> + chosen_points = ULONG_MAX;
> } else if (!force_kill) {
> /*
> * If this task is not being ptraced on exit,
> @@ -367,12 +355,13 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> }
>
> points = oom_badness(p, memcg, nodemask, totalpages);
> - if (points > *ppoints) {
> + if (points > chosen_points) {
> chosen = p;
> - *ppoints = points;
> + chosen_points = points;
> }
> } while_each_thread(g, p);
>
> + *ppoints = chosen_points * 1000 / totalpages;
> return chosen;
> }
>

Dave, did you get a chance to test this out? It's something we'll want in
the oom killer so if I can add your Tested-by it would be great. Thanks!


2012-05-03 22:29:57

by Dave Jones

Subject: Re: 3.4-rc4 oom killer out of control.

On Thu, May 03, 2012 at 03:14:09PM -0700, David Rientjes wrote:

> Dave, did you get a chance to test this out? It's something we'll want in
> the oom killer so if I can add your Tested-by it would be great. Thanks!

Yes, this seems to be an improvement in my case (the fuzzer got killed every time
rather than arbitrary system processes).

Feel free to add my Tested-by:

thanks,

Dave

2012-05-17 21:33:33

by David Rientjes

Subject: [patch] mm, oom: normalize oom scores to oom_score_adj scale only for userspace

The oom_score_adj scale ranges from -1000 to 1000 and represents the
proportion of memory available to the process at allocation time. This
means an oom_score_adj value of 300, for example, will bias a process as
though it was using an extra 30.0% of available memory and a value of
-350 will discount 35.0% of available memory from its usage.

The oom killer badness heuristic also uses this scale to report the oom
score for each eligible process in determining the "best" process to
kill. Thus, it can only differentiate each process's memory usage by
0.1% of system RAM.

On large systems, this can end up being a large amount of memory: 256MB
on 256GB systems, for example.

This can be fixed by having the badness heuristic use the actual
memory usage in scoring threads and then normalizing it to the
oom_score_adj scale for userspace. This results in better comparison
between eligible threads for kill and no change from the userspace
perspective.
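
To make the granularity problem concrete, here is a small userspace
sketch (made-up page counts on a hypothetical 256GB machine, not kernel
code): two tasks roughly 234MB apart collapse into the same 0..1000
bucket under the old scheme, while the new scheme compares raw page
counts and only derives the 0..1000 value when reporting to userspace.

#include <stdio.h>

int main(void)
{
	/* hypothetical 256GB machine with 4KB pages and no swap */
	unsigned long long totalpages = 256ULL << 18;	/* 67,108,864 pages */
	unsigned long long task_a = 4970000;		/* pages used by task A */
	unsigned long long task_b = 5030000;		/* ~234MB more than task A */

	/* old heuristic: both tasks get the same 0..1000 score (74) */
	printf("old: A=%llu B=%llu\n",
	       task_a * 1000 / totalpages, task_b * 1000 / totalpages);

	/* new heuristic: raw page counts are compared directly; the
	 * 0..1000 value is only computed for /proc/pid/oom_score */
	printf("new: A=%llu B=%llu, reported score %llu\n",
	       task_a, task_b, task_b * 1000 / totalpages);
	return 0;
}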

Suggested-by: KOSAKI Motohiro <[email protected]>
Tested-by: Dave Jones <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
fs/proc/base.c | 5 +++--
include/linux/oom.h | 5 +++--
mm/oom_kill.c | 39 ++++++++++++++-------------------------
3 files changed, 20 insertions(+), 29 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -410,12 +410,13 @@ static const struct file_operations proc_lstats_operations = {

static int proc_oom_score(struct task_struct *task, char *buffer)
{
+ unsigned long totalpages = totalram_pages + total_swap_pages;
unsigned long points = 0;

read_lock(&tasklist_lock);
if (pid_alive(task))
- points = oom_badness(task, NULL, NULL,
- totalram_pages + total_swap_pages);
+ points = oom_badness(task, NULL, NULL, totalpages) *
+ 1000 / totalpages;
read_unlock(&tasklist_lock);
return sprintf(buffer, "%lu\n", points);
}
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -43,8 +43,9 @@ enum oom_constraint {
extern void compare_swap_oom_score_adj(int old_val, int new_val);
extern int test_set_oom_score_adj(int new_val);

-extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
- const nodemask_t *nodemask, unsigned long totalpages);
+extern unsigned long oom_badness(struct task_struct *p,
+ struct mem_cgroup *memcg, const nodemask_t *nodemask,
+ unsigned long totalpages);
extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -180,10 +180,10 @@ static bool oom_unkillable_task(struct task_struct *p,
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom failures.
*/
-unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
- const nodemask_t *nodemask, unsigned long totalpages)
+unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
+ const nodemask_t *nodemask, unsigned long totalpages)
{
- long points;
+ unsigned long points;

if (oom_unkillable_task(p, memcg, nodemask))
return 0;
@@ -198,45 +198,33 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
}

/*
- * The memory controller may have a limit of 0 bytes, so avoid a divide
- * by zero, if necessary.
- */
- if (!totalpages)
- totalpages = 1;
-
- /*
* The baseline for the badness score is the proportion of RAM that each
* task's rss, pagetable and swap space use.
*/
- points = get_mm_rss(p->mm) + p->mm->nr_ptes;
- points += get_mm_counter(p->mm, MM_SWAPENTS);
-
- points *= 1000;
- points /= totalpages;
+ points = get_mm_rss(p->mm) + p->mm->nr_ptes +
+ get_mm_counter(p->mm, MM_SWAPENTS);
task_unlock(p);

/*
* Root processes get 3% bonus, just like the __vm_enough_memory()
* implementation used by LSMs.
*/
- if (has_capability_noaudit(p, CAP_SYS_ADMIN))
- points -= 30;
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN) && totalpages)
+ points -= 30 * totalpages / 1000;

/*
* /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
* either completely disable oom killing or always prefer a certain
* task.
*/
- points += p->signal->oom_score_adj;
+ points += p->signal->oom_score_adj * totalpages / 1000;

/*
* Never return 0 for an eligible task that may be killed since it's
* possible that no single user task uses more than 0.1% of memory and
* no single admin tasks uses more than 3.0%.
*/
- if (points <= 0)
- return 1;
- return (points < 1000) ? points : 1000;
+ return points <= 0 ? 1 : points;
}

/*
@@ -314,7 +302,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
{
struct task_struct *g, *p;
struct task_struct *chosen = NULL;
- *ppoints = 0;
+ unsigned long chosen_points = 0;

do_each_thread(g, p) {
unsigned int points;
@@ -354,7 +342,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
*/
if (p == current) {
chosen = p;
- *ppoints = 1000;
+ chosen_points = ULONG_MAX;
} else if (!force_kill) {
/*
* If this task is not being ptraced on exit,
@@ -367,12 +355,13 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
}

points = oom_badness(p, memcg, nodemask, totalpages);
- if (points > *ppoints) {
+ if (points > chosen_points) {
chosen = p;
- *ppoints = points;
+ chosen_points = points;
}
} while_each_thread(g, p);

+ *ppoints = chosen_points * 1000 / totalpages;
return chosen;
}

2012-05-17 21:50:26

by Andrew Morton

Subject: Re: [patch] mm, oom: normalize oom scores to oom_score_adj scale only for userspace

On Thu, 17 May 2012 14:33:27 -0700 (PDT)
David Rientjes <[email protected]> wrote:

> The oom_score_adj scale ranges from -1000 to 1000 and represents the
> proportion of memory available to the process at allocation time. This
> means an oom_score_adj value of 300, for example, will bias a process as
> though it was using an extra 30.0% of available memory and a value of
> -350 will discount 35.0% of available memory from its usage.
>
> The oom killer badness heuristic also uses this scale to report the oom
> score for each eligible process in determining the "best" process to
> kill. Thus, it can only differentiate each process's memory usage by
> 0.1% of system RAM.
>
> On large systems, this can end up being a large amount of memory: 256MB
> on 256GB systems, for example.
>
> This can be fixed by having the badness heuristic use the actual
> memory usage in scoring threads and then normalizing it to the
> oom_score_adj scale for userspace. This results in better comparison
> between eligible threads for kill and no change from the userspace
> perspective.
>
> ...
>
> @@ -198,45 +198,33 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> }
>
> /*
> - * The memory controller may have a limit of 0 bytes, so avoid a divide
> - * by zero, if necessary.
> - */
> - if (!totalpages)
> - totalpages = 1;
> -
> - /*
> * The baseline for the badness score is the proportion of RAM that each
> * task's rss, pagetable and swap space use.
> */
> - points = get_mm_rss(p->mm) + p->mm->nr_ptes;
> - points += get_mm_counter(p->mm, MM_SWAPENTS);
> -
> - points *= 1000;
> - points /= totalpages;
> + points = get_mm_rss(p->mm) + p->mm->nr_ptes +
> + get_mm_counter(p->mm, MM_SWAPENTS);
> task_unlock(p);
>
> /*
> * Root processes get 3% bonus, just like the __vm_enough_memory()
> * implementation used by LSMs.
> */
> - if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> - points -= 30;
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN) && totalpages)

There doesn't seem much point in testing totalpages here - it's a
micro-optimisation which adds a branch, on a slow path.

> + points -= 30 * totalpages / 1000;
>
> /*
> * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
> * either completely disable oom killing or always prefer a certain
> * task.
> */
> - points += p->signal->oom_score_adj;
> + points += p->signal->oom_score_adj * totalpages / 1000;

And if we *do* want to add that micro-optimisation, we may as well
extend it to cover this expression also:

	if (totalpages) { /* reason goes here */
		if (has_capability_noaudit(...))
			points -= 30 * totalpages / 1000;
		points += p->signal->oom_score_adj * totalpages / 1000;
	}

> /*
> * Never return 0 for an eligible task that may be killed since it's
> * possible that no single user task uses more than 0.1% of memory and
> * no single admin tasks uses more than 3.0%.
> */
> - if (points <= 0)
> - return 1;
> - return (points < 1000) ? points : 1000;
> + return points <= 0 ? 1 : points;

`points' is unsigned - testing it for negative looks odd.
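
To illustrate (a standalone sketch with made-up numbers, not the kernel
code itself): once `points' is unsigned, subtracting a bonus larger than
the current score wraps around to a huge value instead of going
negative, and "points <= 0" reduces to "points == 0".

#include <stdio.h>

int main(void)
{
	unsigned long points = 10;	/* small task score, in pages */
	unsigned long bonus = 100;	/* a larger root bonus */

	points -= bonus;		/* wraps instead of going negative */
	printf("points = %lu\n", points);		/* a huge value */
	printf("points <= 0: %d\n", points <= 0);	/* 0, never caught */
	return 0;
}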

> }
>
> /*
> @@ -314,7 +302,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> {
> struct task_struct *g, *p;
> struct task_struct *chosen = NULL;
> - *ppoints = 0;
> + unsigned long chosen_points = 0;
>
> do_each_thread(g, p) {
> unsigned int points;
> @@ -354,7 +342,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> */
> if (p == current) {
> chosen = p;
> - *ppoints = 1000;
> + chosen_points = ULONG_MAX;
> } else if (!force_kill) {
> /*
> * If this task is not being ptraced on exit,
> @@ -367,12 +355,13 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> }
>
> points = oom_badness(p, memcg, nodemask, totalpages);
> - if (points > *ppoints) {
> + if (points > chosen_points) {
> chosen = p;
> - *ppoints = points;
> + chosen_points = points;
> }
> } while_each_thread(g, p);
>
> + *ppoints = chosen_points * 1000 / totalpages;

So it's up to the select_bad_process() callers to prevent the
divide-by-zero. It is unobvious that they actually do this, and this
important and unobvious caller requirement is undocumented.

> return chosen;
> }
>

2012-05-23 07:15:09

by David Rientjes

Subject: [patch v2] mm, oom: normalize oom scores to oom_score_adj scale only for userspace

The oom_score_adj scale ranges from -1000 to 1000 and represents the
proportion of memory available to the process at allocation time. This
means an oom_score_adj value of 300, for example, will bias a process as
though it was using an extra 30.0% of available memory and a value of
-350 will discount 35.0% of available memory from its usage.

The oom killer badness heuristic also uses this scale to report the oom
score for each eligible process in determining the "best" process to
kill. Thus, it can only differentiate each process's memory usage by
0.1% of system RAM.

On large systems, this can end up being a large amount of memory: 256MB
on 256GB systems, for example.

This can be fixed by having the badness heuristic use the actual
memory usage in scoring threads and then normalizing it to the
oom_score_adj scale for userspace. This results in better comparison
between eligible threads for kill and no change from the userspace
perspective.

Suggested-by: KOSAKI Motohiro <[email protected]>
Tested-by: Dave Jones <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
fs/proc/base.c | 5 +++--
include/linux/oom.h | 5 +++--
mm/oom_kill.c | 44 ++++++++++++++++----------------------------
3 files changed, 22 insertions(+), 32 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -410,12 +410,13 @@ static const struct file_operations proc_lstats_operations = {

static int proc_oom_score(struct task_struct *task, char *buffer)
{
+ unsigned long totalpages = totalram_pages + total_swap_pages;
unsigned long points = 0;

read_lock(&tasklist_lock);
if (pid_alive(task))
- points = oom_badness(task, NULL, NULL,
- totalram_pages + total_swap_pages);
+ points = oom_badness(task, NULL, NULL, totalpages) *
+ 1000 / totalpages;
read_unlock(&tasklist_lock);
return sprintf(buffer, "%lu\n", points);
}
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -43,8 +43,9 @@ enum oom_constraint {
extern void compare_swap_oom_score_adj(int old_val, int new_val);
extern int test_set_oom_score_adj(int new_val);

-extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
- const nodemask_t *nodemask, unsigned long totalpages);
+extern unsigned long oom_badness(struct task_struct *p,
+ struct mem_cgroup *memcg, const nodemask_t *nodemask,
+ unsigned long totalpages);
extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -180,10 +180,10 @@ static bool oom_unkillable_task(struct task_struct *p,
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom failures.
*/
-unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
- const nodemask_t *nodemask, unsigned long totalpages)
+unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
+ const nodemask_t *nodemask, unsigned long totalpages)
{
- long points;
+ unsigned long points;

if (oom_unkillable_task(p, memcg, nodemask))
return 0;
@@ -198,21 +198,11 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
}

/*
- * The memory controller may have a limit of 0 bytes, so avoid a divide
- * by zero, if necessary.
- */
- if (!totalpages)
- totalpages = 1;
-
- /*
* The baseline for the badness score is the proportion of RAM that each
* task's rss, pagetable and swap space use.
*/
- points = get_mm_rss(p->mm) + p->mm->nr_ptes;
- points += get_mm_counter(p->mm, MM_SWAPENTS);
-
- points *= 1000;
- points /= totalpages;
+ points = get_mm_rss(p->mm) + p->mm->nr_ptes +
+ get_mm_counter(p->mm, MM_SWAPENTS);
task_unlock(p);

/*
@@ -220,23 +210,20 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
* implementation used by LSMs.
*/
if (has_capability_noaudit(p, CAP_SYS_ADMIN))
- points -= 30;
+ points -= 30 * totalpages / 1000;

/*
* /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
* either completely disable oom killing or always prefer a certain
* task.
*/
- points += p->signal->oom_score_adj;
+ points += p->signal->oom_score_adj * totalpages / 1000;

/*
- * Never return 0 for an eligible task that may be killed since it's
- * possible that no single user task uses more than 0.1% of memory and
- * no single admin tasks uses more than 3.0%.
+ * Never return 0 for an eligible task regardless of the root bonus and
+ * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
*/
- if (points <= 0)
- return 1;
- return (points < 1000) ? points : 1000;
+ return points ? points : 1;
}

/*
@@ -314,7 +301,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
{
struct task_struct *g, *p;
struct task_struct *chosen = NULL;
- *ppoints = 0;
+ unsigned long chosen_points = 0;

do_each_thread(g, p) {
unsigned int points;
@@ -354,7 +341,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
*/
if (p == current) {
chosen = p;
- *ppoints = 1000;
+ chosen_points = ULONG_MAX;
} else if (!force_kill) {
/*
* If this task is not being ptraced on exit,
@@ -367,12 +354,13 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
}

points = oom_badness(p, memcg, nodemask, totalpages);
- if (points > *ppoints) {
+ if (points > chosen_points) {
chosen = p;
- *ppoints = points;
+ chosen_points = points;
}
} while_each_thread(g, p);

+ *ppoints = chosen_points * 1000 / totalpages;
return chosen;
}

@@ -572,7 +560,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
}

check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
- limit = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT;
+ limit = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
read_lock(&tasklist_lock);
p = select_bad_process(&points, limit, memcg, NULL, false);
if (p && PTR_ERR(p) != -1UL)

2012-05-23 22:37:21

by Andrew Morton

Subject: Re: [patch v2] mm, oom: normalize oom scores to oom_score_adj scale only for userspace

On Wed, 23 May 2012 00:15:03 -0700 (PDT)
David Rientjes <[email protected]> wrote:

> The oom_score_adj scale ranges from -1000 to 1000 and represents the
> proportion of memory available to the process at allocation time. This
> means an oom_score_adj value of 300, for example, will bias a process as
> though it was using an extra 30.0% of available memory and a value of
> -350 will discount 35.0% of available memory from its usage.
>
> The oom killer badness heuristic also uses this scale to report the oom
> score for each eligible process in determining the "best" process to
> kill. Thus, it can only differentiate each process's memory usage by
> 0.1% of system RAM.
>
> On large systems, this can end up being a large amount of memory: 256MB
> on 256GB systems, for example.
>
> This can be fixed by having the badness heuristic use the actual
> memory usage in scoring threads and then normalizing it to the
> oom_score_adj scale for userspace. This results in better comparison
> between eligible threads for kill and no change from the userspace
> perspective.
>
> ...
>
> @@ -367,12 +354,13 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> }
>
> points = oom_badness(p, memcg, nodemask, totalpages);
> - if (points > *ppoints) {
> + if (points > chosen_points) {
> chosen = p;
> - *ppoints = points;
> + chosen_points = points;
> }
> } while_each_thread(g, p);
>
> + *ppoints = chosen_points * 1000 / totalpages;
> return chosen;
> }
>

It's still not obvious that we always avoid the divide-by-zero here.
If there's some weird way of convincing constrained_alloc() to look at
an empty nodemask, or a nodemask which covers only empty nodes then
blam.

Now, it's probably the case that this is a can't-happen but that
guarantee would be pretty convoluted and fragile?

2012-05-24 06:02:14

by David Rientjes

Subject: Re: [patch v2] mm, oom: normalize oom scores to oom_score_adj scale only for userspace

On Wed, 23 May 2012, Andrew Morton wrote:

> > @@ -367,12 +354,13 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> > }
> >
> > points = oom_badness(p, memcg, nodemask, totalpages);
> > - if (points > *ppoints) {
> > + if (points > chosen_points) {
> > chosen = p;
> > - *ppoints = points;
> > + chosen_points = points;
> > }
> > } while_each_thread(g, p);
> >
> > + *ppoints = chosen_points * 1000 / totalpages;
> > return chosen;
> > }
> >
>
> It's still not obvious that we always avoid the divide-by-zero here.
> If there's some weird way of convincing constrained_alloc() to look at
> an empty nodemask, or a nodemask which covers only empty nodes then
> blam.
>
> Now, it's probably the case that this is a can't-happen but that
> guarantee would be pretty convoluted and fragile?
>

It can only happen for a memcg with a zero limit, something I tried to
prevent in a different patch by not allowing tasks to be attached to
memcgs with such a limit, but you didn't like that :)

So I fixed it in this patch with this:

@@ -572,7 +560,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
}

check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
- limit = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT;
+ limit = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
read_lock(&tasklist_lock);
p = select_bad_process(&points, limit, memcg, NULL, false);
if (p && PTR_ERR(p) != -1UL)
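
For reference, the "? : 1" above is the GNU C conditional with the
middle operand omitted: "x ? : y" evaluates to x when x is nonzero and
to y otherwise, with x evaluated only once. A minimal standalone sketch
with a hypothetical variable name:

#include <stdio.h>

int main(void)
{
	unsigned long limit_pages = 0;	/* memcg configured with a 0-byte limit */

	/* GNU extension: a ? : b  behaves like  a ? a : b */
	unsigned long totalpages = limit_pages ? : 1;

	printf("totalpages = %lu\n", totalpages);	/* 1, so no divide-by-zero */
	return 0;
}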

Cpusets do not allow threads to be attached without a set of mems, nor
the final mem in a cpuset to be removed while tasks are still attached. The
page allocator certainly wouldn't be calling the oom killer for a set of
zones that span no pages.

Any suggestion on where to put the check for !totalpages so it's easier to
understand?