2010-08-25 09:42:15

by KOSAKI Motohiro

Subject: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()

Current oom_score_adj is completely broken because it is strongly bound to
the Google use case and ignores all others.

1) Priority inversion
As kamezawa-san pointed out, this breaks cgroup and lxc environments.
He said,
> Assume 2 processes A and B which have oom_score_adj of 300 and 0,
> and A uses 200M, B uses 1G of memory under a 4G system.
>
> Under the system:
> A's score = (200M * 1000)/4G + 300 = 350
> B's score = (1G * 1000)/4G = 250.
>
> In the cpuset, it has 2G of memory.
> A's score = (200M * 1000)/2G + 300 = 400
> B's score = (1G * 1000)/2G = 500
>
> This priority inversion doesn't happen in the current system.

2) Ratio-based points don't work on large machines
oom_score_adj normalizes the oom score to the 0-1000 range,
but if the machine has 1TB of memory, 1 point (i.e. 0.1%) means
1GB. That is not suitable for a tuning parameter.
As I said, a proportion-oriented tuning parameter has a
scalability risk.

3) No reason to implement ABI breakage.
The old tuning parameter means:
oom-score = oom-base-score x 2^oom_adj
The new tuning parameter means:
oom-score = oom-base-score + oom_score_adj / (totalram + totalswap)
but "oom_score_adj / (totalram + totalswap)" can be calculated in
userland too, because both totalram and totalswap are exposed via
/proc. So there is no reason to introduce a funny new equation.

4) totalram-based normalization assumes a flat memory model.
For example, on an asymmetric NUMA machine, fat-node memory and thin-node
memory might have different weights.
In other words, totalram-based priority is just one policy. A fixed,
workload-dependent policy probably shouldn't be embedded in the kernel.

Then, this patch removes the *UGLY* total_pages stuff completely. Googlers
can calculate it in userland!
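
For illustration only (not part of this patch), a minimal userland sketch of
that calculation: it assumes the post-patch semantics where oom_score_adj is
an absolute page count, reads MemTotal and SwapTotal from /proc/meminfo, and
converts a proportional bias (per mille of RAM+swap, like the old 0..1000
scale) into pages. The program name and interface are made up for the example.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* read one "Key:   <value> kB" field from /proc/meminfo */
static long meminfo_kb(const char *key)
{
    char line[128];
    long val = 0;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, key, strlen(key))) {
            sscanf(line + strlen(key), " %ld", &val);
            break;
        }
    }
    fclose(f);
    return val;
}

int main(int argc, char **argv)
{
    long promille = (argc > 1) ? atol(argv[1]) : 0; /* e.g. 300 == 30% */
    long page_kb = sysconf(_SC_PAGESIZE) / 1024;
    long total_pages = (meminfo_kb("MemTotal:") +
                        meminfo_kb("SwapTotal:")) / page_kb;

    /* absolute page-count bias equivalent to the old proportional value */
    printf("%ld\n", promille * total_pages / 1000);
    return 0;
}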

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
fs/proc/base.c | 33 ++---------
include/linux/oom.h | 16 +-----
include/linux/sched.h | 2 +-
mm/oom_kill.c | 142 ++++++++++++++++++++-----------------------------
4 files changed, 67 insertions(+), 126 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a1c43e7..90ba487 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -434,8 +434,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer)

read_lock(&tasklist_lock);
if (pid_alive(task))
- points = oom_badness(task, NULL, NULL,
- totalram_pages + total_swap_pages);
+ points = oom_badness(task, NULL, NULL);
read_unlock(&tasklist_lock);
return sprintf(buffer, "%lu\n", points);
}
@@ -1056,15 +1055,7 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
current->comm, task_pid_nr(current),
task_pid_nr(task), task_pid_nr(task));
task->signal->oom_adj = oom_adjust;
- /*
- * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
- * value is always attainable.
- */
- if (task->signal->oom_adj == OOM_ADJUST_MAX)
- task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
- else
- task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) /
- -OOM_DISABLE;
+
unlock_task_sighand(task, &flags);
put_task_struct(task);

@@ -1081,8 +1072,8 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
- char buffer[PROC_NUMBUF];
- int oom_score_adj = OOM_SCORE_ADJ_MIN;
+ char buffer[21];
+ long oom_score_adj = 0;
unsigned long flags;
size_t len;

@@ -1093,7 +1084,7 @@ static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
unlock_task_sighand(task, &flags);
}
put_task_struct(task);
- len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj);
+ len = snprintf(buffer, sizeof(buffer), "%ld\n", oom_score_adj);
return simple_read_from_buffer(buf, count, ppos, buffer, len);
}

@@ -1101,7 +1092,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
struct task_struct *task;
- char buffer[PROC_NUMBUF];
+ char buffer[21];
unsigned long flags;
long oom_score_adj;
int err;
@@ -1115,9 +1106,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
if (err)
return -EINVAL;
- if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
- oom_score_adj > OOM_SCORE_ADJ_MAX)
- return -EINVAL;

task = get_proc_task(file->f_path.dentry->d_inode);
if (!task)
@@ -1134,15 +1122,6 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
}

task->signal->oom_score_adj = oom_score_adj;
- /*
- * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
- * always attainable.
- */
- if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
- task->signal->oom_adj = OOM_DISABLE;
- else
- task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) /
- OOM_SCORE_ADJ_MAX;
unlock_task_sighand(task, &flags);
put_task_struct(task);
return count;
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5e3aa83..21006dc 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -12,13 +12,6 @@
#define OOM_ADJUST_MIN (-16)
#define OOM_ADJUST_MAX 15

-/*
- * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
- * pid.
- */
-#define OOM_SCORE_ADJ_MIN (-1000)
-#define OOM_SCORE_ADJ_MAX 1000
-
#ifdef __KERNEL__

#include <linux/sched.h>
@@ -40,8 +33,9 @@ enum oom_constraint {
CONSTRAINT_MEMCG,
};

-extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long totalpages);
+/* The badness from the OOM killer */
+extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
+ const nodemask_t *nodemask);
extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

@@ -62,10 +56,6 @@ static inline void oom_killer_enable(void)
oom_killer_disabled = false;
}

-/* The badness from the OOM killer */
-extern unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long uptime);
-
extern struct task_struct *find_lock_task_mm(struct task_struct *p);

/* sysctls */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e2a6db..5e61d60 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -622,7 +622,7 @@ struct signal_struct {
#endif

int oom_adj; /* OOM kill score adjustment (bit shift) */
- int oom_score_adj; /* OOM kill score adjustment */
+ long oom_score_adj; /* OOM kill score adjustment */
};

/* Context switch must be unlocked if interrupts are to be enabled */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index fc81cb2..b242ddc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -143,55 +143,41 @@ static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
/**
* oom_badness - heuristic function to determine which candidate task to kill
* @p: task struct of which task we should calculate
- * @totalpages: total present RAM allowed for page allocation
*
* The heuristic for determining which task to kill is made to be as simple and
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom failures.
*/
-unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long totalpages)
+unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
+ const nodemask_t *nodemask)
{
- int points;
+ unsigned long points;
+ unsigned long points_orig;
+ int oom_adj = p->signal->oom_adj;
+ long oom_score_adj = p->signal->oom_score_adj;

- if (oom_unkillable_task(p, mem, nodemask))
- return 0;

- p = find_lock_task_mm(p);
- if (!p)
+ if (oom_unkillable_task(p, mem, nodemask))
return 0;
-
- /*
- * Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't
- * need to be executed for something that cannot be killed.
- */
- if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
- task_unlock(p);
+ if (oom_adj == OOM_DISABLE)
return 0;
- }

/*
* When the PF_OOM_ORIGIN bit is set, it indicates the task should have
* priority for oom killing.
*/
- if (p->flags & PF_OOM_ORIGIN) {
- task_unlock(p);
- return 1000;
- }
+ if (p->flags & PF_OOM_ORIGIN)
+ return ULONG_MAX;

- /*
- * The memory controller may have a limit of 0 bytes, so avoid a divide
- * by zero, if necessary.
- */
- if (!totalpages)
- totalpages = 1;
+ p = find_lock_task_mm(p);
+ if (!p)
+ return 0;

/*
* The baseline for the badness score is the proportion of RAM that each
* task's rss and swap space use.
*/
- points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 /
- totalpages;
+ points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS));
task_unlock(p);

/*
@@ -202,15 +188,25 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
points -= 30;

/*
- * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
- * either completely disable oom killing or always prefer a certain
- * task.
+ * Adjust the score by oom_adj and oom_score_adj.
*/
- points += p->signal->oom_score_adj;
+ points_orig = points;
+ points += oom_score_adj;
+ if ((oom_score_adj > 0) && (points < points_orig))
+ points = ULONG_MAX; /* may be overflow */
+ if ((oom_score_adj < 0) && (points > points_orig))
+ points = 1; /* may be underflow */
+
+ if (oom_adj) {
+ if (oom_adj > 0) {
+ if (!points)
+ points = 1;
+ points <<= oom_adj;
+ } else
+ points >>= -(oom_adj);
+ }

- if (points < 0)
- return 0;
- return (points < 1000) ? points : 1000;
+ return points;
}

/*
@@ -218,17 +214,11 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
*/
#ifdef CONFIG_NUMA
static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask, nodemask_t *nodemask,
- unsigned long *totalpages)
+ gfp_t gfp_mask, nodemask_t *nodemask)
{
struct zone *zone;
struct zoneref *z;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
- bool cpuset_limited = false;
- int nid;
-
- /* Default to all available memory */
- *totalpages = totalram_pages + total_swap_pages;

if (!zonelist)
return CONSTRAINT_NONE;
@@ -245,33 +235,21 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
* the page allocator means a mempolicy is in effect. Cpuset policy
* is enforced in get_page_from_freelist().
*/
- if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
- *totalpages = total_swap_pages;
- for_each_node_mask(nid, *nodemask)
- *totalpages += node_spanned_pages(nid);
+ if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
return CONSTRAINT_MEMORY_POLICY;
- }

/* Check this allocation failure is caused by cpuset's wall function */
for_each_zone_zonelist_nodemask(zone, z, zonelist,
high_zoneidx, nodemask)
if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
- cpuset_limited = true;
+ return CONSTRAINT_CPUSET;

- if (cpuset_limited) {
- *totalpages = total_swap_pages;
- for_each_node_mask(nid, cpuset_current_mems_allowed)
- *totalpages += node_spanned_pages(nid);
- return CONSTRAINT_CPUSET;
- }
return CONSTRAINT_NONE;
}
#else
static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask, nodemask_t *nodemask,
- unsigned long *totalpages)
+ gfp_t gfp_mask, nodemask_t *nodemask)
{
- *totalpages = totalram_pages + total_swap_pages;
return CONSTRAINT_NONE;
}
#endif
@@ -282,16 +260,16 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
*
* (not docbooked, we don't want this one cluttering up the manual)
*/
-static struct task_struct *select_bad_process(unsigned int *ppoints,
- unsigned long totalpages, struct mem_cgroup *mem,
- const nodemask_t *nodemask)
+static struct task_struct *select_bad_process(unsigned long *ppoints,
+ struct mem_cgroup *mem,
+ const nodemask_t *nodemask)
{
struct task_struct *p;
struct task_struct *chosen = NULL;
*ppoints = 0;

for_each_process(p) {
- unsigned int points;
+ unsigned long points;

if (oom_unkillable_task(p, mem, nodemask))
continue;
@@ -323,10 +301,10 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
return ERR_PTR(-1UL);

chosen = p;
- *ppoints = 1000;
+ *ppoints = ULONG_MAX;
}

- points = oom_badness(p, mem, nodemask, totalpages);
+ points = oom_badness(p, mem, nodemask);
if (points > *ppoints) {
chosen = p;
*ppoints = points;
@@ -371,7 +349,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
continue;
}

- pr_info("[%5d] %5d %5d %8lu %8lu %3u %3d %5d %s\n",
+ pr_info("[%5d] %5d %5d %8lu %8lu %3u %3d %5ld %s\n",
task->pid, task_uid(task), task->tgid,
task->mm->total_vm, get_mm_rss(task->mm),
task_cpu(task), task->signal->oom_adj,
@@ -385,7 +363,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
{
task_lock(current);
pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
- "oom_adj=%d, oom_score_adj=%d\n",
+ "oom_adj=%d, oom_score_adj=%ld\n",
current->comm, gfp_mask, order, current->signal->oom_adj,
current->signal->oom_score_adj);
cpuset_print_task_mems_allowed(current);
@@ -426,14 +404,13 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
#undef K

static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
- unsigned int points, unsigned long totalpages,
- struct mem_cgroup *mem, nodemask_t *nodemask,
- const char *message)
+ unsigned long points, struct mem_cgroup *mem,
+ nodemask_t *nodemask, const char *message)
{
struct task_struct *victim = p;
struct task_struct *child;
struct task_struct *t = p;
- unsigned int victim_points = 0;
+ unsigned long victim_points = 0;

if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem);
@@ -449,7 +426,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
}

task_lock(p);
- pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+ pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
message, task_pid_nr(p), p->comm, points);
task_unlock(p);

@@ -461,13 +438,12 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
*/
do {
list_for_each_entry(child, &t->children, sibling) {
- unsigned int child_points;
+ unsigned long child_points;

/*
* oom_badness() returns 0 if the thread is unkillable
*/
- child_points = oom_badness(child, mem, nodemask,
- totalpages);
+ child_points = oom_badness(child, mem, nodemask);
if (child_points > victim_points) {
victim = child;
victim_points = child_points;
@@ -505,19 +481,17 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
{
- unsigned long limit;
- unsigned int points = 0;
+ unsigned long points = 0;
struct task_struct *p;

check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0);
- limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT;
read_lock(&tasklist_lock);
retry:
- p = select_bad_process(&points, limit, mem, NULL);
+ p = select_bad_process(&points, mem, NULL);
if (!p || PTR_ERR(p) == -1UL)
goto out;

- if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, NULL,
+ if (oom_kill_process(p, gfp_mask, 0, points, mem, NULL,
"Memory cgroup out of memory"))
goto retry;
out:
@@ -642,9 +616,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
int order, nodemask_t *nodemask)
{
struct task_struct *p;
- unsigned long totalpages;
unsigned long freed = 0;
- unsigned int points;
+ unsigned long points;
enum oom_constraint constraint = CONSTRAINT_NONE;
int killed = 0;

@@ -668,8 +641,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
- constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
- &totalpages);
+ constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
check_panic_on_oom(constraint, gfp_mask, order);

read_lock(&tasklist_lock);
@@ -681,14 +653,14 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
* non-zero, current could not be killed so we must fallback to
* the tasklist scan.
*/
- if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
+ if (!oom_kill_process(current, gfp_mask, order, 0,
NULL, nodemask,
"Out of memory (oom_kill_allocating_task)"))
goto out;
}

retry:
- p = select_bad_process(&points, totalpages, NULL,
+ p = select_bad_process(&points, NULL,
constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
NULL);
if (PTR_ERR(p) == -1UL)
@@ -701,7 +673,7 @@ retry:
panic("Out of memory and no killable processes...\n");
}

- if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
+ if (oom_kill_process(p, gfp_mask, order, points, NULL,
nodemask, "Out of memory"))
goto retry;
killed = 1;
--
1.6.5.2



2010-08-25 09:42:57

by KOSAKI Motohiro

Subject: [PATCH 2/2][BUGFIX] Revert "oom: deprecate oom_adj tunable"

oom_adj is not only used as a kernel knob, but also as an
application interface.
So adding a new knob is not a good reason to deprecate it.

Also, after the former patch, oom_score_adj can't be used for setting
OOM_DISABLE. We still need the "echo -17 > /proc/<pid>/oom_adj" interface.

This reverts commit 51b1bd2ace1595b72956224deda349efa880b693.
---
Documentation/feature-removal-schedule.txt | 25 -------------------------
Documentation/filesystems/proc.txt | 3 ---
fs/proc/base.c | 8 --------
include/linux/oom.h | 3 ---
4 files changed, 0 insertions(+), 39 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 842aa9d..aff4d11 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -151,31 +151,6 @@ Who: Eric Biederman <[email protected]>

---------------------------

-What: /proc/<pid>/oom_adj
-When: August 2012
-Why: /proc/<pid>/oom_adj allows userspace to influence the oom killer's
- badness heuristic used to determine which task to kill when the kernel
- is out of memory.
-
- The badness heuristic has since been rewritten since the introduction of
- this tunable such that its meaning is deprecated. The value was
- implemented as a bitshift on a score generated by the badness()
- function that did not have any precise units of measure. With the
- rewrite, the score is given as a proportion of available memory to the
- task allocating pages, so using a bitshift which grows the score
- exponentially is, thus, impossible to tune with fine granularity.
-
- A much more powerful interface, /proc/<pid>/oom_score_adj, was
- introduced with the oom killer rewrite that allows users to increase or
- decrease the badness() score linearly. This interface will replace
- /proc/<pid>/oom_adj.
-
- A warning will be emitted to the kernel log if an application uses this
- deprecated interface. After it is printed once, future warnings will be
- suppressed until the kernel is rebooted.
-
----------------------------
-
What: remove EXPORT_SYMBOL(kernel_thread)
When: August 2006
Files: arch/*/kernel/*_ksyms.c
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index a6aca87..cf1295c 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1285,9 +1285,6 @@ scaled linearly with /proc/<pid>/oom_score_adj.
Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
other with its scaled value.

-NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
-Documentation/feature-removal-schedule.txt.
-
Caveat: when a parent task is selected, the oom killer will sacrifice any first
generation children with seperate address spaces instead, if possible. This
avoids servers and important system daemons from being killed and loses the
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 90ba487..55a16f2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1046,14 +1046,6 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
return -EACCES;
}

- /*
- * Warn that /proc/pid/oom_adj is deprecated, see
- * Documentation/feature-removal-schedule.txt.
- */
- printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
- "please use /proc/%d/oom_score_adj instead.\n",
- current->comm, task_pid_nr(current),
- task_pid_nr(task), task_pid_nr(task));
task->signal->oom_adj = oom_adjust;

unlock_task_sighand(task, &flags);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 21006dc..394f2e6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -2,9 +2,6 @@
#define __INCLUDE_LINUX_OOM_H

/*
- * /proc/<pid>/oom_adj is deprecated, see
- * Documentation/feature-removal-schedule.txt.
- *
* /proc/<pid>/oom_adj set to -17 protects from the oom-killer
*/
#define OOM_DISABLE (-17)
--
1.6.5.2


2010-08-25 10:25:36

by David Rientjes

Subject: Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()

On Wed, 25 Aug 2010, KOSAKI Motohiro wrote:

> Current oom_score_adj is completely broken because it is strongly bound to
> the Google use case and ignores all others.
>

That's wrong, we don't even use this heuristic yet and there is nothing,
in any way, that is specific to Google.

> 1) Priority inversion
> As kamezawa-san pointed out, this breaks cgroup and lxc environments.
> He said,
> > Assume 2 processes A and B which have oom_score_adj of 300 and 0,
> > and A uses 200M, B uses 1G of memory under a 4G system.
> >
> > Under the system:
> > A's score = (200M * 1000)/4G + 300 = 350
> > B's score = (1G * 1000)/4G = 250.
> >
> > In the cpuset, it has 2G of memory.
> > A's score = (200M * 1000)/2G + 300 = 400
> > B's score = (1G * 1000)/2G = 500
> >
> > This priority inversion doesn't happen in the current system.
>

You continually bring this up, and I've answered it three times, but
you've never responded to it before and completely ignore it. I really
hope and expect that you'll participate more in the development process
and not continue to reiterate your talking points when you have no answer
to my response.

You're wrong, especially with regard to cpusets, which was formerly part
of the heuristic itself.

Users bind an aggregate of tasks to a cgroup (cpusets or memcg) as a means
of isolation and attach a set of resources (memory, in this case) for
those tasks to use. The user who does this is fully aware of the set of
tasks being bound, there is no mystery or unexpected results when doing
so. So when you set an oom_score_adj for a task, you don't necessarily
need to be aware of the set of resources it has available, which is
dynamic and an attribute of the system or cgroup, but rather the priority
of that task in competition with other tasks for the same resources.

_That_ is what is important in having a userspace influence on a badness
heuristic: how those badness scores compare relative to other tasks that
share the same resources. That's how a task is chosen for oom kill, not
because of a static formula such as you're introducing here that outputs a
value (and, thus, a priority) regardless of the context in which the task
is bound.

That also means that the same task is not necessarily killed in a
cpuset-constrained oom compared to a system-wide oom. If you bias a task
by 30% of available memory, which Kame did in his example above, it's
entirely plausible that task A should be killed because its actual usage
is only 1/20th of the machine. When its cpuset is oom, and the admin has
specifically bound that task to only 2G of memory, we'd naturally want to
kill the memory hogger, that is using 50% of the total memory available to
it.

> 2) Ratio-based points don't work on large machines
> oom_score_adj normalizes the oom score to the 0-1000 range,
> but if the machine has 1TB of memory, 1 point (i.e. 0.1%) means
> 1GB. That is not suitable for a tuning parameter.
> As I said, a proportion-oriented tuning parameter has a
> scalability risk.
>

So you'd rather use the range of oom_adj from -17 to +15 instead of
oom_score_adj from -1000 to +1000 where each point is 68GB? I don't
understand your point here as to why oom_score_adj is worse.

But, yes, in reality we don't really care about the granularity so much
that we need to prioritize a task using 512MB more memory than another to
break the tie on a 1TB machine, 1/2048th of its memory.

> 3) No reason to implement ABI breakage.
> The old tuning parameter means:
> oom-score = oom-base-score x 2^oom_adj

Everybody knows this is useless beyond polarizing a task for kill or
making it immune.

> The new tuning parameter means:
> oom-score = oom-base-score + oom_score_adj / (totalram + totalswap)

This, on the other hand, has an actual unit (proportion of available
memory) that can be used to prioritize tasks amongst those competing for
the same set of shared resources and remains constant even when a task
changes cpuset, its memcg limit changes, etc.

And your equation is wrong, it's

((rss + swap) / (available ram + swap)) + oom_score_adj

which is completely different from what you think it is.

> but "oom_score_adj / (totalram + totalswap)" can be calculated in
> userland too, because both totalram and totalswap are exposed via
> /proc. So there is no reason to introduce a funny new equation.
>

Yup, it definitely can, which is why as I mentioned to Kame (who doesn't
have strong feelings either way, even though you quote him as having these
strong objections) that you can easily convert oom_score_adj into a
stand-alone memory quantity (biasing or forgiving 512MB of a task's
memory, for example) in the context it is currently attached to with
simple arithmetic in userspace. That's why oom_score_adj is powerful.
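
To make that arithmetic concrete (a purely illustrative example, assuming a 4G
machine with no swap, so one oom_score_adj point is roughly 4MB):

    512M * 1000 / 4G ~= 125

so writing an oom_score_adj of -125 forgives roughly 512MB of the task's
memory, and +125 biases it by the same amount.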

> 4) totalram-based normalization assumes a flat memory model.
> For example, on an asymmetric NUMA machine, fat-node memory and thin-node
> memory might have different weights.
> In other words, totalram-based priority is just one policy. A fixed,
> workload-dependent policy probably shouldn't be embedded in the kernel.
>

I don't know what this means, and this was your criticism before I changed
the denominator during the revision of the patchset, so it's probably
obsolete. oom_score_adj always operates based on the proportion of
memory available to the application which is how the new oom killer
determines which tasks to kill: relative to the importance (if defined by
userspace) and memory usage compared to other tasks competing for it.

> Then, this patch removes the *UGLY* total_pages stuff completely. Googlers
> can calculate it in userland!
>

Nothing specific about any of this to Google. Users who actually setup
their machines to use mempolicies, cpusets, or memcgs actually do want a
powerful interface from userspace to tune the priorities in terms of both
business goals and also importance of the task. That is done much more
powerfully now with oom_score_adj than the previous implementation. Users
who don't use these cgroups, especially desktop users, can see
oom_score_adj in terms of a memory quantity that remains static: they
aren't going to encounter changing memcg limits, cpuset mems, etc.

That said, I really don't know why you keep mentioning "Google this" and
"Google that" when the company I'm working for is really irrelevant to
this discussion.

With that, I respectfully nack your patch.

2010-08-25 10:27:54

by David Rientjes

Subject: Re: [PATCH 2/2][BUGFIX] Revert "oom: deprecate oom_adj tunable"

On Wed, 25 Aug 2010, KOSAKI Motohiro wrote:

> oom_adj is not only used as a kernel knob, but also as an
> application interface.
> So adding a new knob is not a good reason to deprecate it.
>
> Also, after the former patch, oom_score_adj can't be used for setting
> OOM_DISABLE. We still need the "echo -17 > /proc/<pid>/oom_adj" interface.
>
> This reverts commit 51b1bd2ace1595b72956224deda349efa880b693.

Since I nacked the parent patch of this, I implicitly nack this one as
well since oom_score_adj shouldn't be going anywhere. The way to disable
oom killing for a task via the new interface, /proc/pid/oom_score_adj, is
by writing OOM_SCORE_ADJ_MIN to it, as specified in the documentation.

2010-08-26 00:44:28

by Kamezawa Hiroyuki

Subject: Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()

On Wed, 25 Aug 2010 03:25:25 -0700 (PDT)
David Rientjes <[email protected]> wrote:
>
> > 3) No reason to implement ABI breakage.
> > old tuning parameter mean)
> > oom-score = oom-base-score x 2^oom_adj
>
> Everybody knows this is useless beyond polarizing a task for kill or
> making it immune.
>
> > new tuning parameter mean)
> > oom-score = oom-base-score + oom_score_adj / (totalram + totalswap)
>
> This, on the other hand, has an actual unit (proportion of available
> memory) that can be used to prioritize tasks amongst those competing for
> the same set of shared resources and remains constant even when a task
> changes cpuset, its memcg limit changes, etc.
>
> And your equation is wrong, it's
>
> ((rss + swap) / (available ram + swap)) + oom_score_adj
>
> which is completely different from what you think it is.
>

I'm now trying to write a userspace tool to calculate this, for me.
Then, could you update documentation ?
==
3.2 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------

This file can be used to check the current score used by the oom-killer is for
any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
process should be killed in an out-of-memory situation.
==

add some documentation like:
==
(For system monitoring tool developers, not for usual users.)
The oom_score calculation is implementation dependent and may change without
notice. But the current logic is

oom_score = ((proc's rss + proc's swap) / (available ram + swap)) + oom_score_adj

The proc's rss and swap can be obtained via /proc/<pid>/statm, and available ram + swap
depends on the situation.
If the whole system is under oom,
available ram == /proc/meminfo's MemTotal
available swap == in most cases == /proc/meminfo's SwapTotal
When you use a memory cgroup:
When swap is limited, available ram + swap == the memory cgroup's memsw limit.
When swap is unlimited, available ram + swap == the memory cgroup's memory limit +
SwapTotal

Then, please be careful that the order of oom_scores among tasks depends on the
situation. Assume 2 processes A and B which have oom_score_adj of 300 and 0,
and A uses 200M, B uses 1G of memory under a 4G system.

Under the 4G system:
A's score = (200M * 1000)/4G + 300 = 350
B's score = (1G * 1000)/4G = 250.

In the memory cgroup, it has 2G of resource.
A's score = (200M * 1000)/2G + 300 = 400
B's score = (1G * 1000)/2G = 500

You shouldn't depend on /proc/<pid>/oom_score if you have to handle OOM under
cgroups and cpusets. But the logic is simple.
==

If you don't want that, I'll add the text and a sample tool to cgroup/memory.txt.
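
A rough sketch of what such a sample tool could look like for the system-wide
case (illustrative only; it assumes VmRSS and VmSwap in /proc/<pid>/status
rather than statm, and uses MemTotal + SwapTotal from /proc/meminfo as the
"available ram + swap"):

#include <stdio.h>
#include <string.h>

/* read one "Key:   <value> kB" field from a /proc file */
static long grab_kb(const char *path, const char *key)
{
    char line[256];
    long val = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, key, strlen(key))) {
            sscanf(line + strlen(key), " %ld", &val);
            break;
        }
    }
    fclose(f);
    return val;
}

int main(int argc, char **argv)
{
    char path[64];
    long rss, swap, total, adj = 0;
    FILE *f;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
    rss = grab_kb(path, "VmRSS:");
    swap = grab_kb(path, "VmSwap:");
    total = grab_kb("/proc/meminfo", "MemTotal:") +
            grab_kb("/proc/meminfo", "SwapTotal:");
    if (!total)
        total = 1;  /* avoid divide by zero */

    snprintf(path, sizeof(path), "/proc/%s/oom_score_adj", argv[1]);
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &adj) != 1)
            adj = 0;
        fclose(f);
    }

    /* (task's proportion of memory, scaled to 0..1000) + oom_score_adj */
    printf("%ld\n", (rss + swap) * 1000 / total + adj);
    return 0;
}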


Thanks,
-Kame








2010-08-26 00:52:15

by David Rientjes

Subject: Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()

On Thu, 26 Aug 2010, KAMEZAWA Hiroyuki wrote:

> I'm now trying to write a userspace tool to calculate this, for me.
> Then, could you update documentation ?
> ==
> 3.2 /proc/<pid>/oom_score - Display current oom-killer score
> -------------------------------------------------------------
>
> This file can be used to check the current score used by the oom-killer is for
> any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
> process should be killed in an out-of-memory situation.
> ==
>

You'll want to look at section 3.1 of Documentation/filesystems/proc.txt,
which describes /proc/pid/oom_score_adj, not 3.2.

> add some documentation like:
> ==
> (For system monitoring tool developers, not for usual users.)
> The oom_score calculation is implementation dependent and may change without
> notice. But the current logic is
>
> oom_score = ((proc's rss + proc's swap) / (available ram + swap)) + oom_score_adj
>

I'd hesitate to state the formula outside of the implementation and
instead focus on the semantics of oom_score_adj (as a proportion of
available memory compared to other tasks), which I tried doing in section
3.1. Then, the userspace tool only needs to be concerned about the units of
oom_score_adj rather than whether rss, swap, or later extensions such as
shm are added.

Thanks for working on this, Kame!

2010-08-26 01:08:55

by Kamezawa Hiroyuki

Subject: Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()

On Wed, 25 Aug 2010 17:52:06 -0700 (PDT)
David Rientjes <[email protected]> wrote:

> On Thu, 26 Aug 2010, KAMEZAWA Hiroyuki wrote:
>
> > I'm now trying to write a userspace tool to calculate this, for me.
> > Then, could you update documentation ?
> > ==
> > 3.2 /proc/<pid>/oom_score - Display current oom-killer score
> > -------------------------------------------------------------
> >
> > This file can be used to check the current score used by the oom-killer is for
> > any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
> > process should be killed in an out-of-memory situation.
> > ==
> >
>
> You'll want to look at section 3.1 of Documentation/filesystems/proc.txt,
> which describes /proc/pid/oom_score_adj, not 3.2.
>
> > add some documentation like:
> > ==
> > (For system monitoring tool developers, not for usual users.)
> > The oom_score calculation is implementation dependent and may change without
> > notice. But the current logic is
> >
> > oom_score = ((proc's rss + proc's swap) / (available ram + swap)) + oom_score_adj
> >
>
> I'd hesitate to state the formula outside of the implementation and
> instead focus on the semantics of oom_score_adj (as a proportion of
> available memory compared to other tasks), which I tried doing in section
> 3.1. Then, the userspace tool only needs to be concerned about the units of
> oom_score_adj rather than whether rss, swap, or later extensions such as
> shm are added.
>
> Thanks for working on this, Kame!
>
BTW, why don't you subtract the amount of hugepages?

The old code used
"totalram_pages - hugepages" as available memory.

IIUC, the number of hugepages is not accounted in mm->rss, so isn't it
better to subtract the number of hugepages?
Hmm... or does it make no difference?

Thanks,
-Kame





2010-08-26 01:16:44

by Kamezawa Hiroyuki

Subject: Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()

On Wed, 25 Aug 2010 17:52:06 -0700 (PDT)
David Rientjes <[email protected]> wrote:

> On Thu, 26 Aug 2010, KAMEZAWA Hiroyuki wrote:
>
> > I'm now trying to write a userspace tool to calculate this, for me.
> > Then, could you update documentation ?
> > ==
> > 3.2 /proc/<pid>/oom_score - Display current oom-killer score
> > -------------------------------------------------------------
> >
> > This file can be used to check the current score used by the oom-killer is for
> > any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
> > process should be killed in an out-of-memory situation.
> > ==
> >
>
> You'll want to look at section 3.1 of Documentation/filesystems/proc.txt,
> which describes /proc/pid/oom_score_adj, not 3.2.
>
> > add some documentation like:
> > ==
> > (For system monitoring tool developers, not for usual users.)
> > The oom_score calculation is implementation dependent and may change without
> > notice. But the current logic is
> >
> > oom_score = ((proc's rss + proc's swap) / (available ram + swap)) + oom_score_adj
> >
>
> I'd hesitate to state the formula outside of the implementation and
> instead focus on the semantics of oom_score_adj (as a proportion of
> available memory compared to other tasks), which I tried doing in section
> 3.1. Then, the userspace tool only needs to be concerned about the units of
> oom_score_adj rather than whether rss, swap, or later extensions such as
> shm are added.
>
Hmm. I'll add text like the following to cgroup/memory.txt. OK?

==
Notes on oom_score and oom_score_adj.

oom_score is calculated as
oom_score = (task's proportion of memory) + oom_score_adj.

Then, when you use oom_score_adj to control the order of priority of oom,
you should know about the amount of memory you can use.
So, an approximate oom_score under memcg can be

memcg_oom_score = (oom_score - oom_score_adj) * system_memory/memcg's limit
+ oom_score_adj.

And yes, this can be affected by memcg's hierarchy control, and the calculation
will be more complicated. See also the oom_disable feature.
==
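
As a purely illustrative check of that formula, using the numbers from the
earlier example in this thread (a task using 200M with oom_score_adj = 300,
a 4G system, and a 2G memcg limit; values rounded):

    system-wide oom_score = (200M * 1000)/4G + 300 = 50 + 300 = 350
    memcg_oom_score       = (350 - 300) * 4G/2G + 300 = 100 + 300 = 400

which matches the cpuset example quoted at the start of this thread.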

Thanks,
-Kame











2010-08-26 02:50:33

by David Rientjes

Subject: Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()

On Thu, 26 Aug 2010, KAMEZAWA Hiroyuki wrote:

> Hmm. I'll add text like the following to cgroup/memory.txt. OK?
>
> ==
> Notes on oom_score and oom_score_adj.
>
> oom_score is calculated as
> oom_score = (task's proportion of memory) + oom_score_adj.
>

I'd replace "memory" with "memory limit (or memsw limit)" so it's clear
we're talking about the amount of memory available to the task.

> Then, when you use oom_score_adj to control the order of priority of oom,
> you should know about the amount of memory you can use.

Hmm, you need to know the amount of memory that you can use iff you know
the memcg limit and it's a static value. Otherwise, you only need to know
the "memory usage of your application relative to others in the same
cgroup." An oom_score_adj of +300 adds 30% of that memcg's limit to the
task, allowing all other tasks to use 30% more memory than that task with
it still being killed. An oom_score_adj of -300 allows that task to use 30%
more memory than other tasks without getting killed. These don't need to
know the actual limit.

> So, an approximate oom_score under memcg can be
>
> memcg_oom_score = (oom_score - oom_score_adj) * system_memory/memcg's limit
> + oom_score_adj.
>

Right, that's the exact score within the memcg.

But, I still wouldn't encourage a formula like this because the memcg
limit (or cpuset mems, mempolicy nodes, etc) are dynamic and may change
out from under us. So it's more important to define oom_score_adj in the
user's mind as a proportion of memory available to be added (either
positively or negatively) to its memory use when comparing it to other
tasks. The point is that the memcg limit isn't interesting in this
formula, it's more important to understand the priority of the task
_compared_ to other tasks memory usage in that memcg.

It probably would be helpful, though, to know that for a vital system task
that uses 1G, for instance, in a 4G memcg, an oom_score_adj of -250 will
disable oom killing for it. If that task leaks memory or becomes
significantly large, for whatever reason, it could be killed, but we _can_
discount the 1G in comparison to other tasks as the "cost of doing
business" when it comes to vital system tasks:

(memory usage) * (memory+swap limit / system memory)
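
A quick check of those numbers (illustrative, assuming no swap): 1G of usage
in a 4G memcg scores 1G * 1000 / 4G = 250 points, so an oom_score_adj of -250
discounts exactly that 1G.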

2010-08-26 03:25:31

by Kamezawa Hiroyuki

Subject: Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()

On Wed, 25 Aug 2010 19:50:22 -0700 (PDT)
David Rientjes <[email protected]> wrote:

> On Thu, 26 Aug 2010, KAMEZAWA Hiroyuki wrote:
>
> > Hmm. I'll add text like the following to cgroup/memory.txt. OK?
> >
> > ==
> > Notes on oom_score and oom_score_adj.
> >
> > oom_score is calculated as
> > oom_score = (task's proportion of memory) + oom_score_adj.
> >
>
> I'd replace "memory" with "memory limit (or memsw limit)" so it's clear
> we're talking about the amount of memory available to the task.
>
ok.

> > Then, when you use oom_score_adj to control the order of priority of oom,
> > you should know about the amount of memory you can use.
>
> Hmm, you need to know the amount of memory that you can use iff you know
> the memcg limit and it's a static value. Otherwise, you only need to know
> the "memory usage of your application relative to others in the same
> cgroup." An oom_score_adj of +300 adds 30% of that memcg's limit to the
> task, allowing all other tasks to use 30% more memory than that task with
> it still be killed. An oom_score_adj of -300 allows that task to use 30%
> more memory than other tasks without getting killed. These don't need to
> know the actual limit.
>

Hmm. What's complicated is oom_score_adj's behavior.


> > So, an approximate oom_score under memcg can be
> >
> > memcg_oom_score = (oom_score - oom_score_adj) * system_memory/memcg's limit
> > + oom_score_adj.
> >
>
> Right, that's the exact score within the memcg.
>
> But, I still wouldn't encourage a formula like this because the memcg
> limit (or cpuset mems, mempolicy nodes, etc) are dynamic and may change
> out from under us. So it's more important to define oom_score_adj in the
> user's mind as a proportion of memory available to be added (either
> positively or negatively) to its memory use when comparing it to other
> tasks. The point is that the memcg limit isn't interesting in this
> formula, it's more important to understand the priority of the task
> _compared_ to other tasks memory usage in that memcg.
>

Yes. For defining/understanding priority, oom_score_adj is that.
But its priority isn't static.

> It probably would be helpful, though, to know that for a vital system task
> that uses 1G, for instance, in a 4G memcg, an oom_score_adj of -250 will
> disable oom killing for it.

yes.

> If that task leaks memory or becomes
> significantly large, for whatever reason, it could be killed, but we _can_
> discount the 1G in comparison to other tasks as the "cost of doing
> business" when it comes to vital system tasks:
>
> (memory usage) * (memory+swap limit / system memory)
>

Yes. Under an 8G system, -250 will allow ignoring 2G of usage.

== How about this text ? ==

When you set a task's oom_score_adj, it can get priority not to be oom-killed.
oom_score_adj gives a priority proportional to the memory limit.

Assume you set oom_score_adj to -250.

Under a 4G memory limit, it gets a 25% bonus... a 1G memory bonus for avoiding OOM.
Under an 8G memory limit, it gets a 25% bonus... a 2G memory bonus for avoiding OOM.

Then, what bonus a task can get depends on the context of the OOM. If you use
oom_score_adj and want to give a bonus to a task, setting it with regard to the
minimum memory limit which the task is under will work well.
==

Thanks,
-Kame

2010-08-26 03:52:21

by David Rientjes

Subject: Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()

On Thu, 26 Aug 2010, KAMEZAWA Hiroyuki wrote:

> > Hmm, you need to know the amount of memory that you can use iff you know
> > the memcg limit and it's a static value. Otherwise, you only need to know
> > the "memory usage of your application relative to others in the same
> > cgroup." An oom_score_adj of +300 adds 30% of that memcg's limit to the
> > task, allowing all other tasks to use 30% more memory than that task with
> > it still being killed. An oom_score_adj of -300 allows that task to use 30%
> > more memory than other tasks without getting killed. These don't need to
> > know the actual limit.
> >
>
> Hmm. What's complicated is oom_score_adj's behavior.
>

I understand it's a little tricky when dealing with memcg-constrained oom
conditions versus system-wide oom conditions. Think of it this way: if
the system is oom, then every memcg, every cpuset, and every mempolicy is
also oom. That doesn't imply that something in every memcg, every cpuset,
or every mempolicy must be killed, however. What cgroup happens to be
penalized in this scenario isn't necessarily the scope of oom_score_adj's
purpose. oom_score_adj certainly does have a stronger influence over a
task's priority when it's a system oom and not a memcg oom because the
size of available memory is different, but that's fine: we set positive
and negative oom_score_adj values for a reason based on the application,
and that's not necessarily (but can be) a function of the memcg or system
capacity. Again, oom_score_adj is only meaningful when considered
relative to other candidate tasks since the badness score itself is
considered relative to other candidate tasks.

You can have multiple tasks that have +1000 oom_score_adj values (or
multiple tasks that have +15 oom_adj values). Only one will be killed and
it's dependent only on the ordering of the tasklist. That isn't an
exception case, that only means that we prevented needless oom killing.

> > > So, an approximate oom_score under memcg can be
> > >
> > > memcg_oom_score = (oom_score - oom_score_adj) * system_memory/memcg's limit
> > > + oom_score_adj.
> > >
> >
> > Right, that's the exact score within the memcg.
> >
> > But, I still wouldn't encourage a formula like this because the memcg
> > limit (or cpuset mems, mempolicy nodes, etc) are dynamic and may change
> > out from under us. So it's more important to define oom_score_adj in the
> > user's mind as a proportion of memory available to be added (either
> > positively or negatively) to its memory use when comparing it to other
> > tasks. The point is that the memcg limit isn't interesting in this
> > formula, it's more important to understand the priority of the task
> > _compared_ to other tasks memory usage in that memcg.
> >
>
> Yes. For defining/understanding priority, oom_score_adj is that.
> But its priority isn't static.
>

If the memcg limit changes because we're attaching more tasks, yes, we may
want to change its oom_score_adj relative to those tasks. So
oom_score_adj is a function of the attached tasks and its allowed set of
resources in comparison to them, not the limit itself.

> > If that task leaks memory or becomes
> > significantly large, for whatever reason, it could be killed, but we _can_
> > discount the 1G in comparison to other tasks as the "cost of doing
> > business" when it comes to vital system tasks:
> >
> > (memory usage) * (memory+swap limit / system memory)
> >
>
> Yes. Under an 8G system, -250 will allow ignoring 2G of usage.
>

Yeah, that conversion could be useful if the system RAM capacity or memcg
limit, etc, remains static.

> == How about this text ? ==
>
> When you set a task's oom_score_adj, it can get priority not to be oom-killed.

And the reverse, it can get priority to be killed :)

> oom_score_adj gives a priority proportional to the memory limit.
>
> Assume you set oom_score_adj to -250.
>
> Under a 4G memory limit, it gets a 25% bonus... a 1G memory bonus for avoiding OOM.
> Under an 8G memory limit, it gets a 25% bonus... a 2G memory bonus for avoiding OOM.
>
> Then, what bonus a task can get depends on the context of the OOM. If you use
> oom_score_adj and want to give a bonus to a task, setting it with regard to the
> minimum memory limit which the task is under will work well.

Very nice, and the "bonus" there is what the task can safely use in
comparison to any other task competing for the same resources without
getting selected itself because of that memory.

2010-08-30 02:58:27

by KOSAKI Motohiro

Subject: Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization from oom_badness()

> On Wed, 25 Aug 2010, KOSAKI Motohiro wrote:
>
> > Current oom_score_adj is completely broken because it is strongly bound to
> > the Google use case and ignores all others.
> >
>
> That's wrong, we don't even use this heuristic yet and there is nothing,
> in any way, that is specific to Google.

Please show us some evidence. A big mouth is not a good way to persuade us.
I requested that you "COMMUNICATE WITH REAL WORLD USERS"; have you really done this?

>
> > 1) Priority inversion
> > As kamezawa-san pointed out, this breaks cgroup and lxc environments.
> > He said,
> > > Assume 2 processes A and B which have oom_score_adj of 300 and 0,
> > > and A uses 200M, B uses 1G of memory under a 4G system.
> > >
> > > Under the system:
> > > A's score = (200M * 1000)/4G + 300 = 350
> > > B's score = (1G * 1000)/4G = 250.
> > >
> > > In the cpuset, it has 2G of memory.
> > > A's score = (200M * 1000)/2G + 300 = 400
> > > B's score = (1G * 1000)/2G = 500
> > >
> > > This priority inversion doesn't happen in the current system.
> >
>
> You continually bring this up, and I've answered it three times, but
> you've never responded to it before and completely ignore it.

Yes, I ignored it. Don't talk about your dream; I hope to see a concrete use case.
As I repeatedly said, I don't care about you while you ignore real-world end users.
NOBODY EXCEPT STABILIZATION DEVELOPERS IS KIND TO END USERS, AND THAT IS
HARMFUL. WE HAVE NO MERCY WHILE YOU CONTINUE THIS IMMORAL DEVELOPMENT.

I'm waiting one more day. Pray that anyone joins this discussion and
explains real use instead of you. We don't ignore end users. But if nobody
except you responds to this, then even though I don't care about your response,
I will definitely send this patch to mainline.



> I really
> hope and expect that you'll participate more in the development process
> and not continue to reiterate your talking points when you have no answer
> to my response.
>
> You're wrong, especially with regard to cpusets, which was formerly part
> of the heuristic itself.
>
> Users bind an aggregate of tasks to a cgroup (cpusets or memcg) as a means
> of isolation and attach a set of resources (memory, in this case) for
> those tasks to use. The user who does this is fully aware of the set of
> tasks being bound, there is no mystery or unexpected results when doing
> so. So when you set an oom_score_adj for a task, you don't necessarily
> need to be aware of the set of resources it has available, which is
> dynamic and an attribute of the system or cgroup, but rather the priority
> of that task in competition with other tasks for the same resources.

That is YOUR policy. The problem is that IT'S NOT GENERIC.

>
> _That_ is what is important in having a userspace influence on a badness
> heuristic: how those badness scores compare relative to other tasks that
> share the same resources. That's how a task is chosen for oom kill, not
> because of a static formula such as you're introducing here that outputs a
> value (and, thus, a priority) regardless of the context in which the task
> is bound.
>
> That also means that the same task is not necessarily killed in a
> cpuset-constrained oom compared to a system-wide oom. If you bias a task
> by 30% of available memory, which Kame did in his example above, it's
> entirely plausible that task A should be killed because its actual usage
> is only 1/20th of the machine. When its cpuset is oom, and the admin has
> specifically bound that task to only 2G of memory, we'd naturally want to
> kill the memory hogger, that is using 50% of the total memory available to
> it.

I agree your implementation works fine if admins have the same policy
as you. But I object to your assumption that you can change all admins in the world.



> > 2) Ratio-based points don't work on large machines
> > oom_score_adj normalizes the oom score to the 0-1000 range,
> > but if the machine has 1TB of memory, 1 point (i.e. 0.1%) means
> > 1GB. That is not suitable for a tuning parameter.
> > As I said, a proportion-oriented tuning parameter has a
> > scalability risk.
> >
>
> So you'd rather use the range of oom_adj from -17 to +15 instead of
> oom_score_adj from -1000 to +1000 where each point is 68GB? I don't
> understand your point here as to why oom_score_adj is worse.

No. As I said,
- If you want to solve a minority issue, you have to avoid any regression
for the majority of users.
- If you want to solve a major issue and make a big change, investigate
worldwide use cases carefully and reflect them.

It was pointed out that oom_score_adj overlooks a lot of use cases, so I
request either 1) remaking it all, or 2) killing the existing code change.



> But, yes, in reality we don't really care about the granularity so much
> that we need to prioritize a task using 512MB more memory than another to
> break the tie on a 1TB machine, 1/2048th of its memory.
>
> > 3) No reason to implement ABI breakage.
> > The old tuning parameter means:
> > oom-score = oom-base-score x 2^oom_adj
>
> Everybody knows this is useless beyond polarizing a task for kill or
> making it immune.
>
> > The new tuning parameter means:
> > oom-score = oom-base-score + oom_score_adj / (totalram + totalswap)
>
> This, on the other hand, has an actual unit (proportion of available
> memory) that can be used to prioritize tasks amongst those competing for
> the same set of shared resources and remains constant even when a task
> changes cpuset, its memcg limit changes, etc.
>
> And your equation is wrong, it's
>
> ((rss + swap) / (available ram + swap)) + oom_score_adj
>
> which is completely different from what you think it is.

Your equation can be rearranged as

(rss + swap) + oom_score_adj x (available ram + swap)
-----------------------------------------------------------
(available ram + swap)

That said, the same oom_score_adj can be calculated.



> > but "oom_score_adj / (totalram + totalswap)" can be calculated in
> > userland too, because both totalram and totalswap are exposed via
> > /proc. So there is no reason to introduce a funny new equation.
> >
>
> Yup, it definitely can, which is why as I mentioned to Kame (who doesn't
> have strong feelings either way, even though you quote him as having these
> strong objections) that you can easily convert oom_score_adj into a
> stand-alone memory quantity (biasing or forgiving 512MB of a task's
> memory, for example) in the context it is currently attached to with
> simple arithmetic in userspace. That's why oom_score_adj is powerful.

I already said I disagree.


>
> > 4) totalram-based normalization assumes a flat memory model.
> > For example, on an asymmetric NUMA machine, fat-node memory and thin-node
> > memory might have different weights.
> > In other words, totalram-based priority is just one policy. A fixed,
> > workload-dependent policy probably shouldn't be embedded in the kernel.
> >
>
> I don't know what this means, and this was your criticism before I changed
> the denominator during the revision of the patchset, so it's probably
> obsolete. oom_score_adj always operates based on the proportion of
> memory available to the application which is how the new oom killer
> determines which tasks to kill: relative to the importance (if defined by
> userspace) and memory usage compared to other tasks competing for it.

I already explained the asymmetric NUMA issue in the past. Again, don't assume
your policy and your machine if you want to change kernel core code.



>
> > Then, this patch removes the *UGLY* total_pages stuff completely. Googlers
> > can calculate it in userland!
> >
>
> Nothing specific about any of this to Google. Users who actually setup
> their machines to use mempolicies, cpusets, or memcgs actually do want a
> powerful interface from userspace to tune the priorities in terms of both
> business goals and also importance of the task. That is done much more
> powerfully now with oom_score_adj than the previous implementation. Users
> who don't use these cgroups, especially desktop users, can see
> oom_score_adj in terms of a memory quantity that remains static: they
> aren't going to encounter changing memcg limits, cpuset mems, etc.
>
> That said, I really don't know why you keep mentioning "Google this" and
> "Google that" when the company I'm working for is really irrelevant to
> this discussion.

Please don't talk about your imagination. You have to talk about concrete use cases.



> With that, I respectfully nack your patch.

Sorry, I don't care about this. Please fix yourself.

Thanks.