the victim should not directly access hardware devices like Xorg server,
because the hardware could be left in an unpredictable state, although
user-application can set /proc/pid/oom_score_adj to protect it. so i think
those processes should get 3% bonus for protection.
Signed-off-by: Figo.zhang <[email protected]>
---
mm/oom_kill.c | 8 +++++---
1 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4029583..df6a9da 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -195,10 +195,12 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
task_unlock(p);
/*
- * Root processes get 3% bonus, just like the __vm_enough_memory()
- * implementation used by LSMs.
+ * Root and direct hardware access processes get 3% bonus, just like the
+ * __vm_enough_memory() implementation used by LSMs.
*/
- if (has_capability_noaudit(p, CAP_SYS_ADMIN))
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+ has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
+ has_capability_noaudit(p, CAP_SYS_RAWIO))
points -= 30;
/*
On Tue, 2 Nov 2010, Figo.zhang wrote:
> the victim should not directly access hardware devices like Xorg server,
> because the hardware could be left in an unpredictable state, although
> user-application can set /proc/pid/oom_score_adj to protect it. so i think
> those processes should get 3% bonus for protection.
>
Which applications are you referring to that cannot gracefully exit if
killed?
> Signed-off-by: Figo.zhang <[email protected]>
> ---
> mm/oom_kill.c | 8 +++++---
> 1 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4029583..df6a9da 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -195,10 +195,12 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> task_unlock(p);
>
> /*
> - * Root processes get 3% bonus, just like the __vm_enough_memory()
> - * implementation used by LSMs.
> + * Root and direct hardware access processes get 3% bonus, just like the
> + * __vm_enough_memory() implementation used by LSMs.
LSM's have this bonus for CAP_SYS_ADMIN, but not for CAP_SYS_RAWIO, so
this comment is incorrect.
> */
> - if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> + has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
> + has_capability_noaudit(p, CAP_SYS_RAWIO))
> points -= 30;
>
> /*
CAP_SYS_RAWIO had a much more dramatic impact in the previous heuristic to
such a point that it would often allow memory hogging tasks to elude the
oom killer at the expense of innocent tasks. I'm not sure this is the
best way to go.
>
> Which applications are you referring to that cannot gracefully exit if
> killed?
like Xorg server, if xorg server be killed, the gnome desktop will be
crashed.
>
> CAP_SYS_RAWIO had a much more dramatic impact in the previous heuristic to
> such a point that it would often allow memory hogging tasks to elude the
> oom killer at the expense of innocent tasks. I'm not sure this is the
> best way to go.
is it some experiments for demonstration the CAP_SYS_RAWIO will elude
the oom killer?
On Tue, 2 Nov 2010, Figo.zhang wrote:
> > Which applications are you referring to that cannot gracefully exit if
> > killed?
>
> like Xorg server, if xorg server be killed, the gnome desktop will be
> crashed.
>
Right, but you didn't explicitly prohibit such applications from being
killed, so that suggests that doing so may be inconvenient but doesn't
incur something like corruption or data loss, which is what I would
consider "unstable" or "inconsistent" state.
We're trying to avoid any additional heuristics from being introduced for
specific usecases, even for Xorg. That ensures that the heuristic remains
as predictable as possible and frees a large amount of memory. If Xorg is
being killed first instead of a true memory hogger, then it seems like a
forkbomb scenario instead; could you please post your kernel log so that
we can diagnose that issue seperately?
> > CAP_SYS_RAWIO had a much more dramatic impact in the previous heuristic to
> > such a point that it would often allow memory hogging tasks to elude the
> > oom killer at the expense of innocent tasks. I'm not sure this is the
> > best way to go.
>
> is it some experiments for demonstration the CAP_SYS_RAWIO will elude
> the oom killer?
>
The old heuristic would allow it to elude the oom killer because it would
divide the score by four if a task had the capability, which is a much
more drastic "bonus" than you suggest here. That would reduce the score
for the memory hogging task significantly enough that we killed tons of
innocent tasks instead before eventually killing the task that was leaking
memory but failed to be identified because it had CAP_SYS_RAWIO. I'm
trying to avoid any such repeats.
CAP_SYS_RESOURCE also had better get 3% bonus for protection.
Signed-off-by: Figo.zhang <[email protected]>
---
mm/oom_kill.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4029583..30b24b9 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -198,7 +198,8 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
* Root processes get 3% bonus, just like the __vm_enough_memory()
* implementation used by LSMs.
*/
- if (has_capability_noaudit(p, CAP_SYS_ADMIN))
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+ has_capability_noaudit(p, CAP_SYS_RESOURCE))
points -= 30;
/*
On Thu, 4 Nov 2010, Figo.zhang wrote:
> CAP_SYS_RESOURCE also had better get 3% bonus for protection.
>
Would you like to elaborate as to why?
> Signed-off-by: Figo.zhang <[email protected]>
> ---
> mm/oom_kill.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4029583..30b24b9 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -198,7 +198,8 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> * Root processes get 3% bonus, just like the __vm_enough_memory()
> * implementation used by LSMs.
> */
> - if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> + has_capability_noaudit(p, CAP_SYS_RESOURCE))
> points -= 30;
>
> /*
>
>
>
> On Thu, 4 Nov 2010, Figo.zhang wrote:
>
> > CAP_SYS_RESOURCE also had better get 3% bonus for protection.
> >
>
>
> Would you like to elaborate as to why?
>
>
process with CAP_SYS_RESOURCE capibility which have system resource
limits, like journaling resource on ext3/4 filesystem, RTC clock. so it
also the same treatment as process with CAP_SYS_ADMIN.
Best,
Figo.zhang
On Thu, 4 Nov 2010, Figo.zhang wrote:
> > > CAP_SYS_RESOURCE also had better get 3% bonus for protection.
> > >
> >
> >
> > Would you like to elaborate as to why?
> >
> >
>
> process with CAP_SYS_RESOURCE capibility which have system resource
> limits, like journaling resource on ext3/4 filesystem, RTC clock. so it
> also the same treatment as process with CAP_SYS_ADMIN.
>
NACK, there's no justification that these tasks should be given a 3%
memory bonus in the oom killer heuristic; in fact, since they can allocate
without limits it is more important to target these tasks if they are
using an egregious amount of memory. CAP_SYS_RESOURCE threads have the
ability to lower their own oom_score_adj values, thus, they should protect
themselves if necessary like everything else.
On Wed, 2010-11-03 at 18:50 -0700, David Rientjes wrote:
> On Thu, 4 Nov 2010, Figo.zhang wrote:
>
> > > > CAP_SYS_RESOURCE also had better get 3% bonus for protection.
> > > >
> > >
> > >
> > > Would you like to elaborate as to why?
> > >
> > >
> >
> > process with CAP_SYS_RESOURCE capibility which have system resource
> > limits, like journaling resource on ext3/4 filesystem, RTC clock. so it
> > also the same treatment as process with CAP_SYS_ADMIN.
> >
>
> NACK, there's no justification that these tasks should be given a 3%
> memory bonus in the oom killer heuristic; in fact, since they can allocate
> without limits it is more important to target these tasks if they are
> using an egregious amount of memory. CAP_SYS_RESOURCE threads have the
> ability to lower their own oom_score_adj values, thus, they should protect
> themselves if necessary like everything else.
In your new heuristic, you also get CAP_SYS_RESOURCE to protection.
see fs/proc/base.c, line 1167:
if (oom_score_adj < task->signal->oom_score_adj &&
!capable(CAP_SYS_RESOURCE)) {
err = -EACCES;
goto err_sighand;
}
so i want to protect some process like normal process not
CAP_SYS_RESOUCE, i set a small oom_score_adj , if new oom_score_adj is
small than now and it is not limited resource, it will not adjust, that
seems not right?
On Thu, 4 Nov 2010, Figo.zhang wrote:
> In your new heuristic, you also get CAP_SYS_RESOURCE to protection.
> see fs/proc/base.c, line 1167:
> if (oom_score_adj < task->signal->oom_score_adj &&
> !capable(CAP_SYS_RESOURCE)) {
> err = -EACCES;
> goto err_sighand;
> }
That's unchanged from the old behavior with oom_adj.
> so i want to protect some process like normal process not
> CAP_SYS_RESOUCE, i set a small oom_score_adj , if new oom_score_adj is
> small than now and it is not limited resource, it will not adjust, that
> seems not right?
>
Tasks without CAP_SYS_RESOURCE cannot lower their own oom_score_adj,
otherwise it can trivially kill other tasks. They can, however, increase
their own oom_score_adj so the oom killer prefers to kill it first.
I think you may be confused: CAP_SYS_RESOURCE override resource limits.
On Wed, 2010-11-03 at 19:54 -0700, David Rientjes wrote:
> On Thu, 4 Nov 2010, Figo.zhang wrote:
>
> > In your new heuristic, you also get CAP_SYS_RESOURCE to protection.
> > see fs/proc/base.c, line 1167:
> > if (oom_score_adj < task->signal->oom_score_adj &&
> > !capable(CAP_SYS_RESOURCE)) {
> > err = -EACCES;
> > goto err_sighand;
> > }
>
> That's unchanged from the old behavior with oom_adj.
>
> > so i want to protect some process like normal process not
> > CAP_SYS_RESOUCE, i set a small oom_score_adj , if new oom_score_adj is
> > small than now and it is not limited resource, it will not adjust, that
> > seems not right?
> >
>
> Tasks without CAP_SYS_RESOURCE cannot lower their own oom_score_adj,
CAP_SYS_RESOURCE == 1 means without resource limits just like a
superuser,
CAP_SYS_RESOURCE == 0 means hold resource limits, like normal user,
right?
a new lower oom_score_adj will protect the process, right?
Tasks without CAP_SYS_RESOURCE, means that it is not a superuser, why
user canot protect it by oom_score_adj?
like i want to protect my program such as gnome-terminal which is
without CAP_SYS_RESOURCE (have resource limits),
[figo@myhost ~]$ ps -ax | grep gnome-ter
Warning: bad ps syntax, perhaps a bogus '-'? See
http://procps.sf.net/faq.html
2280 ? Sl 0:01 gnome-terminal
8839 pts/0 S+ 0:00 grep gnome-ter
[figo@myhost ~]$ cat /proc/2280/oom_adj
3
[figo@myhost ~]$ echo -17 > /proc/2280/oom_adj
bash: echo: write error: Permission denied
[figo@myhost ~]$
so, i canot protect my program.
> otherwise it can trivially kill other tasks. They can, however, increase
> their own oom_score_adj so the oom killer prefers to kill it first.
>
> I think you may be confused: CAP_SYS_RESOURCE override resource limits.
On Thu, 4 Nov 2010, Figo.zhang wrote:
> CAP_SYS_RESOURCE == 1 means without resource limits just like a
> superuser,
> CAP_SYS_RESOURCE == 0 means hold resource limits, like normal user,
> right?
>
Yes.
> a new lower oom_score_adj will protect the process, right?
>
Yes.
> Tasks without CAP_SYS_RESOURCE, means that it is not a superuser, why
> user canot protect it by oom_score_adj?
>
Because, as I said, it would be trivial for a user program to deplete all
memory (either intentionally or unintentioally) and cause every other task
on the system to be oom killed as a result. That's an undesired result of
a blatently obvious DoS.
> like i want to protect my program such as gnome-terminal which is
> without CAP_SYS_RESOURCE (have resource limits),
>
> [figo@myhost ~]$ ps -ax | grep gnome-ter
> Warning: bad ps syntax, perhaps a bogus '-'? See
> http://procps.sf.net/faq.html
> 2280 ? Sl 0:01 gnome-terminal
> 8839 pts/0 S+ 0:00 grep gnome-ter
> [figo@myhost ~]$ cat /proc/2280/oom_adj
> 3
> [figo@myhost ~]$ echo -17 > /proc/2280/oom_adj
> bash: echo: write error: Permission denied
> [figo@myhost ~]$
>
> so, i canot protect my program.
>
If this is your system, you can either give yourself CAP_SYS_RESOURCE or
do it through the superuser. This isn't exactly new, it's been the case
for the past four years.
I'm still struggling to find out the problem that you're trying to address
with your various patches, perhaps because you haven't said what it is.
>
> the victim should not directly access hardware devices like Xorg server,
> because the hardware could be left in an unpredictable state, although
> user-application can set /proc/pid/oom_score_adj to protect it. so i think
> those processes should get 3% bonus for protection.
>
> Signed-off-by: Figo.zhang <[email protected]>
I was surprised this issue is still there. This was pointed out half year
ago already :-/
> ---
> mm/oom_kill.c | 8 +++++---
> 1 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4029583..df6a9da 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -195,10 +195,12 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> task_unlock(p);
>
> /*
> - * Root processes get 3% bonus, just like the __vm_enough_memory()
> - * implementation used by LSMs.
> + * Root and direct hardware access processes get 3% bonus, just like the
> + * __vm_enough_memory() implementation used by LSMs.
> */
This comment is incorrect. LSM is care only CAP_SYS_ADMIN.
> - if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> + has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
> + has_capability_noaudit(p, CAP_SYS_RAWIO))
> points -= 30;
But yes. OOM need to care both CAP_SYS_RESOURCE and CAP_SYS_RAWIO.
Reviewed-by: KOSAKI Motohiro <[email protected]>
> On Thu, 4 Nov 2010, Figo.zhang wrote:
>
> > > > CAP_SYS_RESOURCE also had better get 3% bonus for protection.
> > > >
> > >
> > >
> > > Would you like to elaborate as to why?
> > >
> > >
> >
> > process with CAP_SYS_RESOURCE capibility which have system resource
> > limits, like journaling resource on ext3/4 filesystem, RTC clock. so it
> > also the same treatment as process with CAP_SYS_ADMIN.
> >
>
> NACK, there's no justification that these tasks should be given a 3%
> memory bonus in the oom killer heuristic; in fact, since they can allocate
> without limits it is more important to target these tasks if they are
> using an egregious amount of memory. CAP_SYS_RESOURCE threads have the
> ability to lower their own oom_score_adj values, thus, they should protect
> themselves if necessary like everything else.
David, Stupid are YOU. you removed CAP_SYS_RESOURCE condition with ZERO
explanation and Figo reported a regression. That's enough the reason to
undo. YOU have a guilty to explain why do you want to change and why
do you think it has justification.
Don't blame bug reporter. That's completely wrong.
the victim should not directly access hardware devices like Xorg server,
because the hardware could be left in an unpredictable state, although
user-application can set /proc/pid/oom_score_adj to protect it. so i think
those processes should get 3% bonus for protection.
in v2, fix the incorrect comment.
Signed-off-by: Figo.zhang <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4029583..9b06f56 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -196,9 +196,12 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
/*
* Root processes get 3% bonus, just like the __vm_enough_memory()
- * implementation used by LSMs.
+ * implementation used by LSMs. And direct hardware access processes
+ * also get 3% bonus.
*/
- if (has_capability_noaudit(p, CAP_SYS_ADMIN))
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+ has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
+ has_capability_noaudit(p, CAP_SYS_RAWIO))
points -= 30;
/*
> > > process with CAP_SYS_RESOURCE capibility which have system resource
> > > limits, like journaling resource on ext3/4 filesystem, RTC clock. so it
> > > also the same treatment as process with CAP_SYS_ADMIN.
> > >
> >
> > NACK, there's no justification that these tasks should be given a 3%
> > memory bonus in the oom killer heuristic; in fact, since they can allocate
> > without limits it is more important to target these tasks if they are
> > using an egregious amount of memory.
>
> David, Stupid are YOU. you removed CAP_SYS_RESOURCE condition with ZERO
> explanation and Figo reported a regression. That's enough the reason to
> undo. YOU have a guilty to explain why do you want to change and why
> do you think it has justification.
>
> Don't blame bug reporter. That's completely wrong.
Can people stop throwing things at each other and worry about the facts
- If it's a regression it should get reverted or fixed. But is it
actually a regression ? Has the underlying behaviour changed in a
problematic way?
"CAP_SYS_RESOURCE threads have the ability to lower their own oom_score_adj
values, thus, they should protect themselves if necessary like
everything else."
The reverse can be argued equally - that they can unprotect themselves if
necessary. In fact it seems to be a "point of view" sort of question
which way you deal with CAP_SYS_RESOURCE, and that to me argues that
changing from old expected behaviour to a new behaviour is a regression.
On Tue, 9 Nov 2010, Alan Cox wrote:
> The reverse can be argued equally - that they can unprotect themselves if
> necessary. In fact it seems to be a "point of view" sort of question
> which way you deal with CAP_SYS_RESOURCE, and that to me argues that
> changing from old expected behaviour to a new behaviour is a regression.
>
I didn't check earlier, but CAP_SYS_RESOURCE hasn't had a place in the oom
killer's heuristic in over five years, so what regression are we referring
to in this thread? These tasks already have full control over
oom_score_adj to modify its oom killing priority in either direction.
And, as I said, giving these threads a bonus to be less preferred doesn't
seem appropriate since (1) it's not a defined or expected behavior of
CAP_SYS_RESOURCE like it is for sysadmin tasks, and (2) these threads are
not bound by resource limits and thus have a higher liklihood of consuming
larger amounts of memory.
That's why I nack'd the patch in the first place and still do, there's no
regression here and it's not in the best interest of freeing a large
amount of memory which is the sole purpose of the oom killer.
Futhermore, the heuristic was entirely rewritten, but I wouldn't consider
all the old factors such as cputime and nice level being removed as
"regressions" since the aim was to make it more predictable and more
likely to kill a large consumer of memory such that we don't have to kill
more tasks in the near future.
On Tue, 9 Nov 2010, Figo.zhang wrote:
>
> the victim should not directly access hardware devices like Xorg server,
> because the hardware could be left in an unpredictable state, although
> user-application can set /proc/pid/oom_score_adj to protect it. so i think
> those processes should get 3% bonus for protection.
>
The logic here is wrong: if killing these tasks can leave hardware in an
unpredictable state (and that state is presumably harmful), then they
should be completely immune from oom killing since you're still leaving
them exposed here to be killed.
So the question that needs to be answered is: why do these threads deserve
to use 3% more memory (not >4%) than others without getting killed? If
there was some evidence that these threads have a certain quantity of
memory they require as a fundamental attribute of CAP_SYS_RAWIO, then I
have no objection, but that's going to be expressed in a memory quantity
not a percentage as you have here.
The CAP_SYS_ADMIN heuristic has a background: it is used in the oom killer
because we have used the same 3% in __vm_enough_memory() for a long time
and we want consistency amongst the heuristics. Adding additional bonuses
with arbitrary values like 3% of memory for things like CAP_SYS_RAWIO
makes the heuristic less predictable and moves us back toward the old
heuristic which was almost entirely arbitrary.
Now before KOSAKI-san comes out and says the old heuristic considered
CAP_SYS_RAWIO and the new one does not so it _must_ be a regression: the
old heuristic also divided the badness score by 4 for that capability as a
completely arbitrary value (just like 3% is here). Other traits like
runtime and nice levels were also removed from the heuristic. What needs
to be shown is that CAP_SYS_RAWIO requires additional memory just to run
or we should neglect to free 3% of memory, which could be gigabytes,
because it has this trait.
On Tue, 9 Nov 2010, David Rientjes wrote:
> I didn't check earlier, but CAP_SYS_RESOURCE hasn't had a place in the oom
> killer's heuristic in over five years, so what regression are we referring
> to in this thread? These tasks already have full control over
> oom_score_adj to modify its oom killing priority in either direction.
>
Yes, CAP_SYS_RESOURCE was a part of the heuristic in 2.6.25 along with
CAP_SYS_ADMIN and was removed with the rewrite; when I said it "hasn't had
a place in the oom killer's heuristic," I meant it's an unnecessary
extention to CAP_SYS_ADMIN and allows for killing innocent tasks when a
CAP_SYS_RESOURCE task is using too much memory.
The fundamental issue here is whether or not we should give a bonus to
CAP_SYS_RESOURCE tasks because they are, by definition, allowed to access
extra resources and we're willing to sacrifice other tasks for that. This
is antagonist to the oom killer's sole goal, however, which is to kill the
task consuming the largest amount of memory unless protected by userspace
(which CAP_SYS_RESOURCE has completely control in doing).
Since these threads have complete ability to give themselves this bonus
(echo -30 > /proc/self/oom_score_adj), I don't think this needs to be a
part of the core heuristic nor with such an arbitrary value of 3% (the old
heuristic divided its badness score by 4, another arbitrary value).
On Tue, 2010-11-09 at 13:06 -0800, David Rientjes wrote:
> On Tue, 9 Nov 2010, Alan Cox wrote:
>
> > The reverse can be argued equally - that they can unprotect themselves if
> > necessary. In fact it seems to be a "point of view" sort of question
> > which way you deal with CAP_SYS_RESOURCE, and that to me argues that
> > changing from old expected behaviour to a new behaviour is a regression.
> >
>
> I didn't check earlier, but CAP_SYS_RESOURCE hasn't had a place in the oom
> killer's heuristic in over five years, so what regression are we referring
> to in this thread? These tasks already have full control over
> oom_score_adj to modify its oom killing priority in either direction.
yes, it can control by user, but is it all system administrators will
adjust all of the processes by each one and one in real word? suppose if
it has thousands of processes in database system.
> Futhermore, the heuristic was entirely rewritten, but I wouldn't consider
> all the old factors such as cputime and nice level being removed as
> "regressions" since the aim was to make it more predictable and more
> likely to kill a large consumer of memory such that we don't have to kill
> more tasks in the near future.
the goal of oom_killer is to find out the best process to kill, the one
should be:
1. it is a most memory comsuming process in all processes
2. and it was a proper process to kill, which will not be let system
into unpredictable state as possible.
if a user process and a process such email cleint "evolution" with
ditecly hareware access such as "Xorg", they have eat the equal memory,
so which process are you want to kill?
On Tue, 2010-11-09 at 13:16 -0800, David Rientjes wrote:
> On Tue, 9 Nov 2010, Figo.zhang wrote:
>
> >
> > the victim should not directly access hardware devices like Xorg server,
> > because the hardware could be left in an unpredictable state, although
> > user-application can set /proc/pid/oom_score_adj to protect it. so i think
> > those processes should get 3% bonus for protection.
> >
>
> The logic here is wrong: if killing these tasks can leave hardware in an
> unpredictable state (and that state is presumably harmful), then they
> should be completely immune from oom killing since you're still leaving
> them exposed here to be killed.
we let the processes with hardware access get bonus for protection. the
goal is not select them to be killed as possible.
>
> So the question that needs to be answered is: why do these threads deserve
> to use 3% more memory (not >4%) than others without getting killed? If
> there was some evidence that these threads have a certain quantity of
> memory they require as a fundamental attribute of CAP_SYS_RAWIO, then I
> have no objection, but that's going to be expressed in a memory quantity
> not a percentage as you have here.
>
> The CAP_SYS_ADMIN heuristic has a background: it is used in the oom killer
> because we have used the same 3% in __vm_enough_memory() for a long time
> and we want consistency amongst the heuristics. Adding additional bonuses
> with arbitrary values like 3% of memory for things like CAP_SYS_RAWIO
> makes the heuristic less predictable and moves us back toward the old
> heuristic which was almost entirely arbitrary.
yes, i think it is be better those processes which be protection maybe
divided the badness score by 4, like old heuristic.
the victim should not directly access hardware devices like Xorg server,
because the hardware could be left in an unpredictable state, although
user-application can set /proc/pid/oom_score_adj to protect it. so i think
those processes should get bonus for protection.
in v2, fix the incorrect comment.
in v3, change the divided the badness score by 4, like old heuristic for protection. we just
want the oom_killer don't select Root/RESOURCE/RAWIO process as possible.
suppose that if a user process A such as email cleint "evolution" and a process B with
ditecly hareware access such as "Xorg", they have eat the equal memory (the badness score is
the same),so which process are you want to kill? so in new heuristic, it will kill the process B.
but in reality, we want to kill process A.
Signed-off-by: Figo.zhang <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
mm/oom_kill.c | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4029583..f43d759 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -202,6 +202,15 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
points -= 30;
/*
+ * Root and direct hareware access processor are usually more
+ * important, so them should get bonus for protection.
+ */
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+ has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
+ has_capability_noaudit(p, CAP_SYS_RAWIO))
+ points /= 4;
+
+ /*
* /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
* either completely disable oom killing or always prefer a certain
* task.
the victim should not directly access hardware devices like Xorg server,
because the hardware could be left in an unpredictable state, although
user-application can set /proc/pid/oom_score_adj to protect it. so i think
those processes should get bonus for protection.
in v2, fix the incorrect comment.
in v3, change the divided the badness score by 4, like old heuristic for protection. we just
want the oom_killer don't select Root/RESOURCE/RAWIO process as possible.
suppose that if a user process A such as email cleint "evolution" and a process B with
ditecly hareware access such as "Xorg", they have eat the equal memory (the badness score is
the same),so which process are you want to kill? so in new heuristic, it will kill the process B.
but in reality, we want to kill process A.
Signed-off-by: Figo.zhang <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4029583..f43d759 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -202,6 +202,15 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
points -= 30;
/*
+ * Root and direct hareware access processes are usually more
+ * important, so they should get bonus for protection.
+ */
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+ has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
+ has_capability_noaudit(p, CAP_SYS_RAWIO))
+ points /= 4;
+
+ /*
* /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
* either completely disable oom killing or always prefer a certain
* task.
On Wed, 10 Nov 2010, Figo.zhang wrote:
> > I didn't check earlier, but CAP_SYS_RESOURCE hasn't had a place in the oom
> > killer's heuristic in over five years, so what regression are we referring
> > to in this thread? These tasks already have full control over
> > oom_score_adj to modify its oom killing priority in either direction.
>
> yes, it can control by user, but is it all system administrators will
> adjust all of the processes by each one and one in real word? suppose if
> it has thousands of processes in database system.
>
Yes, the kernel can't possibly know the oom killing priorities of your
task so if you have such requirements then you must use the userspace
tunable.
> > Futhermore, the heuristic was entirely rewritten, but I wouldn't consider
> > all the old factors such as cputime and nice level being removed as
> > "regressions" since the aim was to make it more predictable and more
> > likely to kill a large consumer of memory such that we don't have to kill
> > more tasks in the near future.
>
> the goal of oom_killer is to find out the best process to kill, the one
> should be:
> 1. it is a most memory comsuming process in all processes
> 2. and it was a proper process to kill, which will not be let system
> into unpredictable state as possible.
>
There are four types of tasks that are improper to kill and this is
relatively unchanged in the past five years of the oom killer:
- init,
- kthreads,
- tasks that are bound to a disjoint set of cpuset mems or mempolicy
nodes that are not oom, and
- those disabled from oom killing by userspace.
That does not include CAP_SYS_RESOURCE, nor CAP_SYS_ADMIN. Your argument
about killing some tasks that have CAP_SYS_RESOURCE leaving hardware in an
unpredictable state isn't even addressed by your own patch, you only give
them a 3% memory bonus so they are still eligible.
As mentioned previously, for this patch to make sense, you would need to
show that CAP_SYS_RESOURCE equates to 3% of the available memory's
capacity for a task. I don't believe that evidence has been presented.
This has nothing to do with preventing these threads from being killed (at
the risk of possibly panicking the machine) since your patch doesn't do
that.
> if a user process and a process such email cleint "evolution" with
> ditecly hareware access such as "Xorg", they have eat the equal memory,
> so which process are you want to kill?
>
Both have equal oom killing priority according to the heuristic if they
are not run by root. If you would like to protect Xorg, then you need to
use the userspace tunable to protect it just like everything else does.
This is completely unchanged from the oom killer rewrite.
If you actually have a problem that you're reporting, however, it would
probably be better to show the oom killer log from that event and let us
address it instead of introducing arbitrary heuristics into something
which aims to be as predictable as possible.
On Wed, 10 Nov 2010, Figo.zhang wrote:
> the victim should not directly access hardware devices like Xorg server,
> because the hardware could be left in an unpredictable state, although
> user-application can set /proc/pid/oom_score_adj to protect it. so i think
> those processes should get bonus for protection.
>
Again, this argument doesn't work: if killing the task leaves hardware in
an unpredictable state (and that's presumably harmful), then they
shouldn't be killed at all.
Please show why CAP_SYS_RESOURCE equates to 3% additional memory for such
tasks.
CAP_SYS_RESOURCE allows those threads to override resource limits, so
these have potentially unbounded amounts of memory usage. Thus, they may
have the highest memory usage on the machine and now your patch has caused
other innocent tasks to be killed before this is actually targeted.
That's a bad result. Why do we need this type of hack in the oom killer
when these threads have the privilege to modify oom killing priorities for
all tasks on the system? Laziness, at the cost of a less predictable
heuristic?
Why aren't you doing the same change for __vm_enough_memory() for LSMs?
> in v2, fix the incorrect comment.
> in v3, change the divided the badness score by 4, like old heuristic for protection. we just
> want the oom_killer don't select Root/RESOURCE/RAWIO process as possible.
>
> suppose that if a user process A such as email cleint "evolution" and a process B with
> ditecly hareware access such as "Xorg", they have eat the equal memory (the badness score is
> the same),so which process are you want to kill? so in new heuristic, it will kill the process B.
> but in reality, we want to kill process A.
>
Then you need to protect process B accordingly and since it has
CAP_SYS_RESOURCE it can easily do that on its own or the admin can protect
Xorg.
> Signed-off-by: Figo.zhang <[email protected]>
> Reviewed-by: KOSAKI Motohiro <[email protected]>
Unless you did this in private, I didn't see KOSAKI-san's reviewed-by line
for this change and it is drastically different from what you've proposed
before.
> ---
> mm/oom_kill.c | 9 +++++++++
> 1 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4029583..f43d759 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -202,6 +202,15 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> points -= 30;
>
> /*
> + * Root and direct hareware access processes are usually more
> + * important, so they should get bonus for protection.
> + */
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> + has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
> + has_capability_noaudit(p, CAP_SYS_RAWIO))
> + points /= 4;
> +
What on earth? So now CAP_SYS_ADMIN gets a 3% bonus in the if-clause
above this, then we divide a percentage of memory use by 4? What does
that mean AT ALL?
And now you've thrown CAP_SYS_RAWIO in there without any mention in the
changelog?
Are you just trying to introduce all the old arbitrary heuristics from
before the rewrite back into the oom killer like this?
Do you actually have a log from an event where the oom killer targeted the
incorrect task?
> > the victim should not directly access hardware devices like Xorg server,
> > because the hardware could be left in an unpredictable state, although
> > user-application can set /proc/pid/oom_score_adj to protect it. so i think
> > those processes should get 3% bonus for protection.
> >
>
> The logic here is wrong: if killing these tasks can leave hardware in an
> unpredictable state (and that state is presumably harmful), then they
> should be completely immune from oom killing since you're still leaving
> them exposed here to be killed.
>
> So the question that needs to be answered is: why do these threads deserve
> to use 3% more memory (not >4%) than others without getting killed? If
> there was some evidence that these threads have a certain quantity of
> memory they require as a fundamental attribute of CAP_SYS_RAWIO, then I
> have no objection, but that's going to be expressed in a memory quantity
> not a percentage as you have here.
3% is choosed by you :-/
> The CAP_SYS_ADMIN heuristic has a background: it is used in the oom killer
> because we have used the same 3% in __vm_enough_memory() for a long time
> and we want consistency amongst the heuristics. Adding additional bonuses
> with arbitrary values like 3% of memory for things like CAP_SYS_RAWIO
> makes the heuristic less predictable and moves us back toward the old
> heuristic which was almost entirely arbitrary.
That's bogus. __vm_enough_memory() does track virtual adress space. oom-killer
doesn't. It's unrelated.
> Now before KOSAKI-san comes out and says the old heuristic considered
> CAP_SYS_RAWIO and the new one does not so it _must_ be a regression: the
> old heuristic also divided the badness score by 4 for that capability as a
> completely arbitrary value (just like 3% is here). Other traits like
> runtime and nice levels were also removed from the heuristic. What needs
> to be shown is that CAP_SYS_RAWIO requires additional memory just to run
> or we should neglect to free 3% of memory, which could be gigabytes,
> because it has this trait.
Old background is very simple and cleaner.
CAP_SYS_RESOURCE mean the process has a privilege of using more resource.
then, oom-killer gave it additonal bonus.
CAP_SYS_RAWIO mean the process has a direct hardware access privilege
(eg X.org, RDB). and then, killing it might makes system crash.
In another story, somebody doubt 4x bonus is good or not. but 3% has
the same problem.
> the victim should not directly access hardware devices like Xorg server,
> because the hardware could be left in an unpredictable state, although
> user-application can set /proc/pid/oom_score_adj to protect it. so i think
> those processes should get bonus for protection.
>
> in v2, fix the incorrect comment.
> in v3, change the divided the badness score by 4, like old heuristic for protection. we just
> want the oom_killer don't select Root/RESOURCE/RAWIO process as possible.
>
> suppose that if a user process A such as email cleint "evolution" and a process B with
> ditecly hareware access such as "Xorg", they have eat the equal memory (the badness score is
> the same),so which process are you want to kill? so in new heuristic, it will kill the process B.
> but in reality, we want to kill process A.
>
> Signed-off-by: Figo.zhang <[email protected]>
> Reviewed-by: KOSAKI Motohiro <[email protected]>
Sorry for the delay. I've sent completely revert patch to linus. It will
disappear your headache, I believe. I'm sorry that our development
caused your harm. We really don't want it.
Thanks.
> ---
> mm/oom_kill.c | 9 +++++++++
> 1 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4029583..f43d759 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -202,6 +202,15 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> points -= 30;
>
> /*
> + * Root and direct hareware access processes are usually more
> + * important, so they should get bonus for protection.
> + */
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> + has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
> + has_capability_noaudit(p, CAP_SYS_RAWIO))
> + points /= 4;
> +
> + /*
> * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
> * either completely disable oom killing or always prefer a certain
> * task.
>
>
On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
> > So the question that needs to be answered is: why do these threads deserve
> > to use 3% more memory (not >4%) than others without getting killed? If
> > there was some evidence that these threads have a certain quantity of
> > memory they require as a fundamental attribute of CAP_SYS_RAWIO, then I
> > have no objection, but that's going to be expressed in a memory quantity
> > not a percentage as you have here.
>
> 3% is choosed by you :-/
>
No, 3% was chosen in __vm_enough_memory() for LSMs as the comment in the
oom killer shows:
/*
* Root processes get 3% bonus, just like the __vm_enough_memory()
* implementation used by LSMs.
*/
and is described in Documentation/filesystems/proc.txt.
I think in cases of heuristics like this where we obviously want to give
some bonus to CAP_SYS_ADMIN that there is consistency with other bonuses
given elsewhere in the kernel.
> Old background is very simple and cleaner.
>
The old heuristic divided the arbitrary badness score by 4 with
CAP_SYS_RESOURCE. The new heuristic doesn't consider it.
How is that more clean?
> CAP_SYS_RESOURCE mean the process has a privilege of using more resource.
> then, oom-killer gave it additonal bonus.
>
As a side-effect of being given more resources to allocate, those
applications are relatively unbounded in terms of memory consumption to
other tasks. Thus, it's possible that these applications are using a
massive amount of memory (say, 75%) and now with the proposed change a
task using 25% of memory would be killed instead. This increases the
liklihood that the CAP_SYS_RESOURCE thread will have to be killed
eventually, anyway, and the goal is to kill as few tasks as possible to
free sufficient amount of memory.
Since threads having CAP_SYS_RESOURCE have full control over their
oom_score_adj, they can take the additional precautions to protect
themselves if necessary. It doesn't need to be a part of the heuristic to
bias these tasks which will lead to the undesired result described above
by default rather than intentionally from userspace.
> CAP_SYS_RAWIO mean the process has a direct hardware access privilege
> (eg X.org, RDB). and then, killing it might makes system crash.
>
Then you would want to explicitly filter these tasks from oom kill just as
OOM_SCORE_ADJ_MIN works rather than giving them a memory quantity bonus.
On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
> > the victim should not directly access hardware devices like Xorg server,
> > because the hardware could be left in an unpredictable state, although
> > user-application can set /proc/pid/oom_score_adj to protect it. so i think
> > those processes should get bonus for protection.
> >
> > in v2, fix the incorrect comment.
> > in v3, change the divided the badness score by 4, like old heuristic for protection. we just
> > want the oom_killer don't select Root/RESOURCE/RAWIO process as possible.
> >
> > suppose that if a user process A such as email cleint "evolution" and a process B with
> > ditecly hareware access such as "Xorg", they have eat the equal memory (the badness score is
> > the same),so which process are you want to kill? so in new heuristic, it will kill the process B.
> > but in reality, we want to kill process A.
> >
> > Signed-off-by: Figo.zhang <[email protected]>
> > Reviewed-by: KOSAKI Motohiro <[email protected]>
>
> Sorry for the delay. I've sent completely revert patch to linus. It will
> disappear your headache, I believe. I'm sorry that our development
> caused your harm. We really don't want it.
>
Oh please, your dramatics are getting better and better.
Figo.zhang never described a problem that was being addressed but rather
proposed several different variants of a patch (some with CAP_SYS_ADMIN,
some with CAP_SYS_RESOURCE, some with CAP_SYS_RAWIO, some with a
combination, some with a 3% bonus, some with a order-of-2 bonus, etc) to
return the same heuristic used in the old oom killer. I asked several
times to show the oom killer log from the problematic behavior and none
were presented.
> On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
>
> > > So the question that needs to be answered is: why do these threads deserve
> > > to use 3% more memory (not >4%) than others without getting killed? If
> > > there was some evidence that these threads have a certain quantity of
> > > memory they require as a fundamental attribute of CAP_SYS_RAWIO, then I
> > > have no objection, but that's going to be expressed in a memory quantity
> > > not a percentage as you have here.
> >
> > 3% is choosed by you :-/
> >
>
> No, 3% was chosen in __vm_enough_memory() for LSMs as the comment in the
> oom killer shows:
>
> /*
> * Root processes get 3% bonus, just like the __vm_enough_memory()
> * implementation used by LSMs.
> */
>
> and is described in Documentation/filesystems/proc.txt.
>
> I think in cases of heuristics like this where we obviously want to give
> some bonus to CAP_SYS_ADMIN that there is consistency with other bonuses
> given elsewhere in the kernel.
Keep comparision apple to apple. vm_enough_memory() account _virtual_ memory.
oom-killer try to free _physical_ memory. It's unrelated.
>
> > Old background is very simple and cleaner.
> >
>
> The old heuristic divided the arbitrary badness score by 4 with
> CAP_SYS_RESOURCE. The new heuristic doesn't consider it.
>
> How is that more clean?
>
> > CAP_SYS_RESOURCE mean the process has a privilege of using more resource.
> > then, oom-killer gave it additonal bonus.
> >
>
> As a side-effect of being given more resources to allocate, those
> applications are relatively unbounded in terms of memory consumption to
> other tasks. Thus, it's possible that these applications are using a
> massive amount of memory (say, 75%) and now with the proposed change a
> task using 25% of memory would be killed instead. This increases the
> liklihood that the CAP_SYS_RESOURCE thread will have to be killed
> eventually, anyway, and the goal is to kill as few tasks as possible to
> free sufficient amount of memory.
You are talking two difference at once. 3% vs 4x and CAP_SYS_RESOURCE and
CAP_SYS_ADMIN.
Please keep comparing apple to apple.
>
> Since threads having CAP_SYS_RESOURCE have full control over their
> oom_score_adj, they can take the additional precautions to protect
> themselves if necessary. It doesn't need to be a part of the heuristic to
> bias these tasks which will lead to the undesired result described above
> by default rather than intentionally from userspace.
>
> > CAP_SYS_RAWIO mean the process has a direct hardware access privilege
> > (eg X.org, RDB). and then, killing it might makes system crash.
> >
>
> Then you would want to explicitly filter these tasks from oom kill just as
> OOM_SCORE_ADJ_MIN works rather than giving them a memory quantity bonus.
No. Why does userland recover your mistake?
>Nothing to say, really. Seems each time we're told about a bug or a
>regression, David either fixes the bug or points out why it wasn't a
>bug or why it wasn't a regression or how it was a deliberate behaviour
>change for the better.
>I just haven't seen any solid reason to be concerned about the state of
>the current oom-killer, sorry.
>I'm concerned that you're concerned! A lot. When someone such as
>yourself is unhappy with part of MM then I sit up and pay attention.
>But after all this time I simply don't understand the technical issues
>which you're seeing here.
we just talk about oom-killer technical issues.
i am doubt that a new rewrite but the athor canot provide some evidence
and experiment result, why did you do that? what is the prominent change
for your new algorithm?
as KOSAKI Motohiro said, "you removed CAP_SYS_RESOURCE condition with
ZERO explanation".
David just said that pls use userspace tunable for protection by
oom_score_adj. but may i ask question:
1. what is your innovation for your new algorithm, the old one have the
same way for user tunable oom_adj.
2. if server like db-server/financial-server have huge import processes
(such as root/hardware access processes)want to be protection, you let
the administrator to find out which processes should be protection. you
will let the financial-server administrator huge crazy!! and lose so
many money!! ^~^
3. i see your email in LKML, you just said
"I have repeatedly said that the oom killer no longer kills KDE when run
on my desktop in the presence of a memory hogging task that was written
specifically to oom the machine."
http://thread.gmane.org/gmane.linux.kernel.mm/48998
so you just test your new oom_killer algorithm on your desktop with KDE,
so have you provide the detail how you do the test? is it do the
experiment again for anyone and got the same result as your comment ?
as KOSAKI Motohiro said, in reality word, it we makes 5-6 brain
simulation, embedded, desktop, web server,db server, hpc, finance.
Different workloads certenally makes big impact. have you do those
experiments?
i think that technology should base on experiment not on imagine.
Best,
Figo.zhang
On Mon, 15 Nov 2010, KOSAKI Motohiro wrote:
> > I think in cases of heuristics like this where we obviously want to give
> > some bonus to CAP_SYS_ADMIN that there is consistency with other bonuses
> > given elsewhere in the kernel.
>
> Keep comparision apple to apple. vm_enough_memory() account _virtual_ memory.
> oom-killer try to free _physical_ memory. It's unrelated.
>
It's not unrelated, the LSM function gives an arbitrary 3% bonus to
CAP_SYS_ADMIN. Such threads should also be preferred in the oom killer
over other threads since they tend to be more important but not an overly
drastic bias such that they don't get killed when using an egregious
amount of memory. So in selecting a small percentage of memory that tends
to be a significant bias but not overwhelming, I went with the 3% found
elsewhere in the kernel. __vm_enough_memory() doesn't have that
preference for any scientifically calculated reason, it's a heuristic just
like oom_badness().
> > > CAP_SYS_RAWIO mean the process has a direct hardware access privilege
> > > (eg X.org, RDB). and then, killing it might makes system crash.
> > >
> >
> > Then you would want to explicitly filter these tasks from oom kill just as
> > OOM_SCORE_ADJ_MIN works rather than giving them a memory quantity bonus.
>
> No. Why does userland recover your mistake?
>
You just said killing any CAP_SYS_RAWIO task may make the system crash, so
presuming that you don't want the system to crash, you are suggesting we
should make these threads completely immune? That's never been the case
(and isn't for oom_kill_allocating_task, either), so there's no history
you can draw from to support your argument.
On Mon, 15 Nov 2010, Figo.zhang wrote:
> i am doubt that a new rewrite but the athor canot provide some evidence and
> experiment result, why did you do that? what is the prominent change for your
> new algorithm?
>
> as KOSAKI Motohiro said, "you removed CAP_SYS_RESOURCE condition with ZERO
> explanation".
>
> David just said that pls use userspace tunable for protection by
> oom_score_adj. but may i ask question:
>
> 1. what is your innovation for your new algorithm, the old one have the same
> way for user tunable oom_adj.
>
The goal was to make the oom killer heuristic as predictable as possible
and to kill the most memory-hogging task to avoid having to recall it and
needlessly kill several tasks.
The goal behind oom_score_adj vs. oom_adj was for several reasons, as
pointed out before:
- give it a unit (proportion of available memory), oom_adj had no unit,
- allow it to work on a linear scale for more control over
prioritization, oom_adj had an exponential scale,
- give it a much higher resolution so it can be fine-tuned, it works with
a granularity of 0.1% of memory (~128M on a 128G machine), and
- allow it to describe the oom killing priority of a task regardless of
its cpuset attachment, mempolicy, or memcg, or when their respective
limits change.
> 2. if server like db-server/financial-server have huge import processes (such
> as root/hardware access processes)want to be protection, you let the
> administrator to find out which processes should be protection. you
> will let the financial-server administrator huge crazy!! and lose so many
> money!! ^~^
>
You have full control over disabling a task from being considered with
oom_score_adj just like you did with oom_adj. Since oom_adj is
deprecated for two years, you can even use the old interface until then.
> 3. i see your email in LKML, you just said
> "I have repeatedly said that the oom killer no longer kills KDE when run on my
> desktop in the presence of a memory hogging task that was written specifically
> to oom the machine."
> http://thread.gmane.org/gmane.linux.kernel.mm/48998
>
> so you just test your new oom_killer algorithm on your desktop with KDE, so
> have you provide the detail how you do the test? is it do the
> experiment again for anyone and got the same result as your comment ?
>
Xorg tends to be killed less because of the change to the heuristic's
baseline, which is now based on rss and swap instead of total_vm. This is
seperate from the issues you list above, but is a benefit to the oom
killer that desktop users especially will notice. I, personally, am
interested more in the server market and that's why I looked for a more
robust userspace tunable that would still be applicable when things like
cpusets have a node added or removed.
> The goal was to make the oom killer heuristic as predictable as possible
> and to kill the most memory-hogging task to avoid having to recall it and
> needlessly kill several tasks.
Meta question - why is that a good thing. In a desktop environment it's
frequently wrong, in a server environment it is often wrong. We had this
before where people spend months fiddling with the vm and make it work
slightly differently and it suits their workload, then other workloads go
downhill. Then the cycle repeats.
> You have full control over disabling a task from being considered with
> oom_score_adj just like you did with oom_adj. Since oom_adj is
> deprecated for two years, you can even use the old interface until then.
Which changeset added it to the Documentation directory as deprecated ?
Alan
On Mon, 15 Nov 2010, Alan Cox wrote:
> > The goal was to make the oom killer heuristic as predictable as possible
> > and to kill the most memory-hogging task to avoid having to recall it and
> > needlessly kill several tasks.
>
> Meta question - why is that a good thing. In a desktop environment it's
> frequently wrong, in a server environment it is often wrong. We had this
> before where people spend months fiddling with the vm and make it work
> slightly differently and it suits their workload, then other workloads go
> downhill. Then the cycle repeats.
>
Most of the arbitrary heuristics were removed from oom_badness(), things
like nice level, runtime, CAP_SYS_RESOURCE, etc., so that we only consider
the rss and swap usage of each application in comparison to each other
when deciding which task to kill. We give root tasks a 3% bonus since
they tend to be more important to the productivity or uptime of the
machine, which did exist -- albeit with a more dramatic impact -- in the
old heursitic.
You'll find that the new heuristic always kills the task consuming the
most amount of rss unless influenced by userspace via the tunables (or
within 3% of root tasks).
We always want to kill the most memory-hogging task because it avoids
needlessly killing additional tasks when we must immediately recall the
oom killer because we continue to allocate memory. If that task happens
to be of vital importance to userspace, then the user has full control
over tuning the oom killer priorities in such circumstances.
> > You have full control over disabling a task from being considered with
> > oom_score_adj just like you did with oom_adj. Since oom_adj is
> > deprecated for two years, you can even use the old interface until then.
>
> Which changeset added it to the Documentation directory as deprecated ?
>
51b1bd2a was the actual change that deprecated it, which was a direct
follow-up to a63d83f4 which actually obsoleted it.
sorry for the delay.
> > The goal was to make the oom killer heuristic as predictable as possible
> > and to kill the most memory-hogging task to avoid having to recall it and
> > needlessly kill several tasks.
>
> Meta question - why is that a good thing. In a desktop environment it's
> frequently wrong, in a server environment it is often wrong. We had this
> before where people spend months fiddling with the vm and make it work
> slightly differently and it suits their workload, then other workloads go
> downhill. Then the cycle repeats.
>
> > You have full control over disabling a task from being considered with
> > oom_score_adj just like you did with oom_adj. Since oom_adj is
> > deprecated for two years, you can even use the old interface until then.
>
> Which changeset added it to the Documentation directory as deprecated ?
It's insufficient.
a63d83f427fbce97a6cea0db2e64b0eb8435cd10 (oom: badness heuristic rewrite)
introduced a lot of incompatibility to oom_adj and oom_score.
Theresore I would sugestted full revert and resubmit some patches which
cherry pick no pain piece.
> On Mon, 15 Nov 2010, KOSAKI Motohiro wrote:
>
> > > I think in cases of heuristics like this where we obviously want to give
> > > some bonus to CAP_SYS_ADMIN that there is consistency with other bonuses
> > > given elsewhere in the kernel.
> >
> > Keep comparision apple to apple. vm_enough_memory() account _virtual_ memory.
> > oom-killer try to free _physical_ memory. It's unrelated.
> >
>
> It's not unrelated, the LSM function gives an arbitrary 3% bonus to
> CAP_SYS_ADMIN.
Unrelated. LSM _is_ security module. and It only account virtual memory.
> Such threads should also be preferred in the oom killer
> over other threads since they tend to be more important but not an overly
> drastic bias such that they don't get killed when using an egregious
> amount of memory. So in selecting a small percentage of memory that tends
> to be a significant bias but not overwhelming, I went with the 3% found
> elsewhere in the kernel. __vm_enough_memory() doesn't have that
> preference for any scientifically calculated reason, it's a heuristic just
> like oom_badness().
__vm_enough_memory() only gurard to memory overcommiting. And it doesn't
have any recover way. We expect admin should recover their HAND. In the
other hand, oom-killer _is_ automatic recover way. It's no need admin's
hand. That's the reason why CAP_ADMIN is important or not.
> > > > CAP_SYS_RAWIO mean the process has a direct hardware access privilege
> > > > (eg X.org, RDB). and then, killing it might makes system crash.
> > > >
> > >
> > > Then you would want to explicitly filter these tasks from oom kill just as
> > > OOM_SCORE_ADJ_MIN works rather than giving them a memory quantity bonus.
> >
> > No. Why does userland recover your mistake?
> >
>
> You just said killing any CAP_SYS_RAWIO task may make the system crash, so
> presuming that you don't want the system to crash, you are suggesting we
> should make these threads completely immune? That's never been the case
> (and isn't for oom_kill_allocating_task, either), so there's no history
> you can draw from to support your argument.
No. I only require YOU have to investigate userland usecase BEFORE making
change.
On Tue, 23 Nov 2010, KOSAKI Motohiro wrote:
> > > > I think in cases of heuristics like this where we obviously want to give
> > > > some bonus to CAP_SYS_ADMIN that there is consistency with other bonuses
> > > > given elsewhere in the kernel.
> > >
> > > Keep comparision apple to apple. vm_enough_memory() account _virtual_ memory.
> > > oom-killer try to free _physical_ memory. It's unrelated.
> > >
> >
> > It's not unrelated, the LSM function gives an arbitrary 3% bonus to
> > CAP_SYS_ADMIN.
>
> Unrelated. LSM _is_ security module. and It only account virtual memory.
>
I needed a small bias for CAP_SYS_ADMIN tasks so I chose 3% since it's the
same proportion used elsewhere in the kernel and works nicely since the
badness score is now a proportion. If you'd like to propose a different
percentage or suggest removing the bias for root tasks altogether, feel
free to propose a patch. Thanks!
> On Tue, 23 Nov 2010, KOSAKI Motohiro wrote:
>
> > > > > I think in cases of heuristics like this where we obviously want to give
> > > > > some bonus to CAP_SYS_ADMIN that there is consistency with other bonuses
> > > > > given elsewhere in the kernel.
> > > >
> > > > Keep comparision apple to apple. vm_enough_memory() account _virtual_ memory.
> > > > oom-killer try to free _physical_ memory. It's unrelated.
> > > >
> > >
> > > It's not unrelated, the LSM function gives an arbitrary 3% bonus to
> > > CAP_SYS_ADMIN.
> >
> > Unrelated. LSM _is_ security module. and It only account virtual memory.
> >
>
> I needed a small bias for CAP_SYS_ADMIN tasks so I chose 3% since it's the
> same proportion used elsewhere in the kernel and works nicely since the
> badness score is now a proportion.
Why? Is this important than X?
> If you'd like to propose a different
> percentage or suggest removing the bias for root tasks altogether, feel
> free to propose a patch. Thanks!
I only need to revert bad change.
Thanks.
On Tue, 30 Nov 2010, KOSAKI Motohiro wrote:
> > I needed a small bias for CAP_SYS_ADMIN tasks so I chose 3% since it's the
> > same proportion used elsewhere in the kernel and works nicely since the
> > badness score is now a proportion.
>
> Why? Is this important than X?
>
We have always preferred to break ties between applications by not
preferring the root task over the user task in the oom killer. If you'd
like to remove this bonus for CAP_SYS_ADMIN, please propose a patch.
Thanks!