Subject: Re: [RFC] oom-kill: give the dying task a higher priority

On Tue, Jun 01, 2010 at 08:50:06AM +0900, KAMEZAWA Hiroyuki wrote:
| On Mon, 31 May 2010 10:52:27 -0300
| "Luis Claudio R. Goncalves" <[email protected]> wrote:
|
| > | If an explanation such as "accelerating all threads' priority in a process seems overkill"
| > | is given in the changelog or a comment, it's ok with me.
| >
| > If my understanding of badness() is right, I wouldn't be ashamed of saying
| > that it seems to be _a bit_ overkill. But I may be wrong in my
| > interpretation.
| >
| > While re-reading the code I noticed that in select_bad_process() we can
| > eventually bump into an already dying task, in which case we just wait for
| > the task to die and avoid killing other tasks. Maybe we could boost the
| > priority of the dying task here too.
| >
| yes, nice catch.

Here is a more complete version of the patch, boosting priority on the
three exit points of the OOM-killer. I also avoid touching the priority if
the task is already an RT task. The patch:


oom-kill: give the dying task a higher priority (v5)

In a system under heavy load it was observed that even after the
oom-killer selects a task to die, the task may take a long time to die.

Right before sending a SIGKILL to the task selected by the oom-killer,
this task has its priority increased so that it can exit() soon,
freeing memory. That is accomplished by:

	/*
	 * We give our sacrificial lamb high priority and access to
	 * all the memory it needs. That way it should be able to
	 * exit() and clear out its resources quickly...
	 */
	p->rt.time_slice = HZ;
	set_tsk_thread_flag(p, TIF_MEMDIE);

It sounds plausible to give the dying task an even higher priority so
that it is scheduled sooner and frees the desired memory. It was
suggested on LKML to use SCHED_FIFO:1, the lowest RT priority, so that
this task won't interfere with any running RT task.

If the dying task is already an RT task, leave it untouched.

Another good suggestion, implemented here, was to avoid boosting the
dying task priority in case of mem_cgroup OOM.

Signed-off-by: Luis Claudio R. Gonçalves <[email protected]>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 709aedf..67e18ca 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -52,6 +52,22 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
return 0;
}

+/*
+ * If this is a system OOM (not a memcg OOM) and the task selected to be
+ * killed is not already running at high (RT) priorities, speed up the
+ * recovery by boosting the dying task to the lowest FIFO priority.
+ * That helps with the recovery and avoids interfering with RT tasks.
+ */
+static void boost_dying_task_prio(struct task_struct *p,
+				  struct mem_cgroup *mem)
+{
+	if ((mem == NULL) && !rt_task(p)) {
+		struct sched_param param;
+		param.sched_priority = 1;
+		sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
+	}
+}
+
/**
* badness - calculate a numeric value for how bad this task has been
* @p: task struct of which task we should calculate
@@ -277,8 +293,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
* blocked waiting for another task which itself is waiting
* for memory. Is there a better alternative?
*/
- if (test_tsk_thread_flag(p, TIF_MEMDIE))
+ if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+ boost_dying_task_prio(p, mem);
return ERR_PTR(-1UL);
+ }

/*
* This is in the process of releasing memory so wait for it
@@ -291,9 +309,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
* Otherwise we could get an easy OOM deadlock.
*/
if (p->flags & PF_EXITING) {
- if (p != current)
+ if (p != current) {
+ boost_dying_task_prio(p, mem);
return ERR_PTR(-1UL);
-
+ }
chosen = p;
*ppoints = ULONG_MAX;
}
@@ -380,7 +399,8 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
* flag though it's unlikely that we select a process with CAP_SYS_RAW_IO
* set.
*/
-static void __oom_kill_task(struct task_struct *p, int verbose)
+static void __oom_kill_task(struct task_struct *p, struct mem_cgroup *mem,
+ int verbose)
{
if (is_global_init(p)) {
WARN_ON(1);
@@ -413,11 +433,11 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
*/
p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);
-
force_sig(SIGKILL, p);
+ boost_dying_task_prio(p, mem);
}

-static int oom_kill_task(struct task_struct *p)
+static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
{
/* WARNING: mm may not be dereferenced since we did not obtain its
* value from get_task_mm(p). This is OK since all we need to do is
@@ -430,7 +450,7 @@ static int oom_kill_task(struct task_struct *p)
if (!p->mm || p->signal->oom_adj == OOM_DISABLE)
return 1;

- __oom_kill_task(p, 1);
+ __oom_kill_task(p, mem, 1);

return 0;
}
@@ -449,7 +469,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
* its children or threads, just set TIF_MEMDIE so it can die quickly
*/
if (p->flags & PF_EXITING) {
- __oom_kill_task(p, 0);
+ __oom_kill_task(p, mem, 0);
return 0;
}

@@ -462,10 +482,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
continue;
if (mem && !task_in_mem_cgroup(c, mem))
continue;
- if (!oom_kill_task(c))
+ if (!oom_kill_task(c, mem))
return 0;
}
- return oom_kill_task(p);
+ return oom_kill_task(p, mem);
}

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
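
For anyone who wants to experiment with the same policy from user space,
here is a minimal sketch of the decision made by boost_dying_task_prio(),
written against the POSIX sched_setscheduler() call instead of the
in-kernel sched_setscheduler_nocheck(). The names should_boost() and
boost_pid() are made up for illustration and are not part of the patch.
The sketch ignores the SCHED_RESET_ON_FORK flag that Linux may OR into the
value returned by sched_getscheduler(), and the caller needs CAP_SYS_NICE
(or a suitable RLIMIT_RTPRIO) for the boost itself to succeed.

```c
#include <sched.h>
#include <stdbool.h>
#include <sys/types.h>

/* True for the two POSIX real-time policies. */
bool is_rt_policy(int policy)
{
	return policy == SCHED_FIFO || policy == SCHED_RR;
}

/*
 * Mirrors the test in the patch: boost only on a system-wide OOM
 * (memcg_oom == false) and only if the task is not already RT.
 */
bool should_boost(bool memcg_oom, int policy)
{
	return !memcg_oom && !is_rt_policy(policy);
}

/*
 * Hypothetical user-space analogue: move pid to SCHED_FIFO priority 1.
 * Returns 0 on success or when no boost is wanted, -1 on error.
 */
int boost_pid(pid_t pid, bool memcg_oom)
{
	struct sched_param param = { .sched_priority = 1 };
	int policy = sched_getscheduler(pid);

	if (policy < 0)
		return -1;
	if (!should_boost(memcg_oom, policy))
		return 0;
	return sched_setscheduler(pid, SCHED_FIFO, &param);
}
```

Keeping the decision in should_boost() separate from the system call makes
the rule easy to check without root privileges.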

--
[ Luis Claudio R. Goncalves Bass - Gospel - RT ]
[ Fingerprint: 4FDD B8C4 3C59 34BD 8BE9 2696 7203 D980 A448 C8F8 ]


2010-06-01 20:50:11

by David Rientjes

Subject: Re: [RFC] oom-kill: give the dying task a higher priority

On Tue, 1 Jun 2010, Luis Claudio R. Goncalves wrote:

> oom-kill: give the dying task a higher priority (v5)
>
> In a system under heavy load it was observed that even after the
> oom-killer selects a task to die, the task may take a long time to die.
>
> Right before sending a SIGKILL to the task selected by the oom-killer
> this task has its priority increased so that it can exit() soon,
> freeing memory. That is accomplished by:
>
> /*
> * We give our sacrificial lamb high priority and access to
> * all the memory it needs. That way it should be able to
> * exit() and clear out its resources quickly...
> */
> p->rt.time_slice = HZ;
> set_tsk_thread_flag(p, TIF_MEMDIE);
>
> It sounds plausible giving the dying task an even higher priority to be
> sure it will be scheduled sooner and free the desired memory. It was
> suggested on LKML using SCHED_FIFO:1, the lowest RT priority so that
> this task won't interfere with any running RT task.
>
> If the dying task is already an RT task, leave it untouched.
>
> Another good suggestion, implemented here, was to avoid boosting the
> dying task priority in case of mem_cgroup OOM.
>
> Signed-off-by: Luis Claudio R. Gonçalves <[email protected]>
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 709aedf..67e18ca 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -52,6 +52,22 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
> return 0;
> }
>
> +/*
> + * If this is a system OOM (not a memcg OOM) and the task selected to be
> + * killed is not already running at high (RT) priorities, speed up the
> + * recovery by boosting the dying task to the lowest FIFO priority.
> + * That helps with the recovery and avoids interfering with RT tasks.
> + */
> +static void boost_dying_task_prio(struct task_struct *p,
> + struct mem_cgroup *mem)
> +{
> + if ((mem == NULL) && !rt_task(p)) {
> + struct sched_param param;
> + param.sched_priority = 1;
> + sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> + }
> +}
> +
> /**
> * badness - calculate a numeric value for how bad this task has been
> * @p: task struct of which task we should calculate
> @@ -277,8 +293,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> * blocked waiting for another task which itself is waiting
> * for memory. Is there a better alternative?
> */
> - if (test_tsk_thread_flag(p, TIF_MEMDIE))
> + if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
> + boost_dying_task_prio(p, mem);
> return ERR_PTR(-1UL);
> + }
>
> /*
> * This is in the process of releasing memory so wait for it

That's unnecessary, if p already has TIF_MEMDIE set, then
boost_dying_task_prio(p) has already been called.

> @@ -291,9 +309,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> * Otherwise we could get an easy OOM deadlock.
> */
> if (p->flags & PF_EXITING) {
> - if (p != current)
> + if (p != current) {
> + boost_dying_task_prio(p, mem);
> return ERR_PTR(-1UL);
> -
> + }
> chosen = p;
> *ppoints = ULONG_MAX;
> }

This has the potential to actually make it harder to free memory if p is
waiting to acquire a writelock on mm->mmap_sem in the exit path while the
thread holding mm->mmap_sem is trying to run.

2010-06-02 13:54:10

by KOSAKI Motohiro

Subject: Re: [RFC] oom-kill: give the dying task a higher priority

> > @@ -291,9 +309,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> > * Otherwise we could get an easy OOM deadlock.
> > */
> > if (p->flags & PF_EXITING) {
> > - if (p != current)
> > + if (p != current) {
> > + boost_dying_task_prio(p, mem);
> > return ERR_PTR(-1UL);
> > -
> > + }
> > chosen = p;
> > *ppoints = ULONG_MAX;
> > }
>
> This has the potential to actually make it harder to free memory if p is
> waiting to acquire a writelock on mm->mmap_sem in the exit path while the
> thread holding mm->mmap_sem is trying to run.

If p is waiting, changing prio has no effect. It continues to wait for mmap_sem to be released.

Subject: Re: [RFC] oom-kill: give the dying task a higher priority

On Wed, Jun 02, 2010 at 10:54:01PM +0900, KOSAKI Motohiro wrote:
| > > @@ -291,9 +309,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
| > > * Otherwise we could get an easy OOM deadlock.
| > > */
| > > if (p->flags & PF_EXITING) {
| > > - if (p != current)
| > > + if (p != current) {
| > > + boost_dying_task_prio(p, mem);
| > > return ERR_PTR(-1UL);
| > > -
| > > + }
| > > chosen = p;
| > > *ppoints = ULONG_MAX;
| > > }
| >
| > This has the potential to actually make it harder to free memory if p is
| > waiting to acquire a writelock on mm->mmap_sem in the exit path while the
| > thread holding mm->mmap_sem is trying to run.
|
| If p is waiting, changing prio has no effect. It continues to wait for mmap_sem to be released.

Ok, that was not a good idea after all :)

But I understand the !rt_task(p) test is necessary to avoid decrementing
the priority of an eventual RT task selected to die. Though it may also be
a corner case in badness().

Luis
--
[ Luis Claudio R. Goncalves Bass - Gospel - RT ]
[ Fingerprint: 4FDD B8C4 3C59 34BD 8BE9 2696 7203 D980 A448 C8F8 ]

2010-06-02 21:12:09

by David Rientjes

Subject: Re: [RFC] oom-kill: give the dying task a higher priority

On Wed, 2 Jun 2010, KOSAKI Motohiro wrote:

> > > @@ -291,9 +309,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> > > * Otherwise we could get an easy OOM deadlock.
> > > */
> > > if (p->flags & PF_EXITING) {
> > > - if (p != current)
> > > + if (p != current) {
> > > + boost_dying_task_prio(p, mem);
> > > return ERR_PTR(-1UL);
> > > -
> > > + }
> > > chosen = p;
> > > *ppoints = ULONG_MAX;
> > > }
> >
> > This has the potential to actually make it harder to free memory if p is
> > waiting to acquire a writelock on mm->mmap_sem in the exit path while the
> > thread holding mm->mmap_sem is trying to run.
>
> If p is waiting, changing prio has no effect. It continues to wait for mmap_sem to be released.
>

And that can reduce the runtime of the thread holding a writelock on
mm->mmap_sem, making the exit actually take longer than without the patch
if its priority is significantly higher, especially on smaller machines.

2010-06-02 23:36:44

by KOSAKI Motohiro

Subject: Re: [RFC] oom-kill: give the dying task a higher priority

> On Wed, 2 Jun 2010, KOSAKI Motohiro wrote:
>
> > > > @@ -291,9 +309,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> > > > * Otherwise we could get an easy OOM deadlock.
> > > > */
> > > > if (p->flags & PF_EXITING) {
> > > > - if (p != current)
> > > > + if (p != current) {
> > > > + boost_dying_task_prio(p, mem);
> > > > return ERR_PTR(-1UL);
> > > > -
> > > > + }
> > > > chosen = p;
> > > > *ppoints = ULONG_MAX;
> > > > }
> > >
> > > This has the potential to actually make it harder to free memory if p is
> > > waiting to acquire a writelock on mm->mmap_sem in the exit path while the
> > > thread holding mm->mmap_sem is trying to run.
> >
> > > If p is waiting, changing prio has no effect. It continues to wait for mmap_sem to be released.
> >
>
> And that can reduce the runtime of the thread holding a writelock on
> mm->mmap_sem, making the exit actually take longer than without the patch
> if its priority is significantly higher, especially on smaller machines.

If p needs mmap_sem, p is going to sleep waiting for mmap_sem. If it
doesn't, exiting quickly is a good thing. In other words, task fairness
is not our goal when an OOM occurs.


2010-06-03 00:52:54

by Minchan Kim

Subject: Re: [RFC] oom-kill: give the dying task a higher priority

On Thu, Jun 3, 2010 at 8:36 AM, KOSAKI Motohiro
<[email protected]> wrote:
>> On Wed, 2 Jun 2010, KOSAKI Motohiro wrote:
>>
>> > > > @@ -291,9 +309,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>> > > >                  * Otherwise we could get an easy OOM deadlock.
>> > > >                  */
>> > > >                 if (p->flags & PF_EXITING) {
>> > > > -                       if (p != current)
>> > > > +                       if (p != current) {
>> > > > +                               boost_dying_task_prio(p, mem);
>> > > >                                 return ERR_PTR(-1UL);
>> > > > -
>> > > > +                       }
>> > > >                         chosen = p;
>> > > >                         *ppoints = ULONG_MAX;
>> > > >                 }
>> > >
>> > > This has the potential to actually make it harder to free memory if p is
>> > > waiting to acquire a writelock on mm->mmap_sem in the exit path while the
>> > > thread holding mm->mmap_sem is trying to run.
>> >
>> > If p is waiting, changing prio has no effect. It continues to wait for mmap_sem to be released.
>> >
>>
>> And that can reduce the runtime of the thread holding a writelock on
>> mm->mmap_sem, making the exit actually take longer than without the patch
>> if its priority is significantly higher, especially on smaller machines.
>
> If p needs mmap_sem, p is going to sleep waiting for mmap_sem. If it
> doesn't, exiting quickly is a good thing. In other words, task fairness
> is not our goal when an OOM occurs.
>

I tend to agree. I didn't agree with boosting the priority of all threads.

Task fairness vs. system hang is a trade-off: task fairness is best
effort, but a system hang is critical.
Also, we have already tried to do this:

	/*
	 * We give our sacrificial lamb high priority and access to
	 * all the memory it needs. That way it should be able to
	 * exit() and clear out its resources quickly...
	 */
	p->rt.time_slice = HZ;
	set_tsk_thread_flag(p, TIF_MEMDIE);

But I think the above code is meaningless unless p uses SCHED_RR.
So boosting to the lowest RT priority with SCHED_FIFO is what meets the
above comment's goal, I think.
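
To make that concrete: as far as I can tell, the RT scheduler tick only
consumes rt.time_slice for SCHED_RR tasks, so setting it on a SCHED_OTHER
or SCHED_FIFO task changes nothing. Below is a small user-space sketch
that encodes this and checks that 1 is the lowest SCHED_FIFO priority as
reported by sched_get_priority_min(). The helper names are hypothetical
and are not kernel API.

```c
#include <sched.h>
#include <stdbool.h>

/*
 * Assumption based on reading the RT tick handler: only SCHED_RR
 * tasks have their rt.time_slice decremented and refilled.
 */
bool rt_time_slice_applies(int policy)
{
	return policy == SCHED_RR;
}

/* True if prio is the lowest valid static priority for the policy. */
bool is_lowest_prio(int policy, int prio)
{
	int min = sched_get_priority_min(policy);

	return min >= 0 && prio == min;
}
```

On Linux the SCHED_FIFO range is 1 to 99, so is_lowest_prio(SCHED_FIFO, 1)
holds and the boosted task stays below every other RT task.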

--
Kind regards,
Minchan Kim

2010-06-03 07:51:06

by Peter Zijlstra

Subject: Re: [RFC] oom-kill: give the dying task a higher priority

On Wed, 2010-06-02 at 14:11 -0700, David Rientjes wrote:
>
> And that can reduce the runtime of the thread holding a writelock on
> mm->mmap_sem, making the exit actually take longer than without the patch
> if its priority is significantly higher, especially on smaller machines.

/me smells an inversion... on -rt we solved those ;-)
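
The inversion can be sketched with a toy model. Assume one CPU, strict
priority scheduling, and three tasks: the boosted dying task (waiter,
highest priority) blocked on a lock, an unrelated CPU hog (middle
priority), and the lock holder (lowest priority). The "inherit" switch
models priority inheritance, which is my reading of what "solved on -rt"
refers to; the thread does not spell it out. Everything here, including
the function name and the fixed priorities, is illustrative only.

```c
#include <stdbool.h>

/*
 * Toy model: returns the number of ticks until the waiter has run its
 * single tick of exit work. The waiter can only run once the holder
 * has finished holder_work ticks and released the lock.
 */
int ticks_until_waiter_exits(int hog_work, int holder_work, bool inherit)
{
	const int waiter_prio = 3, hog_prio = 2, holder_prio = 1;
	int waiter_work = 1;
	int ticks = 0;

	while (waiter_work > 0) {
		bool lock_held = holder_work > 0;
		/* With inheritance the holder runs at the waiter's priority. */
		int holder_eff = (inherit && lock_held) ? waiter_prio : holder_prio;

		ticks++;
		if (!lock_held)
			waiter_work--;	/* lock is free: the waiter outranks everyone */
		else if (hog_work > 0 && hog_prio > holder_eff)
			hog_work--;	/* the hog preempts the lock holder */
		else
			holder_work--;	/* the holder makes progress towards unlock */
	}
	return ticks;
}
```

With hog_work = 10 and holder_work = 2, the model needs 13 ticks without
inheritance and 3 ticks with it: boosting only the waiter does not help
while the lock holder is still starved by the hog.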

2010-06-03 20:32:16

by David Rientjes

Subject: Re: [RFC] oom-kill: give the dying task a higher priority

On Thu, 3 Jun 2010, Peter Zijlstra wrote:

> > And that can reduce the runtime of the thread holding a writelock on
> > mm->mmap_sem, making the exit actually take longer than without the patch
> > if its priority is significantly higher, especially on smaller machines.
>
> /me smells an inversion... on -rt we solved those ;-)
>

Right, but I don't see how increasing an oom killed task's priority to a
divine priority doesn't impact the priorities of other tasks which may be
blocking the exit of that task, namely a coredumper or holder of
mm->mmap_sem. This patch also doesn't address how it negatively impacts
the priorities of jobs running in different cpusets (although sharing the
same cpus) because one cpuset is oom.