2011-03-24 15:28:13

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"

On Tue, Mar 22, 2011 at 08:06:48PM +0900, KOSAKI Motohiro wrote:
> This reverts commit 93b43fa55088fe977503a156d1097cc2055449a2.
>
> The commit dramatically improve oom killer logic when fork-bomb
> occur. But, I've found it has nasty corner case. Now cpu cgroup
> has strange default RT runtime. It's 0! That said, if a process
> under cpu cgroup promote RT scheduling class, the process never
> run at all.
>
> Eventually, kernel may hang up when oom kill occur.
>
> The author need to resubmit it as adding knob and disabled
> by default if he really need this feature.
>
> Cc: Luis Claudio R. Goncalves <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

Just a comment below.

> ---
> mm/oom_kill.c | 27 ---------------------------
> 1 files changed, 0 insertions(+), 27 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 3100bc5..739dee4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -84,24 +84,6 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
> #endif /* CONFIG_NUMA */
>
> /*
> - * If this is a system OOM (not a memcg OOM) and the task selected to be
> - * killed is not already running at high (RT) priorities, speed up the
> - * recovery by boosting the dying task to the lowest FIFO priority.
> - * That helps with the recovery and avoids interfering with RT tasks.
> - */
> -static void boost_dying_task_prio(struct task_struct *p,
> - struct mem_cgroup *mem)
> -{
> - struct sched_param param = { .sched_priority = 1 };
> -
> - if (mem)
> - return;
> -
> - if (!rt_task(p))
> - sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> -}
> -
> -/*
> * The process p may have detached its own ->mm while exiting or through
> * use_mm(), but one or more of its subthreads may still have a valid
> * pointer. Return p, or any of its subthreads with a valid ->mm, with
> @@ -452,13 +434,6 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
> set_tsk_thread_flag(p, TIF_MEMDIE);
> force_sig(SIGKILL, p);
>
> - /*
> - * We give our sacrificial lamb high priority and access to
> - * all the memory it needs. That way it should be able to
> - * exit() and clear out its resources quickly...
> - */
> - boost_dying_task_prio(p, mem);
> -

Before merging 93b43fa5508, we had a following routine.

+static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
{
p = find_lock_task_mm(p);
if (!p) {
@@ -434,9 +452,17 @@ static int oom_kill_task(struct task_struct *p)
K(get_mm_counter(p->mm, MM_FILEPAGES)));
task_unlock(p);

- p->rt.time_slice = HZ; <<---- THIS
+
set_tsk_thread_flag(p, TIF_MEMDIE);
force_sig(SIGKILL, p);
+
+ /*
+ * We give our sacrificial lamb high priority and access to
+ * all the memory it needs. That way it should be able to
+ * exit() and clear out its resources quickly...
+ */
+ boost_dying_task_prio(p, mem);
+
return 0;
}

At that time, I thought that routine is meaningless in non-RT scheduler.
So I Cced Peter but don't get the answer.
I just want to confirm it.

Do you still think it's meaningless?
so you remove it when you revert 93b43fa5508?
Then, this isn't just revert patch but revert + killing meaningless code patch.


-
Kind regards,
Minchan Kim


2011-03-28 09:48:19

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"

Hi

> @@ -434,9 +452,17 @@ static int oom_kill_task(struct task_struct *p)
> K(get_mm_counter(p->mm, MM_FILEPAGES)));
> task_unlock(p);
>
> - p->rt.time_slice = HZ; <<---- THIS
> +
> set_tsk_thread_flag(p, TIF_MEMDIE);
> force_sig(SIGKILL, p);
> +
> + /*
> + * We give our sacrificial lamb high priority and access to
> + * all the memory it needs. That way it should be able to
> + * exit() and clear out its resources quickly...
> + */
> + boost_dying_task_prio(p, mem);
> +
> return 0;
> }
>
> At that time, I thought that routine is meaningless in non-RT scheduler.
> So I Cced Peter but don't get the answer.
> I just want to confirm it.
>
> Do you still think it's meaningless?

In short, yes.


> so you remove it when you revert 93b43fa5508?
> Then, this isn't just revert patch but revert + killing meaningless code patch.

If you want, I'd like to rename a patch title. That said, we can't revert
93b43fa5508 simple cleanly, several patches depend on it. therefore I
reverted it manualy. and at that time, I don't want to resurrect
meaningless logic. anyway it's no matter. Luis is preparing new patches.
therefore we will get the same end result. :)


2011-03-28 09:52:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"

On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
>
> At that time, I thought that routine is meaningless in non-RT scheduler.
> So I Cced Peter but don't get the answer.
> I just want to confirm it.

Probably lost somewhere in the mess that is my inbox :/, what is the
full question?

2011-03-28 12:21:41

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"

Hi Peter,

On Mon, Mar 28, 2011 at 11:51:36AM +0200, Peter Zijlstra wrote:
> On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
> >
> > At that time, I thought that routine is meaningless in non-RT scheduler.
> > So I Cced Peter but don't get the answer.
> > I just want to confirm it.
>
> Probably lost somewhere in the mess that is my inbox :/, what is the
> full question?

The question is we had a routine which change rt.time_slice with HZ to
accelarate task exit. But when we applied 93b43fa5508, we found it isn't effective
any more about normal task. So we removed it. Is it right?

And Kosaki is about to revert 93b43fa5508 to find out the problem of this thread
and Luis said he has a another solution to replace 93b43fa5508.
If rt.time_slice handleing is effective, we should restore it until Luis's patch
will be merged.


--
Kind regards,
Minchan Kim

2011-03-28 12:29:01

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"

On Mon, Mar 28, 2011 at 06:48:13PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > @@ -434,9 +452,17 @@ static int oom_kill_task(struct task_struct *p)
> > K(get_mm_counter(p->mm, MM_FILEPAGES)));
> > task_unlock(p);
> >
> > - p->rt.time_slice = HZ; <<---- THIS
> > +
> > set_tsk_thread_flag(p, TIF_MEMDIE);
> > force_sig(SIGKILL, p);
> > +
> > + /*
> > + * We give our sacrificial lamb high priority and access to
> > + * all the memory it needs. That way it should be able to
> > + * exit() and clear out its resources quickly...
> > + */
> > + boost_dying_task_prio(p, mem);
> > +
> > return 0;
> > }
> >
> > At that time, I thought that routine is meaningless in non-RT scheduler.
> > So I Cced Peter but don't get the answer.
> > I just want to confirm it.
> >
> > Do you still think it's meaningless?
>
> In short, yes.
>
>
> > so you remove it when you revert 93b43fa5508?
> > Then, this isn't just revert patch but revert + killing meaningless code patch.
>
> If you want, I'd like to rename a patch title. That said, we can't revert
> 93b43fa5508 simple cleanly, several patches depend on it. therefore I
> reverted it manualy. and at that time, I don't want to resurrect
> meaningless logic. anyway it's no matter. Luis is preparing new patches.
> therefore we will get the same end result. :)

I don't mind it, either. :)
I just want to make sure the meaningless logic.
Thanks.

--
Kind regards,
Minchan Kim

2011-03-28 12:29:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"

On Mon, 2011-03-28 at 21:21 +0900, Minchan Kim wrote:
> Hi Peter,
>
> On Mon, Mar 28, 2011 at 11:51:36AM +0200, Peter Zijlstra wrote:
> > On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
> > >
> > > At that time, I thought that routine is meaningless in non-RT scheduler.
> > > So I Cced Peter but don't get the answer.
> > > I just want to confirm it.
> >
> > Probably lost somewhere in the mess that is my inbox :/, what is the
> > full question?
>
> The question is we had a routine which change rt.time_slice with HZ to
> accelarate task exit. But when we applied 93b43fa5508, we found it isn't effective
> any more about normal task. So we removed it. Is it right?

rt.time_slice is only relevant to SCHED_RR, since you seem to use
SCHED_FIFO (which runs for as long as the task is runnable), its
completely irrelevant.

> And Kosaki is about to revert 93b43fa5508 to find out the problem of this thread
> and Luis said he has a another solution to replace 93b43fa5508.
> If rt.time_slice handleing is effective, we should restore it until Luis's patch
> will be merged.

Right, so only SCHED_RR is affected by time_slice, it will be
decremented on tick (so anything that avoids ticks will also avoid the
decrement) and once it reaches 0 the task will be queued at the tail of
its static priority and reset the slice. If there is no other task on
that same priority we'll again schedule that task.

In short, don't use SCHED_RR and don't worry about time_slice.

2011-03-28 12:40:51

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"

On Mon, Mar 28, 2011 at 02:28:27PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-03-28 at 21:21 +0900, Minchan Kim wrote:
> > Hi Peter,
> >
> > On Mon, Mar 28, 2011 at 11:51:36AM +0200, Peter Zijlstra wrote:
> > > On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
> > > >
> > > > At that time, I thought that routine is meaningless in non-RT scheduler.
> > > > So I Cced Peter but don't get the answer.
> > > > I just want to confirm it.
> > >
> > > Probably lost somewhere in the mess that is my inbox :/, what is the
> > > full question?
> >
> > The question is we had a routine which change rt.time_slice with HZ to
> > accelarate task exit. But when we applied 93b43fa5508, we found it isn't effective
> > any more about normal task. So we removed it. Is it right?
>
> rt.time_slice is only relevant to SCHED_RR, since you seem to use
> SCHED_FIFO (which runs for as long as the task is runnable), its
> completely irrelevant.
>
> > And Kosaki is about to revert 93b43fa5508 to find out the problem of this thread
> > and Luis said he has a another solution to replace 93b43fa5508.
> > If rt.time_slice handleing is effective, we should restore it until Luis's patch
> > will be merged.
>
> Right, so only SCHED_RR is affected by time_slice, it will be
> decremented on tick (so anything that avoids ticks will also avoid the
> decrement) and once it reaches 0 the task will be queued at the tail of
> its static priority and reset the slice. If there is no other task on
> that same priority we'll again schedule that task.
>
> In short, don't use SCHED_RR and don't worry about time_slice.

There was meaningless code in there. I guess it was in there from CFS.
Thanks for the explanation, Peter.

--
Kind regards,
Minchan Kim

Subject: Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"

On Mon, Mar 28, 2011 at 09:40:25PM +0900, Minchan Kim wrote:
| On Mon, Mar 28, 2011 at 02:28:27PM +0200, Peter Zijlstra wrote:
| > On Mon, 2011-03-28 at 21:21 +0900, Minchan Kim wrote:
| > > Hi Peter,
| > >
| > > On Mon, Mar 28, 2011 at 11:51:36AM +0200, Peter Zijlstra wrote:
| > > > On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
| > > > >
| > > > > At that time, I thought that routine is meaningless in non-RT scheduler.
| > > > > So I Cced Peter but don't get the answer.
| > > > > I just want to confirm it.
| > > >
| > > > Probably lost somewhere in the mess that is my inbox :/, what is the
| > > > full question?
| > >
| > > The question is we had a routine which change rt.time_slice with HZ to
| > > accelarate task exit. But when we applied 93b43fa5508, we found it isn't effective
| > > any more about normal task. So we removed it. Is it right?
| >
| > rt.time_slice is only relevant to SCHED_RR, since you seem to use
| > SCHED_FIFO (which runs for as long as the task is runnable), its
| > completely irrelevant.
| >
| > > And Kosaki is about to revert 93b43fa5508 to find out the problem of this thread
| > > and Luis said he has a another solution to replace 93b43fa5508.
| > > If rt.time_slice handleing is effective, we should restore it until Luis's patch
| > > will be merged.
| >
| > Right, so only SCHED_RR is affected by time_slice, it will be
| > decremented on tick (so anything that avoids ticks will also avoid the
| > decrement) and once it reaches 0 the task will be queued at the tail of
| > its static priority and reset the slice. If there is no other task on
| > that same priority we'll again schedule that task.
| >
| > In short, don't use SCHED_RR and don't worry about time_slice.
|
| There was meaningless code in there. I guess it was in there from CFS.
| Thanks for the explanation, Peter.

Yes, it was CFS related:

p = find_lock_task_mm(p);
...
p->rt.time_slice = HZ; <<---- THIS

Peter, would that be effective to boost the priority of the dying task?
I mean, in the context of SCHED_OTHER tasks would it really help the dying
task to be scheduled sooner to release its resources? If so, as we remove
the code in commit 93b43fa5508 we should re-add that old code.

Luis
--
[ Luis Claudio R. Goncalves Red Hat - Realtime Team ]
[ Fingerprint: 4FDD B8C4 3C59 34BD 8BE9 2696 7203 D980 A448 C8F8 ]

2011-03-28 13:18:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"

On Mon, 2011-03-28 at 10:10 -0300, Luis Claudio R. Goncalves wrote:
> | There was meaningless code in there. I guess it was in there from CFS.
> | Thanks for the explanation, Peter.
>
> Yes, it was CFS related:
>
> p = find_lock_task_mm(p);
> ...
> p->rt.time_slice = HZ; <<---- THIS

CFS has never used rt.time_slice, that's always been a pure SCHED_RR
thing.

> Peter, would that be effective to boost the priority of the dying task?

The thing you're currently doing, making it SCHED_FIFO ?

> I mean, in the context of SCHED_OTHER tasks would it really help the dying
> task to be scheduled sooner to release its resources?

That very much depends on how all this stuff works, I guess if everybody
serializes on OOM and only the first will actually kill a task and all
the waiting tasks will try to allocate a page again before also doing
the OOM thing, and the pending tasks are woken after the OOM target task
has completed dying.. then I don't see much point in boosting things,
since everybody interested in memory will block and eventually only the
dying task will be left running.

Its been a very long while since I stared at the OOM code..

> If so, as we remove
> the code in commit 93b43fa5508 we should re-add that old code.

It doesn't make any sense to fiddle with rt.time_slice afaict.

2011-03-28 13:48:18

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"

On Mon, Mar 28, 2011 at 10:10:29AM -0300, Luis Claudio R. Goncalves wrote:
> On Mon, Mar 28, 2011 at 09:40:25PM +0900, Minchan Kim wrote:
> | On Mon, Mar 28, 2011 at 02:28:27PM +0200, Peter Zijlstra wrote:
> | > On Mon, 2011-03-28 at 21:21 +0900, Minchan Kim wrote:
> | > > Hi Peter,
> | > >
> | > > On Mon, Mar 28, 2011 at 11:51:36AM +0200, Peter Zijlstra wrote:
> | > > > On Fri, 2011-03-25 at 00:27 +0900, Minchan Kim wrote:
> | > > > >
> | > > > > At that time, I thought that routine is meaningless in non-RT scheduler.
> | > > > > So I Cced Peter but don't get the answer.
> | > > > > I just want to confirm it.
> | > > >
> | > > > Probably lost somewhere in the mess that is my inbox :/, what is the
> | > > > full question?
> | > >
> | > > The question is we had a routine which change rt.time_slice with HZ to
> | > > accelarate task exit. But when we applied 93b43fa5508, we found it isn't effective
> | > > any more about normal task. So we removed it. Is it right?
> | >
> | > rt.time_slice is only relevant to SCHED_RR, since you seem to use
> | > SCHED_FIFO (which runs for as long as the task is runnable), its
> | > completely irrelevant.
> | >
> | > > And Kosaki is about to revert 93b43fa5508 to find out the problem of this thread
> | > > and Luis said he has a another solution to replace 93b43fa5508.
> | > > If rt.time_slice handleing is effective, we should restore it until Luis's patch
> | > > will be merged.
> | >
> | > Right, so only SCHED_RR is affected by time_slice, it will be
> | > decremented on tick (so anything that avoids ticks will also avoid the
> | > decrement) and once it reaches 0 the task will be queued at the tail of
> | > its static priority and reset the slice. If there is no other task on
> | > that same priority we'll again schedule that task.
> | >
> | > In short, don't use SCHED_RR and don't worry about time_slice.
> |
> | There was meaningless code in there. I guess it was in there from CFS.
> | Thanks for the explanation, Peter.
>
> Yes, it was CFS related:

I think it wasn't related CFS but O(1).
I guess as we changed O(1) with CFS, the fault was remained.

--
Kind regards,
Minchan Kim

Subject: Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"

On Mon, Mar 28, 2011 at 03:18:13PM +0200, Peter Zijlstra wrote:
| On Mon, 2011-03-28 at 10:10 -0300, Luis Claudio R. Goncalves wrote:
| > | There was meaningless code in there. I guess it was in there from CFS.
| > | Thanks for the explanation, Peter.
| >
| > Yes, it was CFS related:
| >
| > p = find_lock_task_mm(p);
| > ...
| > p->rt.time_slice = HZ; <<---- THIS
|
| CFS has never used rt.time_slice, that's always been a pure SCHED_RR
| thing.
|
| > Peter, would that be effective to boost the priority of the dying task?
|
| The thing you're currently doing, making it SCHED_FIFO ?

I meant the p->rt.time_slice line, but you already answered my question.
Thanks :)

| > I mean, in the context of SCHED_OTHER tasks would it really help the dying
| > task to be scheduled sooner to release its resources?
|
| That very much depends on how all this stuff works, I guess if everybody
| serializes on OOM and only the first will actually kill a task and all
| the waiting tasks will try to allocate a page again before also doing
| the OOM thing, and the pending tasks are woken after the OOM target task
| has completed dying.. then I don't see much point in boosting things,
| since everybody interested in memory will block and eventually only the
| dying task will be left running.
|
| Its been a very long while since I stared at the OOM code..
|
| > If so, as we remove
| > the code in commit 93b43fa5508 we should re-add that old code.
|
| It doesn't make any sense to fiddle with rt.time_slice afaict.
---end quoted text---

--
[ Luis Claudio R. Goncalves Red Hat - Realtime Team ]
[ Fingerprint: 4FDD B8C4 3C59 34BD 8BE9 2696 7203 D980 A448 C8F8 ]

2011-03-29 02:46:38

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 2/5] Revert "oom: give the dying task a higher priority"

Hi

> > I mean, in the context of SCHED_OTHER tasks would it really help the dying
> > task to be scheduled sooner to release its resources?
>
> That very much depends on how all this stuff works, I guess if everybody
> serializes on OOM and only the first will actually kill a task and all
> the waiting tasks will try to allocate a page again before also doing
> the OOM thing, and the pending tasks are woken after the OOM target task
> has completed dying.. then I don't see much point in boosting things,
> since everybody interested in memory will block and eventually only the
> dying task will be left running.

Probably I can answer this question. When OOM occur, kernel has very a
few pages (typically 10 - 100). but not 0. therefore bloody page-in vs
page-out battle (aka allocation vs free battle) is running.

IOW, While we have multiple cpu or per-cpu page queue, we don't see
page cache become completely 0.

Therefore, not killed task doesn't sleep completely. page-out may have
very small allocation successful chance. (but almostly it's fail. pages
are stealed by another task)

Before Luis's patch, kernel livelock on oom may be solved within 30min,
but after his patch, it's solved within 1 second. that's big different
for human response time. That's the test result.

Thanks.