I looked up nothing useful with Google, so I'm here for help.
When this happens: I use memcg to limit the memory use of a process, and
when the memcg cgroup was out of memory,
the process was OOM-killed; however, it cannot really complete the
exiting. Here is some information.
OS version: CentOS 6.2, kernel 2.6.32.220.7.1
/proc/pid/stack
---------------------------------------------------------------
[<ffffffff810597ca>] __cond_resched+0x2a/0x40
[<ffffffff81121569>] unmap_vmas+0xb49/0xb70
[<ffffffff8112822e>] exit_mmap+0x7e/0x140
[<ffffffff8105b078>] mmput+0x58/0x110
[<ffffffff81061aad>] exit_mm+0x11d/0x160
[<ffffffff81061c9d>] do_exit+0x1ad/0x860
[<ffffffff81062391>] do_group_exit+0x41/0xb0
[<ffffffff81077cd8>] get_signal_to_deliver+0x1e8/0x430
[<ffffffff8100a4c4>] do_notify_resume+0xf4/0x8b0
[<ffffffff8100b281>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
/proc/pid/stat
---------------------------------------------------------------
11337 (CF_user_based) R 1 11314 11314 0 -1 4203524 7753602 0 0 0 622 1806
0 0 -2 0 1 0 324381340 0 0 18446744073709551615 0 0 0 0 0 0 0 0 66784 0 0
0 17 3 1 1 0 0 0
/proc/pid/status
Name: CF_user_based
State: R (running)
Tgid: 11337
Pid: 11337
PPid: 1
TracerPid: 0
Uid: 32114 32114 32114 32114
Gid: 32114 32114 32114 32114
Utrace: 0
FDSize: 128
Groups: 32114
Threads: 1
SigQ: 2/2325005
SigPnd: 0000000000000000
ShdPnd: 0000000000004100
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 00000001800104e0
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed: ffffffff
Cpus_allowed_list: 0-31
Mems_allowed: 00000000,00000003
Mems_allowed_list: 0-1
voluntary_ctxt_switches: 4300
nonvoluntary_ctxt_switches: 77
/var/log/messages
---------------------------------------------------------------
Oct 17 15:22:19 hpc16 kernel: CF_user_based invoked oom-killer:
gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Oct 17 15:22:19 hpc16 kernel: CF_user_based cpuset=/ mems_allowed=0-1
Oct 17 15:22:19 hpc16 kernel: Pid: 3909, comm: CF_user_based Not tainted
2.6.32-2.0.0.1 #4
Oct 17 15:22:19 hpc16 kernel: Call Trace:
Oct 17 15:22:19 hpc16 kernel: [<ffffffff810fd915>] ? dump_header+0x85/0x1a0
Oct 17 15:22:19 hpc16 kernel: [<ffffffff810fde4e>] ?
oom_kill_process+0x25e/0x2a0
Oct 17 15:22:19 hpc16 kernel: [<ffffffff810fdf5e>] ?
select_bad_process+0xce/0x110
Oct 17 15:22:19 hpc16 kernel: [<ffffffff810fe448>] ?
out_of_memory+0x1a8/0x390
Oct 17 15:22:19 hpc16 kernel: [<ffffffff8110cb0a>] ?
__alloc_pages_nodemask+0x73a/0x750
Oct 17 15:22:19 hpc16 kernel: [<ffffffff8114d4f5>] ?
__mem_cgroup_commit_charge+0x45/0x90
Oct 17 15:22:19 hpc16 kernel: [<ffffffff8113d7fa>] ?
alloc_pages_vma+0x9a/0x190
Oct 17 15:22:19 hpc16 kernel: [<ffffffff8112443c>] ?
handle_pte_fault+0x4cc/0xa90
Oct 17 15:22:19 hpc16 kernel: [<ffffffff8113cefb>] ?
alloc_pages_current+0xab/0x110
Oct 17 15:22:19 hpc16 kernel: [<ffffffff8100bbae>] ?
invalidate_interrupt5+0xe/0x20
Oct 17 15:22:19 hpc16 kernel: [<ffffffff81124b2a>] ?
handle_mm_fault+0x12a/0x1b0
Oct 17 15:22:19 hpc16 kernel: [<ffffffff814a6789>] ?
do_page_fault+0x199/0x550
Oct 17 15:22:19 hpc16 kernel: [<ffffffff812500a8>] ?
call_rwsem_wake+0x18/0x30
Oct 17 15:22:19 hpc16 kernel: [<ffffffff8100bbae>] ?
invalidate_interrupt5+0xe/0x20
Oct 17 15:22:19 hpc16 kernel: [<ffffffff814a3965>] ? page_fault+0x25/0x30
Oct 17 15:22:19 hpc16 kernel: Mem-Info:
Oct 17 15:22:19 hpc16 kernel: Node 0 Normal per-cpu:
Oct 17 15:22:19 hpc16 kernel: CPU 0: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 1: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 2: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 3: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 4: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 5: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 6: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 7: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 8: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 9: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 10: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 11: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 12: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 13: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 14: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 15: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 16: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 17: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 18: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 19: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 20: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 21: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 22: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 23: hi: 186, btch: 31 usd: 18
Oct 17 15:22:19 hpc16 kernel: Node 1 DMA per-cpu:
Oct 17 15:22:19 hpc16 kernel: CPU 0: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 1: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 2: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 3: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 4: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 5: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 6: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 7: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 8: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 9: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 10: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 11: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 12: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 13: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 14: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 15: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 16: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 17: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 18: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 19: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 20: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 21: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 22: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 23: hi: 0, btch: 1 usd: 0
Oct 17 15:22:19 hpc16 kernel: Node 1 DMA32 per-cpu:
Oct 17 15:22:19 hpc16 kernel: CPU 0: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 1: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 2: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 3: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 4: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 5: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 6: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 7: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 8: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 9: hi: 186, btch: 31 usd: 55
Oct 17 15:22:19 hpc16 kernel: CPU 10: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 11: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 12: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 13: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 14: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 15: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 16: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 17: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 18: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 19: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 20: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 21: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 22: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 23: hi: 186, btch: 31 usd: 2
Oct 17 15:22:19 hpc16 kernel: Node 1 Normal per-cpu:
Oct 17 15:22:19 hpc16 kernel: CPU 0: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 1: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 2: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 3: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 4: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 5: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 6: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 7: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 8: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 9: hi: 186, btch: 31 usd: 55
Oct 17 15:22:19 hpc16 kernel: CPU 10: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 11: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 12: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 13: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 14: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 15: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 16: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 17: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 18: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 19: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 20: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 21: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 22: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: CPU 23: hi: 186, btch: 31 usd: 0
Oct 17 15:22:19 hpc16 kernel: active_anon:71580973 inactive_anon:228137
isolated_anon:0
Oct 17 15:22:19 hpc16 kernel: active_file:509 inactive_file:805
isolated_file:0
Oct 17 15:22:19 hpc16 kernel: unevictable:1882 dirty:0 writeback:0
unstable:0
Oct 17 15:22:19 hpc16 kernel: free:162389 slab_reclaimable:8722
slab_unreclaimable:10370
Oct 17 15:22:19 hpc16 kernel: mapped:681 shmem:48 pagetables:1612154
bounce:0
Oct 17 15:22:19 hpc16 kernel: Node 0 Normal free:32512kB min:32768kB
low:40960kB high:49152kB active_anon:146614460kB inactive_anon:353888kB
active_file:2036kB inactive_file:2380kB unevictable:3348kB
isolated(anon):0kB isolated(file):0kB present:148930560kB mlocked:3348kB
dirty:0kB writeback:0kB mapped:504kB shmem:116kB slab_reclaimable:14076kB
slab_unreclaimable:23092kB kernel_stack:3728kB pagetables:292144kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
Oct 17 15:22:19 hpc16 kernel: lowmem_reserve[]: 0 0 0 0
Oct 17 15:22:19 hpc16 kernel: Node 1 DMA free:15916kB min:0kB low:0kB
high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB
inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:15360kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB
pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
Oct 17 15:22:19 hpc16 kernel: lowmem_reserve[]: 0 3243 145401 145401
Oct 17 15:22:19 hpc16 kernel: Node 1 DMA32 free:569668kB min:728kB
low:908kB high:1092kB active_anon:503032kB inactive_anon:2804kB
active_file:0kB inactive_file:124kB unevictable:0kB isolated(anon):0kB
isolated(file):0kB present:3321540kB mlocked:0kB dirty:0kB writeback:0kB
mapped:0kB shmem:0kB slab_reclaimable:4388kB slab_unreclaimable:528kB
kernel_stack:0kB pagetables:724kB unstable:0kB bounce:0kB
writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 17 15:22:19 hpc16 kernel: lowmem_reserve[]: 0 0 142157 142157
Oct 17 15:22:19 hpc16 kernel: Node 1 Normal free:31460kB min:32028kB
low:40032kB high:48040kB active_anon:139206456kB inactive_anon:555856kB
active_file:0kB inactive_file:716kB unevictable:4180kB isolated(anon):0kB
isolated(file):0kB present:145569280kB mlocked:4180kB dirty:0kB
writeback:0kB mapped:2220kB shmem:76kB slab_reclaimable:16424kB
slab_unreclaimable:17860kB kernel_stack:1000kB pagetables:6155748kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
Oct 17 15:22:19 hpc16 kernel: lowmem_reserve[]: 0 0 0 0
Oct 17 15:22:19 hpc16 kernel: Node 0 Normal: 7383*4kB 0*8kB 1*16kB 2*32kB
1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 33516kB
Oct 17 15:22:19 hpc16 kernel: Node 1 DMA: 1*4kB 1*8kB 2*16kB 2*32kB 1*64kB
1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15916kB
Oct 17 15:22:19 hpc16 kernel: Node 1 DMA32: 1057*4kB 920*8kB 774*16kB
709*32kB 544*64kB 398*128kB 200*256kB 56*512kB 25*1024kB 12*2048kB
75*4096kB = 569668kB
Oct 17 15:22:19 hpc16 kernel: Node 1 Normal: 6885*4kB 11*8kB 0*16kB 0*32kB
0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 31724kB
Oct 17 15:22:19 hpc16 kernel: 1422 total pagecache pages
Oct 17 15:22:19 hpc16 kernel: 0 pages in swap cache
Oct 17 15:22:19 hpc16 kernel: Swap cache stats: add 0, delete 0, find 0/0
Oct 17 15:22:19 hpc16 kernel: Free swap = 0kB
Oct 17 15:22:19 hpc16 kernel: Total swap = 0kB
Oct 17 15:22:19 hpc16 kernel: 75497471 pages RAM
Oct 17 15:22:19 hpc16 kernel: 1093826 pages reserved
Oct 17 15:22:19 hpc16 kernel: 11054 pages shared
Oct 17 15:22:19 hpc16 kernel: 74234502 pages non-shared
Oct 17 15:22:19 hpc16 kernel: [ pid ] uid tgid total_vm rss cpu
oom_adj oom_score_adj name
Oct 17 15:22:19 hpc16 kernel: [ 673] 0 673 2679 187
9 -17 -1000 udevd
Oct 17 15:22:19 hpc16 kernel: [ 1292] 0 1292 2678 174
3 -17 -1000 udevd
Oct 17 15:22:19 hpc16 kernel: [ 1295] 0 1295 2678 169
17 -17 -1000 udevd
Oct 17 15:22:19 hpc16 kernel: [ 1464] 0 1464 16017 357
6 0 0 sshd
Oct 17 15:22:19 hpc16 kernel: [ 1695] 0 1695 1553 142
12 0 0 portreserve
Oct 17 15:22:19 hpc16 kernel: [ 1702] 0 1702 62187 287
0 0 0 rsyslogd
Oct 17 15:22:19 hpc16 kernel: [ 1731] 0 1731 2301 146
15 0 0 irqbalance
Oct 17 15:22:19 hpc16 kernel: [ 1749] 32 1749 4768 206
8 0 0 rpcbind
Oct 17 15:22:19 hpc16 kernel: [ 1769] 29 1769 5800 240
0 0 0 rpc.statd
Oct 17 15:22:19 hpc16 kernel: [ 1828] 0 1828 6859 106
15 0 0 rpc.idmapd
Oct 17 15:22:19 hpc16 kernel: [ 1919] 81 1919 5392 159
13 0 0 dbus-daemon
Oct 17 15:22:19 hpc16 kernel: [ 1943] 0 1943 1033 156
12 0 0 acpid
Oct 17 15:22:19 hpc16 kernel: [ 1952] 68 1952 6343 426
15 0 0 hald
Oct 17 15:22:19 hpc16 kernel: [ 1953] 0 1953 4540 170
0 0 0 hald-runner
Oct 17 15:22:19 hpc16 kernel: [ 1982] 0 1982 5069 152
2 0 0 hald-addon-inpu
Oct 17 15:22:19 hpc16 kernel: [ 1989] 68 1989 4465 190
0 0 0 hald-addon-acpi
Oct 17 15:22:19 hpc16 kernel: [ 2005] 0 2005 49312 1149
13 0 0 snmpd
Oct 17 15:22:19 hpc16 kernel: [ 2013] 38 2013 7552 305
16 0 0 ntpd
Oct 17 15:22:19 hpc16 kernel: [ 2101] 0 2101 19669 422
20 0 0 master
Oct 17 15:22:19 hpc16 kernel: [ 2120] 89 2120 19732 418
16 0 0 qmgr
Oct 17 15:22:19 hpc16 kernel: [ 2133] 0 2133 29710 205
1 0 0 abrtd
Oct 17 15:22:19 hpc16 kernel: [ 2141] 0 2141 2304 145
0 0 0 abrt-dump-oops
Oct 17 15:22:19 hpc16 kernel: [ 2157] 32114 2157 237261 1782
0 0 0 python
Oct 17 15:22:19 hpc16 kernel: [ 2164] 0 2164 29311 291
10 0 0 crond
Oct 17 15:22:19 hpc16 kernel: [ 2185] 0 2185 5373 110
12 0 0 atd
Oct 17 15:22:19 hpc16 kernel: [ 2202] 0 2202 47595 1416
8 0 0 certmaster
Oct 17 15:22:19 hpc16 kernel: [ 2211] 0 2211 82597 4644
9 0 0 funcd
Oct 17 15:22:19 hpc16 kernel: [ 2218] 0 2218 996 89
13 0 0 supervise
Oct 17 15:22:19 hpc16 kernel: [ 2228] 0 2228 27049 203
2 0 0 run
Oct 17 15:22:19 hpc16 kernel: [ 2232] 0 2232 41699 3851
0 0 0 perl
Oct 17 15:22:19 hpc16 kernel: [ 2242] 0 2242 1029 132
19 0 0 mingetty
Oct 17 15:22:19 hpc16 kernel: [ 2244] 0 2244 1029 132
1 0 0 mingetty
Oct 17 15:22:19 hpc16 kernel: [ 2246] 0 2246 1029 132
15 0 0 mingetty
Oct 17 15:22:19 hpc16 kernel: [ 2248] 0 2248 1029 131
16 0 0 mingetty
Oct 17 15:22:19 hpc16 kernel: [ 2250] 0 2250 1029 131
17 0 0 mingetty
Oct 17 15:22:19 hpc16 kernel: [ 2252] 0 2252 1029 132
15 0 0 mingetty
Oct 17 15:22:19 hpc16 kernel: [ 2278] 0 2278 44389 4667
9 0 0 perl
Oct 17 15:22:19 hpc16 kernel: [ 2283] 0 2283 44389 4535
9 0 0 perl
Oct 17 15:22:19 hpc16 kernel: [ 2301] 0 2301 23312 184
15 -17 -1000 auditd
Oct 17 15:22:19 hpc16 kernel: [21038] 0 21038 3976 1883
6 0 0 pbs_mom
Oct 17 15:22:19 hpc16 kernel: [11314] 32114 11314 2704 196
15 0 0 cglimit
Oct 17 15:22:19 hpc16 kernel: [11319] 32114 11319 11089 394
14 0 0 orted
Oct 17 15:22:19 hpc16 kernel: [11337] 32114 11337 842071788 71791393
1 0 0 CF_user_based
Oct 17 15:22:19 hpc16 kernel: [24741] 89 24741 19689 414
16 0 0 pickup
Oct 17 15:22:19 hpc16 kernel: Out of memory: Kill process 11337
(CF_user_based) score 986 or sacrifice child
Oct 17 15:22:19 hpc16 kernel: Killed process 11337, UID 32114,
(CF_user_based) total-vm:3368287152kB, anon-rss:287164596kB, file-rss:976kB
--
Using Opera's revolutionary email client: http://www.opera.com/mail/
On Wed 17-10-12 18:23:34, gaoqiang wrote:
> I looked up nothing useful with google,so I'm here for help..
>
> when this happens: I use memcg to limit the memory use of a
> process,and when the memcg cgroup was out of memory,
> the process was oom-killed however,it cannot really complete the
> exiting. here is the some information
How many tasks are in the group and what kind of memory do they use?
Is it possible that you were hit by the same issue as described in
79dfdacc memcg: make oom_lock 0 and 1 based rather than counter.
> OS version: centos6.2 2.6.32.220.7.1
Your kernel is quite old and you should probably be asking your
distribution to help you out. There have been many fixes since 2.6.32.
Are you able to reproduce the same issue with the current vanilla kernel?
> /proc/pid/stack
> ---------------------------------------------------------------
>
> [<ffffffff810597ca>] __cond_resched+0x2a/0x40
> [<ffffffff81121569>] unmap_vmas+0xb49/0xb70
> [<ffffffff8112822e>] exit_mmap+0x7e/0x140
> [<ffffffff8105b078>] mmput+0x58/0x110
> [<ffffffff81061aad>] exit_mm+0x11d/0x160
> [<ffffffff81061c9d>] do_exit+0x1ad/0x860
> [<ffffffff81062391>] do_group_exit+0x41/0xb0
> [<ffffffff81077cd8>] get_signal_to_deliver+0x1e8/0x430
> [<ffffffff8100a4c4>] do_notify_resume+0xf4/0x8b0
> [<ffffffff8100b281>] int_signal+0x12/0x17
> [<ffffffffffffffff>] 0xffffffffffffffff
This looks strange because this is just the exit path, which shouldn't
deadlock or anything. Is this stack stable? Have you tried checking
it more times?
--
Michal Hocko
SUSE Labs
I don't know whether the process will exit eventually, but this stack
lasts for hours, which is obviously abnormal.
The situation: we use a command called "cglimit" to fork-and-exec the
worker process, and "cglimit" will
set some limitations on the worker with cgroups. For now, we limit the
memory, and we also use the cpu cgroup, but with
no limitation, so when the worker is running, the cgroup directories look
like the following:
/cgroup/memory/worker : this directory limits the memory
/cgroup/cpu/worker : no limit, but the worker process is in it.
For some reason (some other process we didn't consider), the worker
process invokes the global oom-killer,
not the cgroup oom-killer. Then the worker process hangs there.
Actually, if we don't put the worker process into the cpu cgroup,
this never happens.
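(For reference, a minimal sketch of the kind of setup described above, using
the /cgroup mount points shown in this thread; the group name "worker", the
700M figure and the ./worker command are placeholders, not the actual cglimit
implementation:)

  # memory controller: enforce a hard limit on the group
  mkdir /cgroup/memory/worker
  echo 700M > /cgroup/memory/worker/memory.limit_in_bytes

  # cpu controller: the group is created but nothing is tuned, so a
  # child group's cpu.rt_runtime_us stays at its default of 0
  mkdir /cgroup/cpu/worker

  # start the worker and move it into both groups
  ./worker &
  echo $! > /cgroup/memory/worker/tasks
  echo $! > /cgroup/cpu/worker/tasks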
On Sat, Oct 20, 2012 at 12:04 AM, Michal Hocko <[email protected]> wrote:
>
> On Wed 17-10-12 18:23:34, gaoqiang wrote:
> > I looked up nothing useful with google,so I'm here for help..
> >
> > when this happens: I use memcg to limit the memory use of a
> > process,and when the memcg cgroup was out of memory,
> > the process was oom-killed however,it cannot really complete the
> > exiting. here is the some information
>
> How many tasks are in the group and what kind of memory do they use?
> Is it possible that you were hit by the same issue as described in
> 79dfdacc memcg: make oom_lock 0 and 1 based rather than counter.
>
> > OS version: centos6.2 2.6.32.220.7.1
>
> Your kernel is quite old and you should be probably asking your
> distribution to help you out. There were many fixes since 2.6.32.
> Are you able to reproduce the same issue with the current vanila kernel?
>
> > /proc/pid/stack
> > ---------------------------------------------------------------
> >
> > [<ffffffff810597ca>] __cond_resched+0x2a/0x40
> > [<ffffffff81121569>] unmap_vmas+0xb49/0xb70
> > [<ffffffff8112822e>] exit_mmap+0x7e/0x140
> > [<ffffffff8105b078>] mmput+0x58/0x110
> > [<ffffffff81061aad>] exit_mm+0x11d/0x160
> > [<ffffffff81061c9d>] do_exit+0x1ad/0x860
> > [<ffffffff81062391>] do_group_exit+0x41/0xb0
> > [<ffffffff81077cd8>] get_signal_to_deliver+0x1e8/0x430
> > [<ffffffff8100a4c4>] do_notify_resume+0xf4/0x8b0
> > [<ffffffff8100b281>] int_signal+0x12/0x17
> > [<ffffffffffffffff>] 0xffffffffffffffff
>
> This looks strange because this is just an exit part which shouldn't
> deadlock or anything. Is this stack stable? Have you tried to take check
> it more times?
>
> --
> Michal Hocko
> SUSE Labs
On Mon, Oct 22, 2012 at 7:46 AM, Qiang Gao <[email protected]> wrote:
> I don't know whether the process will exit finally, bug this stack lasts
> for hours, which is obviously unnormal.
> The situation: we use a command calld "cglimit" to fork-and-exec the worker
> process,and the "cglimit" will
> set some limitation on the worker with cgroup. for now,we limit the
> memory,and we also use cpu cgroup,but with
> no limiation,so when the worker is running, the cgroup directory looks like
> following:
>
> /cgroup/memory/worker : this directory limit the memory
> /cgroup/cpu/worker :with no limit,but worker process is in.
>
> for some reason(some other process we didn't consider), the worker process
> invoke global oom-killer,
> not cgroup-oom-killer. then the worker process hangs there.
>
> Actually, if we didn't set the worker process into the cpu cgroup, this will
> never happens.
>
You said you don't use CPU limits, right? Can you also send in the
output of /proc/sched_debug? Can you also send in your
/etc/cgconfig.conf? If the OOM is not caused by the cgroup memory limit
and the global system is under pressure in 2.6.32, it can trigger an
OOM.
Also:
1. Have you turned off swapping (it seems like it), right?
2. Do you have a NUMA policy set up for this task?
Can you also share the .config (not sure if any special patches are
being used) for the version you've mentioned.
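(For reference, something like the following should gather that information;
the kernel config path is an assumption and varies by distro:)

  cat /proc/sched_debug
  cat /etc/cgconfig.conf
  cat /boot/config-$(uname -r)   # or wherever the distro ships the config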
Balbir
On Mon 22-10-12 10:16:43, Qiang Gao wrote:
> I don't know whether the process will exit finally, bug this stack lasts
> for hours, which is obviously unnormal.
> The situation: we use a command calld "cglimit" to fork-and-exec the
> worker process,and the "cglimit" will
> set some limitation on the worker with cgroup. for now,we limit the
> memory,and we also use cpu cgroup,but with
> no limiation,so when the worker is running, the cgroup directory looks like
> following:
>
> /cgroup/memory/worker : this directory limit the memory
> /cgroup/cpu/worker :with no limit,but worker process is in.
>
> for some reason(some other process we didn't consider), the worker process
> invoke global oom-killer,
Are you sure that this is really global oom? What was the limit for the
group?
> not cgroup-oom-killer. then the worker process hangs there.
>
> Actually, if we didn't set the worker process into the cpu cgroup, this
> will never happens.
Strange, and it smells like a misconfiguration. Could you provide the
complete settings for both controllers?
grep . -r /cgroup/
> On Sat, Oct 20, 2012 at 12:04 AM, Michal Hocko <[email protected]> wrote:
>
> > On Wed 17-10-12 18:23:34, gaoqiang wrote:
> > > I looked up nothing useful with google,so I'm here for help..
> > >
> > > when this happens: I use memcg to limit the memory use of a
> > > process,and when the memcg cgroup was out of memory,
> > > the process was oom-killed however,it cannot really complete the
> > > exiting. here is the some information
> >
> > How many tasks are in the group and what kind of memory do they use?
> > Is it possible that you were hit by the same issue as described in
> > 79dfdacc memcg: make oom_lock 0 and 1 based rather than counter.
> >
> > > OS version: centos6.2 2.6.32.220.7.1
> >
> > Your kernel is quite old and you should be probably asking your
> > distribution to help you out. There were many fixes since 2.6.32.
> > Are you able to reproduce the same issue with the current vanila kernel?
> >
> > > /proc/pid/stack
> > > ---------------------------------------------------------------
> > >
> > > [<ffffffff810597ca>] __cond_resched+0x2a/0x40
> > > [<ffffffff81121569>] unmap_vmas+0xb49/0xb70
> > > [<ffffffff8112822e>] exit_mmap+0x7e/0x140
> > > [<ffffffff8105b078>] mmput+0x58/0x110
> > > [<ffffffff81061aad>] exit_mm+0x11d/0x160
> > > [<ffffffff81061c9d>] do_exit+0x1ad/0x860
> > > [<ffffffff81062391>] do_group_exit+0x41/0xb0
> > > [<ffffffff81077cd8>] get_signal_to_deliver+0x1e8/0x430
> > > [<ffffffff8100a4c4>] do_notify_resume+0xf4/0x8b0
> > > [<ffffffff8100b281>] int_signal+0x12/0x17
> > > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > This looks strange because this is just an exit part which shouldn't
> > deadlock or anything. Is this stack stable? Have you tried to take check
> > it more times?
> >
> > --
> > Michal Hocko
> > SUSE Labs
> >
--
Michal Hocko
SUSE Labs
Information about the system is in the attached file "information.txt".
I cannot reproduce it with the upstream 3.6.0 kernel.
On Sat, Oct 20, 2012 at 12:04 AM, Michal Hocko <[email protected]> wrote:
> On Wed 17-10-12 18:23:34, gaoqiang wrote:
>> I looked up nothing useful with google,so I'm here for help..
>>
>> when this happens: I use memcg to limit the memory use of a
>> process,and when the memcg cgroup was out of memory,
>> the process was oom-killed however,it cannot really complete the
>> exiting. here is the some information
>
> How many tasks are in the group and what kind of memory do they use?
> Is it possible that you were hit by the same issue as described in
> 79dfdacc memcg: make oom_lock 0 and 1 based rather than counter.
>
>> OS version: centos6.2 2.6.32.220.7.1
>
> Your kernel is quite old and you should be probably asking your
> distribution to help you out. There were many fixes since 2.6.32.
> Are you able to reproduce the same issue with the current vanila kernel?
>
>> /proc/pid/stack
>> ---------------------------------------------------------------
>>
>> [<ffffffff810597ca>] __cond_resched+0x2a/0x40
>> [<ffffffff81121569>] unmap_vmas+0xb49/0xb70
>> [<ffffffff8112822e>] exit_mmap+0x7e/0x140
>> [<ffffffff8105b078>] mmput+0x58/0x110
>> [<ffffffff81061aad>] exit_mm+0x11d/0x160
>> [<ffffffff81061c9d>] do_exit+0x1ad/0x860
>> [<ffffffff81062391>] do_group_exit+0x41/0xb0
>> [<ffffffff81077cd8>] get_signal_to_deliver+0x1e8/0x430
>> [<ffffffff8100a4c4>] do_notify_resume+0xf4/0x8b0
>> [<ffffffff8100b281>] int_signal+0x12/0x17
>> [<ffffffffffffffff>] 0xffffffffffffffff
>
> This looks strange because this is just an exit part which shouldn't
> deadlock or anything. Is this stack stable? Have you tried to take check
> it more times?
>
> --
> Michal Hocko
> SUSE Labs
On Tue, Oct 23, 2012 at 9:05 AM, Qiang Gao <[email protected]> wrote:
> information about the system is in the attach file "information.txt"
>
> I can not reproduce it in the upstream 3.6.0 kernel..
>
> On Sat, Oct 20, 2012 at 12:04 AM, Michal Hocko <[email protected]> wrote:
>> On Wed 17-10-12 18:23:34, gaoqiang wrote:
>>> I looked up nothing useful with google,so I'm here for help..
>>>
>>> when this happens: I use memcg to limit the memory use of a
>>> process,and when the memcg cgroup was out of memory,
>>> the process was oom-killed however,it cannot really complete the
>>> exiting. here is the some information
>>
>> How many tasks are in the group and what kind of memory do they use?
>> Is it possible that you were hit by the same issue as described in
>> 79dfdacc memcg: make oom_lock 0 and 1 based rather than counter.
>>
>>> OS version: centos6.2 2.6.32.220.7.1
>>
>> Your kernel is quite old and you should be probably asking your
>> distribution to help you out. There were many fixes since 2.6.32.
>> Are you able to reproduce the same issue with the current vanila kernel?
>>
>>> /proc/pid/stack
>>> ---------------------------------------------------------------
>>>
>>> [<ffffffff810597ca>] __cond_resched+0x2a/0x40
>>> [<ffffffff81121569>] unmap_vmas+0xb49/0xb70
>>> [<ffffffff8112822e>] exit_mmap+0x7e/0x140
>>> [<ffffffff8105b078>] mmput+0x58/0x110
>>> [<ffffffff81061aad>] exit_mm+0x11d/0x160
>>> [<ffffffff81061c9d>] do_exit+0x1ad/0x860
>>> [<ffffffff81062391>] do_group_exit+0x41/0xb0
>>> [<ffffffff81077cd8>] get_signal_to_deliver+0x1e8/0x430
>>> [<ffffffff8100a4c4>] do_notify_resume+0xf4/0x8b0
>>> [<ffffffff8100b281>] int_signal+0x12/0x17
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> This looks strange because this is just an exit part which shouldn't
>> deadlock or anything. Is this stack stable? Have you tried to take check
>> it more times?
Looking at information.txt, I found something interesting
rt_rq[0]:/1314
.rt_nr_running : 1
.rt_throttled : 1
.rt_time : 0.856656
.rt_runtime : 0.000000
cfs_rq[0]:/1314
.exec_clock : 8738.133429
.MIN_vruntime : 0.000001
.min_vruntime : 8739.371271
.max_vruntime : 0.000001
.spread : 0.000000
.spread0 : -9792.255554
.nr_spread_over : 1
.nr_running : 0
.load : 0
.load_avg : 7376.722880
.load_period : 7.203830
.load_contrib : 1023
.load_tg : 1023
.se->exec_start : 282004.715064
.se->vruntime : 18435.664560
.se->sum_exec_runtime : 8738.133429
.se->wait_start : 0.000000
.se->sleep_start : 0.000000
.se->block_start : 0.000000
.se->sleep_max : 0.000000
.se->block_max : 0.000000
.se->exec_max : 77.977054
.se->slice_max : 0.000000
.se->wait_max : 2.664779
.se->wait_sum : 29.970575
.se->wait_count : 102
.se->load.weight : 2
So 1314 is a real-time process, and
cpu.rt_period_us:
1000000
----------------------
cpu.rt_runtime_us:
0
When did tt move to being a real-time process (hint: see rt_nr_running
and rt_throttled)?
Balbir
This process was moved to the RT-priority queue when the global oom-killer
happened, to boost the recovery
of the system, but it wasn't properly dealt with. I still have
no idea where the problem is.
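(A quick way to check this on a live system, as a sketch; the pid is the hung
worker's from this report and the cgroup path is a placeholder:)

  # scheduling policy and RT priority of the hung task
  chrt -p 11337

  # RT budget and throttle state of its cpu group
  cat /cgroup/cpu/worker/cpu.rt_runtime_us
  grep -A 4 'rt_rq\[' /proc/sched_debug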
On Tue, Oct 23, 2012 at 12:40 PM, Balbir Singh <[email protected]> wrote:
> On Tue, Oct 23, 2012 at 9:05 AM, Qiang Gao <[email protected]> wrote:
>> information about the system is in the attach file "information.txt"
>>
>> I can not reproduce it in the upstream 3.6.0 kernel..
>>
>> On Sat, Oct 20, 2012 at 12:04 AM, Michal Hocko <[email protected]> wrote:
>>> On Wed 17-10-12 18:23:34, gaoqiang wrote:
>>>> I looked up nothing useful with google,so I'm here for help..
>>>>
>>>> when this happens: I use memcg to limit the memory use of a
>>>> process,and when the memcg cgroup was out of memory,
>>>> the process was oom-killed however,it cannot really complete the
>>>> exiting. here is the some information
>>>
>>> How many tasks are in the group and what kind of memory do they use?
>>> Is it possible that you were hit by the same issue as described in
>>> 79dfdacc memcg: make oom_lock 0 and 1 based rather than counter.
>>>
>>>> OS version: centos6.2 2.6.32.220.7.1
>>>
>>> Your kernel is quite old and you should be probably asking your
>>> distribution to help you out. There were many fixes since 2.6.32.
>>> Are you able to reproduce the same issue with the current vanila kernel?
>>>
>>>> /proc/pid/stack
>>>> ---------------------------------------------------------------
>>>>
>>>> [<ffffffff810597ca>] __cond_resched+0x2a/0x40
>>>> [<ffffffff81121569>] unmap_vmas+0xb49/0xb70
>>>> [<ffffffff8112822e>] exit_mmap+0x7e/0x140
>>>> [<ffffffff8105b078>] mmput+0x58/0x110
>>>> [<ffffffff81061aad>] exit_mm+0x11d/0x160
>>>> [<ffffffff81061c9d>] do_exit+0x1ad/0x860
>>>> [<ffffffff81062391>] do_group_exit+0x41/0xb0
>>>> [<ffffffff81077cd8>] get_signal_to_deliver+0x1e8/0x430
>>>> [<ffffffff8100a4c4>] do_notify_resume+0xf4/0x8b0
>>>> [<ffffffff8100b281>] int_signal+0x12/0x17
>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> This looks strange because this is just an exit part which shouldn't
>>> deadlock or anything. Is this stack stable? Have you tried to take check
>>> it more times?
>
> Looking at information.txt, I found something interesting
>
> rt_rq[0]:/1314
> .rt_nr_running : 1
> .rt_throttled : 1
> .rt_time : 0.856656
> .rt_runtime : 0.000000
>
>
> cfs_rq[0]:/1314
> .exec_clock : 8738.133429
> .MIN_vruntime : 0.000001
> .min_vruntime : 8739.371271
> .max_vruntime : 0.000001
> .spread : 0.000000
> .spread0 : -9792.255554
> .nr_spread_over : 1
> .nr_running : 0
> .load : 0
> .load_avg : 7376.722880
> .load_period : 7.203830
> .load_contrib : 1023
> .load_tg : 1023
> .se->exec_start : 282004.715064
> .se->vruntime : 18435.664560
> .se->sum_exec_runtime : 8738.133429
> .se->wait_start : 0.000000
> .se->sleep_start : 0.000000
> .se->block_start : 0.000000
> .se->sleep_max : 0.000000
> .se->block_max : 0.000000
> .se->exec_max : 77.977054
> .se->slice_max : 0.000000
> .se->wait_max : 2.664779
> .se->wait_sum : 29.970575
> .se->wait_count : 102
> .se->load.weight : 2
>
> So 1314 is a real time process and
>
> cpu.rt_period_us:
> 1000000
> ----------------------
> cpu.rt_runtime_us:
> 0
>
> When did tt move to being a Real Time process (hint: see nr_running
> and nr_throttled)?
>
> Balbir
On Tue 23-10-12 11:35:52, Qiang Gao wrote:
> I'm sure this is a global-oom,not cgroup-oom. [the dmesg output in the end]
Yes this is the global oom killer because:
> cglimit -M 700M ./tt
> then after global-oom,the process hangs..
> 179184 pages RAM
So you have ~700M of RAM so the memcg limit is basically pointless as it
cannot be reached...
--
Michal Hocko
SUSE Labs
On 10/23/2012 11:35 AM, Qiang Gao wrote:
> information about the system is in the attach file "information.txt"
>
> I can not reproduce it in the upstream 3.6.0 kernel..
>
> On Sat, Oct 20, 2012 at 12:04 AM, Michal Hocko<[email protected]> wrote:
>> On Wed 17-10-12 18:23:34, gaoqiang wrote:
>>> I looked up nothing useful with google,so I'm here for help..
>>>
>>> when this happens: I use memcg to limit the memory use of a
>>> process,and when the memcg cgroup was out of memory,
>>> the process was oom-killed however,it cannot really complete the
>>> exiting. here is the some information
>> How many tasks are in the group and what kind of memory do they use?
>> Is it possible that you were hit by the same issue as described in
>> 79dfdacc memcg: make oom_lock 0 and 1 based rather than counter.
>>
>>> OS version: centos6.2 2.6.32.220.7.1
>> Your kernel is quite old and you should be probably asking your
>> distribution to help you out. There were many fixes since 2.6.32.
>> Are you able to reproduce the same issue with the current vanila kernel?
>>
>>> /proc/pid/stack
>>> ---------------------------------------------------------------
>>>
>>> [<ffffffff810597ca>] __cond_resched+0x2a/0x40
>>> [<ffffffff81121569>] unmap_vmas+0xb49/0xb70
>>> [<ffffffff8112822e>] exit_mmap+0x7e/0x140
>>> [<ffffffff8105b078>] mmput+0x58/0x110
>>> [<ffffffff81061aad>] exit_mm+0x11d/0x160
>>> [<ffffffff81061c9d>] do_exit+0x1ad/0x860
>>> [<ffffffff81062391>] do_group_exit+0x41/0xb0
>>> [<ffffffff81077cd8>] get_signal_to_deliver+0x1e8/0x430
>>> [<ffffffff8100a4c4>] do_notify_resume+0xf4/0x8b0
>>> [<ffffffff8100b281>] int_signal+0x12/0x17
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>> This looks strange because this is just an exit part which shouldn't
>> deadlock or anything. Is this stack stable? Have you tried to take check
>> it more times?
>>
Does the machine only have about 700M of memory? I also found something
in the log file:
Node 0 DMA free:2772kB min:72kB low:88kB high:108kB present:15312kB..
lowmem_reserve[]: 0 674 674 674
Node 0 DMA32 free:*3172kB* min:3284kB low:4104kB high:4924kB present:690712kB ..
lowmem_reserve[]: 0 0 0 0
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
179184 pages RAM ==> 179184 * 4 / 1024 = *700M*
6773 pages reserved
Note that the free memory of DMA32 (3172kB) is lower than the min watermark,
which means the whole system is under memory pressure now. What's more, swap
is off, so the global OOM is normal behavior.
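(As a sketch, the same comparison can be made on a live system from
/proc/zoneinfo, which reports the per-zone free pages and min/low/high
watermarks in units of pages, 4kB each here:)

  # rough filter: zone names, free page counts and watermarks
  grep -E 'zone|pages free|min|low|high' /proc/zoneinfo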
Thanks,
Sha
This is just an example to show how to reproduce it. Actually, the first time
I saw this situation was on a machine with 288G of RAM, with many tasks running
and each limited to 30G. But in the end no task exceeded its limit, yet the
system went OOM.
On Tue, Oct 23, 2012 at 4:35 PM, Michal Hocko <[email protected]> wrote:
> On Tue 23-10-12 11:35:52, Qiang Gao wrote:
>> I'm sure this is a global-oom,not cgroup-oom. [the dmesg output in the end]
>
> Yes this is the global oom killer because:
>> cglimit -M 700M ./tt
>> then after global-oom,the process hangs..
>
>> 179184 pages RAM
>
> So you have ~700M of RAM so the memcg limit is basically pointless as it
> cannot be reached...
> --
> Michal Hocko
> SUSE Labs
Global OOM is the right thing to do, but the OOM-killed process hanging in
do_exit is not normal behavior.
On Tue, Oct 23, 2012 at 5:01 PM, Sha Zhengju <[email protected]> wrote:
> On 10/23/2012 11:35 AM, Qiang Gao wrote:
>>
>> information about the system is in the attach file "information.txt"
>>
>> I can not reproduce it in the upstream 3.6.0 kernel..
>>
>> On Sat, Oct 20, 2012 at 12:04 AM, Michal Hocko<[email protected]> wrote:
>>>
>>> On Wed 17-10-12 18:23:34, gaoqiang wrote:
>>>>
>>>> I looked up nothing useful with google,so I'm here for help..
>>>>
>>>> when this happens: I use memcg to limit the memory use of a
>>>> process,and when the memcg cgroup was out of memory,
>>>> the process was oom-killed however,it cannot really complete the
>>>> exiting. here is the some information
>>>
>>> How many tasks are in the group and what kind of memory do they use?
>>> Is it possible that you were hit by the same issue as described in
>>> 79dfdacc memcg: make oom_lock 0 and 1 based rather than counter.
>>>
>>>> OS version: centos6.2 2.6.32.220.7.1
>>>
>>> Your kernel is quite old and you should be probably asking your
>>> distribution to help you out. There were many fixes since 2.6.32.
>>> Are you able to reproduce the same issue with the current vanila kernel?
>>>
>>>> /proc/pid/stack
>>>> ---------------------------------------------------------------
>>>>
>>>> [<ffffffff810597ca>] __cond_resched+0x2a/0x40
>>>> [<ffffffff81121569>] unmap_vmas+0xb49/0xb70
>>>> [<ffffffff8112822e>] exit_mmap+0x7e/0x140
>>>> [<ffffffff8105b078>] mmput+0x58/0x110
>>>> [<ffffffff81061aad>] exit_mm+0x11d/0x160
>>>> [<ffffffff81061c9d>] do_exit+0x1ad/0x860
>>>> [<ffffffff81062391>] do_group_exit+0x41/0xb0
>>>> [<ffffffff81077cd8>] get_signal_to_deliver+0x1e8/0x430
>>>> [<ffffffff8100a4c4>] do_notify_resume+0xf4/0x8b0
>>>> [<ffffffff8100b281>] int_signal+0x12/0x17
>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> This looks strange because this is just an exit part which shouldn't
>>> deadlock or anything. Is this stack stable? Have you tried to take check
>>> it more times?
>>>
>
> Does the machine only have about 700M memory? I also find something
> in the log file:
>
> Node 0 DMA free:2772kB min:72kB low:88kB high:108kB present:15312kB..
> lowmem_reserve[]: 0 674 674 674
> Node 0 DMA32 free:*3172kB* min:3284kB low:4104kB high:4924kB
> present:690712kB ..
> lowmem_reserve[]: 0 0 0 0
> 0 pages in swap cache
> Swap cache stats: add 0, delete 0, find 0/0
> Free swap = 0kB
> Total swap = 0kB
> 179184 pages RAM ==> 179184 * 4 / 1024 = *700M*
> 6773 pages reserved
>
>
> Note that the free memory of DMA32(3172KB) is lower than min watermark,
> which means the global is under pressure now. What's more the swap is off,
> so the global oom is normal behavior.
>
>
> Thanks,
> Sha
On Tue 23-10-12 17:08:40, Qiang Gao wrote:
> this is just an example to show how to reproduce. actually,the first time I saw
> this situation was on a machine with 288G RAM with many tasks running and
> we limit 30G for each. but finanlly, no one exceeds this limit the the system
> oom.
Yes, but mentioning the memory controller then might be misleading... It
seems that the only factor in your load is the cpu controller.
And please stop top-posting. It makes the discussion messy.
--
Michal Hocko
SUSE Labs
On Tue 23-10-12 15:18:48, Qiang Gao wrote:
> This process was moved to RT-priority queue when global oom-killer
> happened to boost the recovery of the system..
Who did that? oom killer doesn't boost the priority (scheduling class)
AFAIK.
> but it wasn't get properily dealt with. I still have no idea why where
> the problem is ..
Well your configuration says that there is no runtime reserved for the
group.
Please refer to Documentation/scheduler/sched-rt-group.txt for more
information.
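(As a sketch of what reserving runtime means in practice, using the mount
point from earlier in the thread; the 950000us value is only an example and
has to fit within the parent group's allocation:)

  # the period defaults to 1s; give the group most of it as RT budget
  cat /cgroup/cpu/worker/cpu.rt_period_us       # 1000000
  echo 950000 > /cgroup/cpu/worker/cpu.rt_runtime_us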
> On Tue, Oct 23, 2012 at 12:40 PM, Balbir Singh <[email protected]> wrote:
> > On Tue, Oct 23, 2012 at 9:05 AM, Qiang Gao <[email protected]> wrote:
> >> information about the system is in the attach file "information.txt"
> >>
> >> I can not reproduce it in the upstream 3.6.0 kernel..
> >>
> >> On Sat, Oct 20, 2012 at 12:04 AM, Michal Hocko <[email protected]> wrote:
> >>> On Wed 17-10-12 18:23:34, gaoqiang wrote:
> >>>> I looked up nothing useful with google,so I'm here for help..
> >>>>
> >>>> when this happens: I use memcg to limit the memory use of a
> >>>> process,and when the memcg cgroup was out of memory,
> >>>> the process was oom-killed however,it cannot really complete the
> >>>> exiting. here is the some information
> >>>
> >>> How many tasks are in the group and what kind of memory do they use?
> >>> Is it possible that you were hit by the same issue as described in
> >>> 79dfdacc memcg: make oom_lock 0 and 1 based rather than counter.
> >>>
> >>>> OS version: centos6.2 2.6.32.220.7.1
> >>>
> >>> Your kernel is quite old and you should be probably asking your
> >>> distribution to help you out. There were many fixes since 2.6.32.
> >>> Are you able to reproduce the same issue with the current vanila kernel?
> >>>
> >>>> /proc/pid/stack
> >>>> ---------------------------------------------------------------
> >>>>
> >>>> [<ffffffff810597ca>] __cond_resched+0x2a/0x40
> >>>> [<ffffffff81121569>] unmap_vmas+0xb49/0xb70
> >>>> [<ffffffff8112822e>] exit_mmap+0x7e/0x140
> >>>> [<ffffffff8105b078>] mmput+0x58/0x110
> >>>> [<ffffffff81061aad>] exit_mm+0x11d/0x160
> >>>> [<ffffffff81061c9d>] do_exit+0x1ad/0x860
> >>>> [<ffffffff81062391>] do_group_exit+0x41/0xb0
> >>>> [<ffffffff81077cd8>] get_signal_to_deliver+0x1e8/0x430
> >>>> [<ffffffff8100a4c4>] do_notify_resume+0xf4/0x8b0
> >>>> [<ffffffff8100b281>] int_signal+0x12/0x17
> >>>> [<ffffffffffffffff>] 0xffffffffffffffff
> >>>
> >>> This looks strange because this is just an exit part which shouldn't
> >>> deadlock or anything. Is this stack stable? Have you tried to take check
> >>> it more times?
> >
> > Looking at information.txt, I found something interesting
> >
> > rt_rq[0]:/1314
> > .rt_nr_running : 1
> > .rt_throttled : 1
> > .rt_time : 0.856656
> > .rt_runtime : 0.000000
> >
> >
> > cfs_rq[0]:/1314
> > .exec_clock : 8738.133429
> > .MIN_vruntime : 0.000001
> > .min_vruntime : 8739.371271
> > .max_vruntime : 0.000001
> > .spread : 0.000000
> > .spread0 : -9792.255554
> > .nr_spread_over : 1
> > .nr_running : 0
> > .load : 0
> > .load_avg : 7376.722880
> > .load_period : 7.203830
> > .load_contrib : 1023
> > .load_tg : 1023
> > .se->exec_start : 282004.715064
> > .se->vruntime : 18435.664560
> > .se->sum_exec_runtime : 8738.133429
> > .se->wait_start : 0.000000
> > .se->sleep_start : 0.000000
> > .se->block_start : 0.000000
> > .se->sleep_max : 0.000000
> > .se->block_max : 0.000000
> > .se->exec_max : 77.977054
> > .se->slice_max : 0.000000
> > .se->wait_max : 2.664779
> > .se->wait_sum : 29.970575
> > .se->wait_count : 102
> > .se->load.weight : 2
> >
> > So 1314 is a real time process and
> >
> > cpu.rt_period_us:
> > 1000000
> > ----------------------
> > cpu.rt_runtime_us:
> > 0
> >
> > When did tt move to being a Real Time process (hint: see nr_running
> > and nr_throttled)?
> >
> > Balbir
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Michal Hocko
SUSE Labs
On Tue, Oct 23, 2012 at 5:50 PM, Michal Hocko <[email protected]> wrote:
> On Tue 23-10-12 15:18:48, Qiang Gao wrote:
>> This process was moved to RT-priority queue when global oom-killer
>> happened to boost the recovery of the system..
>
> Who did that? oom killer doesn't boost the priority (scheduling class)
> AFAIK.
>
>> but it wasn't get properily dealt with. I still have no idea why where
>> the problem is ..
>
> Well your configuration says that there is no runtime reserved for the
> group.
> Please refer to Documentation/scheduler/sched-rt-group.txt for more
> information.
>
>> On Tue, Oct 23, 2012 at 12:40 PM, Balbir Singh <[email protected]> wrote:
>> > On Tue, Oct 23, 2012 at 9:05 AM, Qiang Gao <[email protected]> wrote:
>> >> information about the system is in the attach file "information.txt"
>> >>
>> >> I can not reproduce it in the upstream 3.6.0 kernel..
>> >>
>> >> On Sat, Oct 20, 2012 at 12:04 AM, Michal Hocko <[email protected]> wrote:
>> >>> On Wed 17-10-12 18:23:34, gaoqiang wrote:
>> >>>> I looked up nothing useful with google,so I'm here for help..
>> >>>>
>> >>>> when this happens: I use memcg to limit the memory use of a
>> >>>> process,and when the memcg cgroup was out of memory,
>> >>>> the process was oom-killed however,it cannot really complete the
>> >>>> exiting. here is the some information
>> >>>
>> >>> How many tasks are in the group and what kind of memory do they use?
>> >>> Is it possible that you were hit by the same issue as described in
>> >>> 79dfdacc memcg: make oom_lock 0 and 1 based rather than counter.
>> >>>
>> >>>> OS version: centos6.2 2.6.32.220.7.1
>> >>>
>> >>> Your kernel is quite old and you should be probably asking your
>> >>> distribution to help you out. There were many fixes since 2.6.32.
>> >>> Are you able to reproduce the same issue with the current vanila kernel?
>> >>>
>> >>>> /proc/pid/stack
>> >>>> ---------------------------------------------------------------
>> >>>>
>> >>>> [<ffffffff810597ca>] __cond_resched+0x2a/0x40
>> >>>> [<ffffffff81121569>] unmap_vmas+0xb49/0xb70
>> >>>> [<ffffffff8112822e>] exit_mmap+0x7e/0x140
>> >>>> [<ffffffff8105b078>] mmput+0x58/0x110
>> >>>> [<ffffffff81061aad>] exit_mm+0x11d/0x160
>> >>>> [<ffffffff81061c9d>] do_exit+0x1ad/0x860
>> >>>> [<ffffffff81062391>] do_group_exit+0x41/0xb0
>> >>>> [<ffffffff81077cd8>] get_signal_to_deliver+0x1e8/0x430
>> >>>> [<ffffffff8100a4c4>] do_notify_resume+0xf4/0x8b0
>> >>>> [<ffffffff8100b281>] int_signal+0x12/0x17
>> >>>> [<ffffffffffffffff>] 0xffffffffffffffff
>> >>>
>> >>> This looks strange because this is just an exit part which shouldn't
>> >>> deadlock or anything. Is this stack stable? Have you tried to take check
>> >>> it more times?
>> >
>> > Looking at information.txt, I found something interesting
>> >
>> > rt_rq[0]:/1314
>> > .rt_nr_running : 1
>> > .rt_throttled : 1
>> > .rt_time : 0.856656
>> > .rt_runtime : 0.000000
>> >
>> >
>> > cfs_rq[0]:/1314
>> > .exec_clock : 8738.133429
>> > .MIN_vruntime : 0.000001
>> > .min_vruntime : 8739.371271
>> > .max_vruntime : 0.000001
>> > .spread : 0.000000
>> > .spread0 : -9792.255554
>> > .nr_spread_over : 1
>> > .nr_running : 0
>> > .load : 0
>> > .load_avg : 7376.722880
>> > .load_period : 7.203830
>> > .load_contrib : 1023
>> > .load_tg : 1023
>> > .se->exec_start : 282004.715064
>> > .se->vruntime : 18435.664560
>> > .se->sum_exec_runtime : 8738.133429
>> > .se->wait_start : 0.000000
>> > .se->sleep_start : 0.000000
>> > .se->block_start : 0.000000
>> > .se->sleep_max : 0.000000
>> > .se->block_max : 0.000000
>> > .se->exec_max : 77.977054
>> > .se->slice_max : 0.000000
>> > .se->wait_max : 2.664779
>> > .se->wait_sum : 29.970575
>> > .se->wait_count : 102
>> > .se->load.weight : 2
>> >
>> > So 1314 is a real time process and
>> >
>> > cpu.rt_period_us:
>> > 1000000
>> > ----------------------
>> > cpu.rt_runtime_us:
>> > 0
>> >
>> > When did tt move to being a Real Time process (hint: see nr_running
>> > and nr_throttled)?
>> >
>> > Balbir
>> --
>> To unsubscribe from this list: send the line "unsubscribe cgroups" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
Maybe this is not an upstream-kernel bug. The CentOS/Red Hat kernel
boosts the process to RT priority when the process is selected
by the oom-killer.
I think I should report this to Red Hat/CentOS. Thanks for your attention.
On Tue 23-10-12 18:10:33, Qiang Gao wrote:
> On Tue, Oct 23, 2012 at 5:50 PM, Michal Hocko <[email protected]> wrote:
> > On Tue 23-10-12 15:18:48, Qiang Gao wrote:
> >> This process was moved to RT-priority queue when global oom-killer
> >> happened to boost the recovery of the system..
> >
> > Who did that? oom killer doesn't boost the priority (scheduling class)
> > AFAIK.
> >
> >> but it wasn't get properily dealt with. I still have no idea why where
> >> the problem is ..
> >
> > Well your configuration says that there is no runtime reserved for the
> > group.
> > Please refer to Documentation/scheduler/sched-rt-group.txt for more
> > information.
> >
[...]
> maybe this is not a upstream-kernel bug. the centos/redhat kernel
> would boost the process to RT prio when the process was selected
> by oom-killer.
This still looks like your cpu controller is misconfigured. Even if the
task is promoted to be realtime.
--
Michal Hocko
SUSE Labs
On Tue, Oct 23, 2012 at 3:45 PM, Michal Hocko <[email protected]> wrote:
> On Tue 23-10-12 18:10:33, Qiang Gao wrote:
>> On Tue, Oct 23, 2012 at 5:50 PM, Michal Hocko <[email protected]> wrote:
>> > On Tue 23-10-12 15:18:48, Qiang Gao wrote:
>> >> This process was moved to RT-priority queue when global oom-killer
>> >> happened to boost the recovery of the system..
>> >
>> > Who did that? oom killer doesn't boost the priority (scheduling class)
>> > AFAIK.
>> >
>> >> but it wasn't get properily dealt with. I still have no idea why where
>> >> the problem is ..
>> >
>> > Well your configuration says that there is no runtime reserved for the
>> > group.
>> > Please refer to Documentation/scheduler/sched-rt-group.txt for more
>> > information.
>> >
> [...]
>> maybe this is not a upstream-kernel bug. the centos/redhat kernel
>> would boost the process to RT prio when the process was selected
>> by oom-killer.
>
> This still looks like your cpu controller is misconfigured. Even if the
> task is promoted to be realtime.
Precisely! You need to have RT bandwidth enabled for RT tasks to run.
As a workaround, please give the groups some RT bandwidth, and then work
out the migration to RT and what the defaults should be on the distro.
Balbir
On Wed, Oct 24, 2012 at 1:43 AM, Balbir Singh <[email protected]> wrote:
> On Tue, Oct 23, 2012 at 3:45 PM, Michal Hocko <[email protected]> wrote:
>> On Tue 23-10-12 18:10:33, Qiang Gao wrote:
>>> On Tue, Oct 23, 2012 at 5:50 PM, Michal Hocko <[email protected]> wrote:
>>> > On Tue 23-10-12 15:18:48, Qiang Gao wrote:
>>> >> This process was moved to RT-priority queue when global oom-killer
>>> >> happened to boost the recovery of the system..
>>> >
>>> > Who did that? oom killer doesn't boost the priority (scheduling class)
>>> > AFAIK.
>>> >
>>> >> but it wasn't get properily dealt with. I still have no idea why where
>>> >> the problem is ..
>>> >
>>> > Well your configuration says that there is no runtime reserved for the
>>> > group.
>>> > Please refer to Documentation/scheduler/sched-rt-group.txt for more
>>> > information.
>>> >
>> [...]
>>> maybe this is not a upstream-kernel bug. the centos/redhat kernel
>>> would boost the process to RT prio when the process was selected
>>> by oom-killer.
>>
>> This still looks like your cpu controller is misconfigured. Even if the
>> task is promoted to be realtime.
>
>
> Precisely! You need to have rt bandwidth enabled for RT tasks to run,
> as a workaround please give the groups some RT bandwidth and then work
> out the migration to RT and what should be the defaults on the distro.
>
> Balbir
see https://patchwork.kernel.org/patch/719411/
On Wed 24-10-12 11:44:17, Qiang Gao wrote:
> On Wed, Oct 24, 2012 at 1:43 AM, Balbir Singh <[email protected]> wrote:
> > On Tue, Oct 23, 2012 at 3:45 PM, Michal Hocko <[email protected]> wrote:
> >> On Tue 23-10-12 18:10:33, Qiang Gao wrote:
> >>> On Tue, Oct 23, 2012 at 5:50 PM, Michal Hocko <[email protected]> wrote:
> >>> > On Tue 23-10-12 15:18:48, Qiang Gao wrote:
> >>> >> This process was moved to RT-priority queue when global oom-killer
> >>> >> happened to boost the recovery of the system..
> >>> >
> >>> > Who did that? oom killer doesn't boost the priority (scheduling class)
> >>> > AFAIK.
> >>> >
> >>> >> but it wasn't get properily dealt with. I still have no idea why where
> >>> >> the problem is ..
> >>> >
> >>> > Well your configuration says that there is no runtime reserved for the
> >>> > group.
> >>> > Please refer to Documentation/scheduler/sched-rt-group.txt for more
> >>> > information.
> >>> >
> >> [...]
> >>> maybe this is not a upstream-kernel bug. the centos/redhat kernel
> >>> would boost the process to RT prio when the process was selected
> >>> by oom-killer.
> >>
> >> This still looks like your cpu controller is misconfigured. Even if the
> >> task is promoted to be realtime.
> >
> >
> > Precisely! You need to have rt bandwidth enabled for RT tasks to run,
> > as a workaround please give the groups some RT bandwidth and then work
> > out the migration to RT and what should be the defaults on the distro.
> >
> > Balbir
>
>
> see https://patchwork.kernel.org/patch/719411/
The patch surely "fixes" your problem but the primary fault here is the
mis-configured cpu cgroup. If the value for the bandwidth is zero by
default then all realtime processes in the group a screwed. The value
should be set to something more reasonable.
I am not familiar with the cpu controller but it seems that
alloc_rt_sched_group needs some treat. Care to look into it and send a
patch to the cpu controller and cgroup maintainers, please?
--
Michal Hocko
SUSE Labs
On Thu, Oct 25, 2012 at 5:57 PM, Michal Hocko <[email protected]> wrote:
> On Wed 24-10-12 11:44:17, Qiang Gao wrote:
>> On Wed, Oct 24, 2012 at 1:43 AM, Balbir Singh <[email protected]> wrote:
>> > On Tue, Oct 23, 2012 at 3:45 PM, Michal Hocko <[email protected]> wrote:
>> >> On Tue 23-10-12 18:10:33, Qiang Gao wrote:
>> >>> On Tue, Oct 23, 2012 at 5:50 PM, Michal Hocko <[email protected]> wrote:
>> >>> > On Tue 23-10-12 15:18:48, Qiang Gao wrote:
>> >>> >> This process was moved to RT-priority queue when global oom-killer
>> >>> >> happened to boost the recovery of the system..
>> >>> >
>> >>> > Who did that? oom killer doesn't boost the priority (scheduling class)
>> >>> > AFAIK.
>> >>> >
>> >>> >> but it wasn't get properily dealt with. I still have no idea why where
>> >>> >> the problem is ..
>> >>> >
>> >>> > Well your configuration says that there is no runtime reserved for the
>> >>> > group.
>> >>> > Please refer to Documentation/scheduler/sched-rt-group.txt for more
>> >>> > information.
>> >>> >
>> >> [...]
>> >>> maybe this is not a upstream-kernel bug. the centos/redhat kernel
>> >>> would boost the process to RT prio when the process was selected
>> >>> by oom-killer.
>> >>
>> >> This still looks like your cpu controller is misconfigured. Even if the
>> >> task is promoted to be realtime.
>> >
>> >
>> > Precisely! You need to have rt bandwidth enabled for RT tasks to run,
>> > as a workaround please give the groups some RT bandwidth and then work
>> > out the migration to RT and what should be the defaults on the distro.
>> >
>> > Balbir
>>
>>
>> see https://patchwork.kernel.org/patch/719411/
>
> The patch surely "fixes" your problem but the primary fault here is the
> mis-configured cpu cgroup. If the value for the bandwidth is zero by
> default then all realtime processes in the group a screwed. The value
> should be set to something more reasonable.
> I am not familiar with the cpu controller but it seems that
> alloc_rt_sched_group needs some treat. Care to look into it and send a
> patch to the cpu controller and cgroup maintainers, please?
>
> --
> Michal Hocko
> SUSE Labs
I'm trying to fix the problem, but there is no substantive progress yet.
On Fri, 2012-10-26 at 10:42 +0800, Qiang Gao wrote:
> On Thu, Oct 25, 2012 at 5:57 PM, Michal Hocko <[email protected]> wrote:
> > On Wed 24-10-12 11:44:17, Qiang Gao wrote:
> >> On Wed, Oct 24, 2012 at 1:43 AM, Balbir Singh <[email protected]> wrote:
> >> > On Tue, Oct 23, 2012 at 3:45 PM, Michal Hocko <[email protected]> wrote:
> >> >> On Tue 23-10-12 18:10:33, Qiang Gao wrote:
> >> >>> On Tue, Oct 23, 2012 at 5:50 PM, Michal Hocko <[email protected]> wrote:
> >> >>> > On Tue 23-10-12 15:18:48, Qiang Gao wrote:
> >> >>> >> This process was moved to RT-priority queue when global oom-killer
> >> >>> >> happened to boost the recovery of the system..
> >> >>> >
> >> >>> > Who did that? oom killer doesn't boost the priority (scheduling class)
> >> >>> > AFAIK.
> >> >>> >
> >> >>> >> but it wasn't get properily dealt with. I still have no idea why where
> >> >>> >> the problem is ..
> >> >>> >
> >> >>> > Well your configuration says that there is no runtime reserved for the
> >> >>> > group.
> >> >>> > Please refer to Documentation/scheduler/sched-rt-group.txt for more
> >> >>> > information.
> >> >>> >
> >> >> [...]
> >> >>> maybe this is not a upstream-kernel bug. the centos/redhat kernel
> >> >>> would boost the process to RT prio when the process was selected
> >> >>> by oom-killer.
> >> >>
> >> >> This still looks like your cpu controller is misconfigured. Even if the
> >> >> task is promoted to be realtime.
> >> >
> >> >
> >> > Precisely! You need to have rt bandwidth enabled for RT tasks to run,
> >> > as a workaround please give the groups some RT bandwidth and then work
> >> > out the migration to RT and what should be the defaults on the distro.
> >> >
> >> > Balbir
> >>
> >>
> >> see https://patchwork.kernel.org/patch/719411/
> >
> > The patch surely "fixes" your problem but the primary fault here is the
> > mis-configured cpu cgroup. If the value for the bandwidth is zero by
> > default then all realtime processes in the group a screwed. The value
> > should be set to something more reasonable.
> > I am not familiar with the cpu controller but it seems that
> > alloc_rt_sched_group needs some treat. Care to look into it and send a
> > patch to the cpu controller and cgroup maintainers, please?
> >
> > --
> > Michal Hocko
> > SUSE Labs
>
> I'm trying to fix the problem. but no substantive progress yet.
The throttle tracks a finite resource for an arbitrary number of groups,
so there's no sane rt_runtime default other than zero.
Most folks only want the top-level throttle warm fuzzy, so a complete
runtime RT_GROUP_SCHED on/off switch defaulting to off, i.e. rt tasks
cannot be moved until it is switched on, would fix some annoying "Oopsie, I
forgot" allocation troubles. If you turn it on, shame on you if you
fail to allocate; you asked for it, and you're not just stuck with it
because your distro enabled it in their config.
Or, perhaps just make zero rt_runtime always mean traverse up to the first
non-zero rt_runtime, i.e. zero-allocation children may consume parental
runtime as they see fit on a first-come-first-served basis; when it's
gone, tough, parent/children all wait for a refill.
Or whatever, as long as you don't bust distribution/tracking for those
crazy people who intentionally use RT_GROUP_SCHED ;-)
The bug is in the patch that used sched_setscheduler_nocheck(). Plain
sched_setscheduler() would have replied -EGOAWAY.
-Mike
On Fri, 2012-10-26 at 10:03 -0700, Mike Galbraith wrote:
> The bug is in the patch that used sched_setscheduler_nocheck(). Plain
> sched_setscheduler() would have replied -EGOAWAY.
sched_setscheduler_nocheck() should say go away too methinks. This
isn't about permissions, it's about not being stupid in general.
sched: fix __sched_setscheduler() RT_GROUP_SCHED conditionals
Remove user and rt_bandwidth_enabled() RT_GROUP_SCHED conditionals in
__sched_setscheduler(). The end result of kernel OR user promoting a
task in a group with zero rt_runtime allocated is the same bad thing,
and the throttle switch position matters little. It's safer to just say
no based solely upon bandwidth existence; it may save the user a nasty
surprise if he later flips the throttle switch to 'on'.
The commit below came about due to sched_setscheduler_nocheck()
allowing a task in a task group with zero rt_runtime allocated to
be promoted by the kernel oom logic, thus marooning it forever.
<quote>
commit 341aea2bc48bf652777fb015cc2b3dfa9a451817
Author: KOSAKI Motohiro <[email protected]>
Date: Thu Apr 14 15:22:13 2011 -0700
oom-kill: remove boost_dying_task_prio()
This is an almost-revert of commit 93b43fa ("oom: give the dying task a
higher priority").
That commit dramatically improved oom killer logic when a fork-bomb
occurs. But I've found that it has nasty corner case. Now cpu cgroup has
strange default RT runtime. It's 0! That said, if a process under cpu
cgroup promote RT scheduling class, the process never run at all.
</quote>
Signed-off-by: Mike Galbraith <[email protected]>
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..d3a35f8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3810,17 +3810,14 @@ recheck:
}
#ifdef CONFIG_RT_GROUP_SCHED
- if (user) {
- /*
- * Do not allow realtime tasks into groups that have no runtime
- * assigned.
- */
- if (rt_bandwidth_enabled() && rt_policy(policy) &&
- task_group(p)->rt_bandwidth.rt_runtime == 0 &&
- !task_group_is_autogroup(task_group(p))) {
- task_rq_unlock(rq, p, &flags);
- return -EPERM;
- }
+ /*
+ * Do not allow realtime tasks into groups that have no runtime
+ * assigned.
+ */
+ if (rt_policy(policy) && task_group(p)->rt_bandwidth.rt_runtime == 0 &&
+ !task_group_is_autogroup(task_group(p))) {
+ task_rq_unlock(rq, p, &flags);
+ return -EPERM;
}
#endif