Subject: Cpu-Hotplug and Real-Time

Hi,

While running a cpu-hotplug test in which a high-priority
process (SCHED_RR, prio=94) periodically offlines and
onlines cpu1 on a 2-processor machine, I noticed that the system
became unresponsive after a few iterations.

However, when the same test was repeated on a machine with
more than 2 processors, it worked fine.
Also, if the hotplugging process was not of rt-prio, it
worked fine on a 2-processor machine.

After some debugging, I saw that the hang occurred because
the high-prio process was stuck in a loop doing yield() inside
wait_task_inactive(). Description follows:

Say a high-prio task (A) does a kthread_create(B),
followed by a kthread_bind(B, cpu1). At this moment,
only cpu0 is online.

Now, immediately after being created, B would
do a
complete(&create->started) [kernel/kthread.c: kthread()],
before scheduling itself out.

This complete() will wake up kthreadd, which had spawned B.
It is possible that during the wakeup, kthreadd might preempt B.
Thus, B is still on the runqueue, and has not yet called schedule().

kthreadd will in turn do a
complete(&create->done); [kernel/kthread.c: create_kthread()]
which will wake up the thread which had called kthread_create().
In our case it's task A, which will run immediately, since its priority
is higher.

A will now call kthread_bind(B, cpu1).
kthread_bind() calls wait_task_inactive(B) to ensure that
B has scheduled itself out.

B is still on the runqueue, so A calls yield() in wait_task_inactive().
But since A is the highest-priority runnable task, the scheduler
immediately picks it again.

Thus B never gets to run to schedule itself out.
A loops waiting for B to schedule out, leading to a system hang.

In my case,
A was the high priority process trying to bring up cpu1, and
thus doing a kthread_create/kthread_bind in
migration_call(): CPU_UP_PREPARE.

B was the migration thread for cpu1.

And the above problem occurs when only one cpu is online.
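
The sequence above can be reproduced in a toy model. This is only a
sketch, not kernel code: struct toy_task, pick_next() and
yield_loop_lets_b_run() are names made up for illustration, and the model
assumes a single CPU, a strict-priority pick, and the kernel-internal
convention that a numerically smaller prio means a higher priority.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy task: NOT the kernel's task_struct. Smaller prio = higher priority. */
struct toy_task {
	const char *name;
	int prio;
	bool on_rq;	/* true while the task is runnable */
};

/* Strict-priority pick: return the runnable task with the smallest prio. */
static struct toy_task *pick_next(struct toy_task *tasks, size_t n)
{
	struct toy_task *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!tasks[i].on_rq)
			continue;
		if (best == NULL || tasks[i].prio < best->prio)
			best = &tasks[i];
	}
	return best;
}

/*
 * Model A spinning in wait_task_inactive(B): A yields, the scheduler picks
 * again, and we check whether B ever gets the CPU within max_rounds.
 * Returns true if B was picked at least once.
 */
static bool yield_loop_lets_b_run(int prio_a, int prio_b, int max_rounds)
{
	struct toy_task tasks[2] = {
		{ "A", prio_a, true },
		{ "B", prio_b, true },	/* preempted before schedule() */
	};

	assert(max_rounds > 0);
	for (int round = 0; round < max_rounds; round++) {
		/* yield() keeps A runnable, so A competes again right away. */
		if (pick_next(tasks, 2) == &tasks[1])
			return true;
	}
	return false;
}
```

With illustrative values such as yield_loop_lets_b_run(5, 120, 1000), B is
never picked, however many rounds are allowed. Swapping the two priorities
lets B run on the first round.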

Possible solutions to this problem:
a) Let the newly spawned kernel threads inherit
their parent's prio and policy.

b) Instead of using yield() in wait_task_inactive(), we could use
something like a yield_to(p):

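void yield_to(struct task_struct *p)
{
	int old_prio = p->prio;

	/*
	 * Temporarily boost p's priority to at least that of the current
	 * task. In the kernel, a numerically lower ->prio means a higher
	 * priority, so p needs a boost when its value is the larger one.
	 */
	if (old_prio > current->prio)
		set_prio(p, current->prio);
	yield();
	/* Reset priority back to the original value */
	set_prio(p, old_prio);
}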
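
To make the intent of the boost concrete, here is a toy model of it. It is
a sketch under stated assumptions, not a proposal for the real
implementation: toy_task, toy_yield_runs() and toy_yield_to() are
hypothetical names, a smaller prio value means a higher priority, and
yield is assumed to put the caller behind runnable tasks of equal
priority only.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy task: NOT the kernel's task_struct. Smaller prio = higher priority. */
struct toy_task {
	int prio;
};

/*
 * Toy yield(): the caller steps behind runnable tasks of equal priority,
 * but never behind lower-priority ones. Returns true if 'other' gets the
 * CPU as a result.
 */
static bool toy_yield_runs(const struct toy_task *self,
			   const struct toy_task *other)
{
	return other->prio <= self->prio;
}

/* Toy yield_to(): boost p to the caller's priority for one yield. */
static bool toy_yield_to(const struct toy_task *self, struct toy_task *p)
{
	int old_prio = p->prio;
	bool ran;

	assert(self != p);
	if (old_prio > self->prio)
		p->prio = self->prio;	/* temporary boost */
	ran = toy_yield_runs(self, p);
	p->prio = old_prio;		/* always restore */
	return ran;
}
```

In this model a plain yield from a task at prio 5 never lets a task at
prio 120 run, while toy_yield_to() does, and the target's priority is back
at its original value afterwards.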


Thoughts?

Thanks and Regards
gautham.

--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"


2007-08-07 15:11:44

by Oleg Nesterov

Subject: Re: Cpu-Hotplug and Real-Time

On 08/07, Gautham R Shenoy wrote:
>
> After some debugging, I saw that the hang occured because
> the high prio process was stuck in a loop doing yield() inside
> wait_task_inactive(). Description follows:
>
> Say a high-prio task (A) does a kthread_create(B),
> followed by a kthread_bind(B, cpu1). At this moment,
> only cpu0 is online.
>
> Now, immediately after being created, B would
> do a
> complete(&create->started) [kernel/kthread.c: kthread()],
> before scheduling itself out.
>
> This complete() will wake up kthreadd, which had spawned B.
> It is possible that during the wakeup, kthreadd might preempt B.
> Thus, B is still on the runqueue, and not yet called schedule().
>
> kthreadd, will inturn do a
> complete(&create->done); [kernel/kthread.c: create_kthread()]
> which will wake up the thread which had called kthread_create().
> In our case it's task A, which will run immediately, since its priority
> is higher.
>
> A will now call kthread_bind(B, cpu1).
> kthread_bind(), calls wait_task_inactive(B), to ensures that
> B has scheduled itself out.
>
> B is still on the runqueue, so A calls yield() in wait_task_inactive().
> But since A is the task with the highest prio, scheduler schedules it
> back again.
>
> Thus B never gets to run to schedule itself out.
> A loops waiting for B to schedule out leading to system hang.

As for kthread_bind(), I think wait_task_inactive+set_task_cpu is just
an optimization, and easy to "fix":

--- kernel/kthread.c 2007-07-28 16:58:17.000000000 +0400
+++ /proc/self/fd/0 2007-08-07 18:56:54.248073547 +0400
@@ -166,10 +166,7 @@ void kthread_bind(struct task_struct *k,
 		WARN_ON(1);
 		return;
 	}
-	/* Must have done schedule() in kthread() before we set_task_cpu */
-	wait_task_inactive(k);
-	set_task_cpu(k, cpu);
-	k->cpus_allowed = cpumask_of_cpu(cpu);
+	set_cpus_allowed(current, cpumask_of_cpu(cpu));
 }
 EXPORT_SYMBOL(kthread_bind);

But I think we have another case. An RT ptracer can share the same CPU
with ptracee. The latter sets TASK_STOPPED, unlocks ->siglock, and takes
a preemption. Ptracer does ptrace_check_attach(), sees TASK_STOPPED, and
yields in wait_task_inactive.

Oleg.

2007-08-07 17:40:11

by Pallipadi, Venkatesh

Subject: Re: Cpu-Hotplug and Real-Time

On Tue, Aug 07, 2007 at 07:13:36PM +0400, Oleg Nesterov wrote:
> On 08/07, Gautham R Shenoy wrote:
> >
> > After some debugging, I saw that the hang occured because
> > the high prio process was stuck in a loop doing yield() inside
> > wait_task_inactive(). Description follows:
> >
> > Say a high-prio task (A) does a kthread_create(B),
> > followed by a kthread_bind(B, cpu1). At this moment,
> > only cpu0 is online.
> >
> > Now, immediately after being created, B would
> > do a
> > complete(&create->started) [kernel/kthread.c: kthread()],
> > before scheduling itself out.
> >
> > This complete() will wake up kthreadd, which had spawned B.
> > It is possible that during the wakeup, kthreadd might preempt B.
> > Thus, B is still on the runqueue, and not yet called schedule().
> >
> > kthreadd, will inturn do a
> > complete(&create->done); [kernel/kthread.c: create_kthread()]
> > which will wake up the thread which had called kthread_create().
> > In our case it's task A, which will run immediately, since its priority
> > is higher.
> >
> > A will now call kthread_bind(B, cpu1).
> > kthread_bind(), calls wait_task_inactive(B), to ensures that
> > B has scheduled itself out.
> >
> > B is still on the runqueue, so A calls yield() in wait_task_inactive().
> > But since A is the task with the highest prio, scheduler schedules it
> > back again.
> >
> > Thus B never gets to run to schedule itself out.
> > A loops waiting for B to schedule out leading to system hang.
>
> As for kthread_bind(), I think wait_task_inactive+set_task_cpu is just
> an optimization, and easy to "fix":
>
> --- kernel/kthread.c 2007-07-28 16:58:17.000000000 +0400
> +++ /proc/self/fd/0 2007-08-07 18:56:54.248073547 +0400
> @@ -166,10 +166,7 @@ void kthread_bind(struct task_struct *k,
> WARN_ON(1);
> return;
> }
> - /* Must have done schedule() in kthread() before we set_task_cpu */
> - wait_task_inactive(k);
> - set_task_cpu(k, cpu);
> - k->cpus_allowed = cpumask_of_cpu(cpu);
> + set_cpus_allowed(current, cpumask_of_cpu(cpu));
> }
> EXPORT_SYMBOL(kthread_bind);
>

Not sure whether set_cpus_allowed() will work here. It looks like it needs the
CPU to be online during the call, and in the kthread_bind() case the CPU may be
offline.

Thanks,
Venki

2007-08-07 18:36:16

by Oleg Nesterov

Subject: Re: Cpu-Hotplug and Real-Time

On 08/07, Venki Pallipadi wrote:
>
> On Tue, Aug 07, 2007 at 07:13:36PM +0400, Oleg Nesterov wrote:
> >
> > As for kthread_bind(), I think wait_task_inactive+set_task_cpu is just
> > an optimization, and easy to "fix":
> >
> > --- kernel/kthread.c 2007-07-28 16:58:17.000000000 +0400
> > +++ /proc/self/fd/0 2007-08-07 18:56:54.248073547 +0400
> > @@ -166,10 +166,7 @@ void kthread_bind(struct task_struct *k,
> > WARN_ON(1);
> > return;
> > }
> > - /* Must have done schedule() in kthread() before we set_task_cpu */
> > - wait_task_inactive(k);
> > - set_task_cpu(k, cpu);
> > - k->cpus_allowed = cpumask_of_cpu(cpu);
> > + set_cpus_allowed(current, cpumask_of_cpu(cpu));
> > }
> > EXPORT_SYMBOL(kthread_bind);
> >
>
> Not sure whether set_cpus_allowed() will work here. Looks like, it needs the
> CPU to be online during the call and in kthread_bind() case CPU may be offline.

Aah, you are right, of course.

Thanks,

Oleg.

2007-08-09 17:02:45

by Oleg Nesterov

Subject: rt ptracer can monopolize CPU (was: Cpu-Hotplug and Real-Time)

On 08/07, Oleg Nesterov wrote:
>
> On 08/07, Gautham R Shenoy wrote:
> >
> > A will now call kthread_bind(B, cpu1).
> > kthread_bind(), calls wait_task_inactive(B), to ensures that
> > B has scheduled itself out.
> >
> > B is still on the runqueue, so A calls yield() in wait_task_inactive().
> > But since A is the task with the highest prio, scheduler schedules it
> > back again.
> >
> > Thus B never gets to run to schedule itself out.
> > A loops waiting for B to schedule out leading to system hang.
>
> But I think we have another case. An RT ptracer can share the same CPU
> with ptracee. The latter sets TASK_STOPPED, unlocks ->siglock, and takes
> a preemption. Ptracer does ptrace_check_attach(), sees TASK_STOPPED, and
> yields in wait_task_inactive.

Even simpler.

#define _GNU_SOURCE	/* for sched_setaffinity() and the CPU_* macros */
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sched.h>

void die(const char *msg)
{
	printf("ERR!! %s: %m\n", msg);
	kill(0, SIGKILL);
}

/* Pin the calling process to the given CPU. */
void set_cpu(int cpu)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	if (sched_setaffinity(0, sizeof(mask), &mask) < 0)
		die("setaffinity");
}

// __wake_up_parent() does SYNC wake up, we need a handler to provoke
// signal_wake_up().
// otherwise ptrace_stop() is not preempted after read_unlock(tasklist).
static void sigchld(int sig)
{
}

int main(void)
{
	set_cpu(0);

	/* The child spins forever on CPU 0 at normal priority. */
	int pid = fork();
	if (pid < 0)
		die("fork");
	if (!pid)
		for (;;)
			;

	/* The parent (the ptracer) becomes the highest-priority RT task. */
	struct sched_param sp = { .sched_priority = 99 };
	if (sched_setscheduler(0, SCHED_FIFO, &sp))
		die("setscheduler");

	signal(SIGCHLD, sigchld);

	if (ptrace(PTRACE_ATTACH, pid, NULL, NULL))
		die("attach");

	wait(NULL);

	if (ptrace(PTRACE_DETACH, pid, NULL, NULL))
		die("detach");

	kill(pid, SIGKILL);

	return 0;
}

Locks CPU 0. Not a security problem, needs CAP_SYS_NICE and the task
could be reniced and killed, but still not good.

ptracee does ptrace_stop()->do_notify_parent_cldstop(), ptracer preempts
the child before it calls schedule(), ptrace(PTRACE_DETACH) goes to
wait_task_inactive() and yields forever.

Can we just replace yield() with schedule_timeout_uninterruptible(1)?
wait_task_inactive() has no time-critical callers, and as it is currently
used, the "on_rq" case is really unlikely.
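
To illustrate why sleeping helps where yielding does not, here is a toy
single-CPU model. It is only a sketch: toy_wait_task_inactive(),
enum wait_policy and struct toy_task are invented for illustration and do
not correspond to kernel interfaces. A smaller prio value is taken to mean
a higher priority, and one round stands in for one timer tick.

```c
#include <assert.h>
#include <stdbool.h>

enum wait_policy { WAIT_YIELD, WAIT_SLEEP_ONE_TICK };

/* Toy single-CPU model; smaller prio value = higher priority. */
struct toy_task {
	int prio;
	bool on_rq;
};

/*
 * Model the waiter (A) polling until the target (B) has left the runqueue.
 * Returns the number of polling rounds A needed, or -1 if B was still
 * runnable after max_rounds.
 */
static int toy_wait_task_inactive(int prio_a, int prio_b,
				  enum wait_policy policy, int max_rounds)
{
	struct toy_task a = { prio_a, true };
	struct toy_task b = { prio_b, true };	/* preempted before schedule() */

	assert(max_rounds > 0);
	for (int round = 0; round < max_rounds; round++) {
		if (!b.on_rq)
			return round;

		/* Sleeping takes A off the runqueue; yielding does not. */
		a.on_rq = (policy == WAIT_YIELD);

		/* B gets the CPU if A is not runnable or B outranks A. */
		if (!a.on_rq || b.prio < a.prio)
			b.on_rq = false;	/* B runs and calls schedule() */

		a.on_rq = true;			/* timer fires / A is picked again */
	}
	return b.on_rq ? -1 : max_rounds;
}
```

With illustrative priorities 5 for A and 120 for B, the WAIT_YIELD policy
never finishes (-1), while WAIT_SLEEP_ONE_TICK finishes after one round,
because A is off the runqueue long enough for B to reach schedule().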

Oleg.

Subject: Re: rt ptracer can monopolize CPU (was: Cpu-Hotplug and Real-Time)

On Thu, Aug 09, 2007 at 09:03:53PM +0400, Oleg Nesterov wrote:
> On 08/07, Oleg Nesterov wrote:
> >
> > On 08/07, Gautham R Shenoy wrote:
> > >
> > > A will now call kthread_bind(B, cpu1).
> > > kthread_bind(), calls wait_task_inactive(B), to ensures that
> > > B has scheduled itself out.
> > >
> > > B is still on the runqueue, so A calls yield() in wait_task_inactive().
> > > But since A is the task with the highest prio, scheduler schedules it
> > > back again.
> > >
> > > Thus B never gets to run to schedule itself out.
> > > A loops waiting for B to schedule out leading to system hang.
> >
> > But I think we have another case. An RT ptracer can share the same CPU
> > with ptracee. The latter sets TASK_STOPPED, unlocks ->siglock, and takes
> > a preemption. Ptracer does ptrace_check_attach(), sees TASK_STOPPED, and
> > yields in wait_task_inactive.
>
> Even simpler.
>
> #include <stdio.h>
> #include <signal.h>
> #include <unistd.h>
> #include <sys/ptrace.h>
> #include <sys/wait.h>
> #define __USE_GNU
> #include <sched.h>
>
> void die(const char *msg)
> {
> printf("ERR!! %s: %m\n", msg);
> kill(0, SIGKILL);
> }
>
> void set_cpu(int cpu)
> {
> unsigned cpuval = 1 << cpu;
> if (sched_setaffinity(0, 4, (void*)&cpuval) < 0)
> die("setaffinity");
> }
>
> // __wake_up_parent() does SYNC wake up, we need a handler to provoke
> // signal_wake_up().
> // otherwise ptrace_stop() is not preempted after read_unlock(tasklist).
> static void sigchld(int sig)
> {
> }
>
> int main(void)
> {
> set_cpu(0);
>
> int pid = fork();
> if (!pid)
> for (;;)
> ;
>
> struct sched_param sp = { 99 };
> if (sched_setscheduler(0, SCHED_FIFO, &sp))
> die("setscheduler");
>
> signal(SIGCHLD, sigchld);
>
> if (ptrace(PTRACE_ATTACH, pid, NULL, NULL))
> die("attach");
>
> wait(NULL);
>
> if (ptrace(PTRACE_DETACH, pid, NULL, NULL))
> die("detach");
>
> kill(pid, SIGKILL);
>
> return 0;
> }
>
> Locks CPU 0. Not a security problem, needs CAP_SYS_NICE and the task
> could be reniced and killed, but still not good.
>
> ptracee does ptrace_stop()->do_notify_parent_cldstop(), ptracer preempts
> the child before it calls schedule(), ptrace(PTRACE_DETACH) goes to
> wait_task_inactive() and yields forever.
>
> Can we just replace yield() with schedule_timeout_uninterruptible(1) ?
> wait_task_inactive() has no time-critical callers, and as it currently
> used "on_rq" case is really unlikely.

schedule_timeout_uninterruptible(1) works fine in my case.
It makes sense to have it there instead of yield(). As you pointed out,
it gets called only in the "unlikely" case.

Patch below.
Thanks and Regards
gautham.

-->
yield() in wait_task_inactive() can cause a high-priority thread to be
scheduled back in immediately, and thereby loop forever while it waits for
a lower-priority thread that is still on the runqueue.

Use schedule_timeout_uninterruptible(1) instead.

Signed-off-by: Gautham R Shenoy <[email protected]>
Credit: Oleg Nesterov <[email protected]>

---
kernel/sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.23-rc2/kernel/sched.c
===================================================================
--- linux-2.6.23-rc2.orig/kernel/sched.c
+++ linux-2.6.23-rc2/kernel/sched.c
@@ -1106,7 +1106,7 @@ repeat:
 	 * yield - it could be a while.
 	 */
 	if (unlikely(on_rq)) {
-		yield();
+		schedule_timeout_uninterruptible(1);
 		goto repeat;
 	}


--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"