2007-10-06 02:15:31

by Mike Kravetz

Subject: -rt more realtime scheduling issues

Hi Ingo,

After applying the fix to try_to_wake_up() I was still seeing some large
latencies for realtime tasks. Some debug code pointed out two additional
causes of these latencies. I have put fixes into my 'old' kernel and the
scheduler related latencies have gone away. I'm pretty confident that
one of these bugs still exists in the latest RT patch set. Not so sure
about the other. But, I wanted to describe them in detail so that you
could address them in the latest version of the code if applicable.

finish_task_switch() contains the following code:

#if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
        /*
         * If we pushed an RT task off the runqueue,
         * then kick other CPUs, they might run it:
         */
        if (unlikely(rt_task(current) && prev->se.on_rq && rt_task(prev))) {
                schedstat_inc(rq, rto_schedule);
                smp_send_reschedule_allbutself_cpumask(current->cpus_allowed);
        }
#endif

My debug code found instances where more than one realtime task got
put on the runqueue before __schedule() was invoked. So, current
would be a realtime task, but prev was not realtime. And, there was
another (lower priority, or last in) realtime task on the queue. I
believe that in this case we would still want to send the IPIs. In my
kernel I changed the test to be:

if (unlikely(rt_task(current) && rq->rt_nr_running > 1)) {

After this change, I definitely saw some long latencies go away.
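
For clarity, the whole block in my kernel now reads:

#if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
        /*
         * If there is more than one runnable RT task on this
         * runqueue, kick other CPUs, they might run one:
         */
        if (unlikely(rt_task(current) && rq->rt_nr_running > 1)) {
                schedstat_inc(rq, rto_schedule);
                smp_send_reschedule_allbutself_cpumask(current->cpus_allowed);
        }
#endif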

The other place of concern is in the routine pull_task(). I was a
little surprised to see realtime tasks moved around via normal load
balancing. But, my debug code did point this out. In the code for
my old kernel, the routine ends with:

        /*
         * Note that idle threads have a prio of MAX_PRIO, for this test
         * to be always true for them.
         */
        if (TASK_PREEMPTS_CURR(p, this_rq))
                resched_task(this_rq->curr);

This reminded me very much of the situation/code in try_to_wake_up().
If pull_task() pulled in a realtime task, then I think it should also
deal with the case where TASK_PREEMPTS_CURR(p, this_rq) is false. So
I changed the code in my kernel to be:

        /*
         * Note that idle threads have a prio of MAX_PRIO, for this test
         * to be always true for them.
         */
        if (TASK_PREEMPTS_CURR(p, this_rq)) {
                resched_task(this_rq->curr);

        } else if (unlikely(rt_task(p))) {
                /* no appropriate rt_overload counter goes here */
                smp_send_reschedule_allbutself();
        }

To be perfectly honest, I don't know if this change helped eliminate
any of the large latencies I was seeing. I made this change first,
and was still seeing some large latencies. I then made the modification
to finish_task_switch() and all my scheduler related latencies went
away. It is entirely possible this change had no impact. Also, the above
code is replaced in the latest kernels with:

check_preempt_curr(this_rq, p);

What check_preempt_curr() does is not immediately obvious to me. So,
this may not apply at all. Just something to think about.
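
From a quick look, though, it appears to just dispatch to the scheduling
class of the currently running task, roughly (paraphrasing from the
2.6.23 sources, so the details may differ in -rt):

/* delegate the preemption check to curr's scheduling class */
static inline void check_preempt_curr(struct rq *rq, struct task_struct *p)
{
        rq->curr->sched_class->check_preempt_curr(rq, p);
}

If that reading is right, the rt class version only compares priorities
and reschedules the local CPU, so it would not cover the overload IPI
case discussed above.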

--
Mike


2007-10-08 18:45:03

by Mike Kravetz

Subject: Re: -rt more realtime scheduling issues

On Fri, Oct 05, 2007 at 07:15:48PM -0700, Mike Kravetz wrote:
> After applying the fix to try_to_wake_up() I was still seeing some large
> latencies for realtime tasks.

I've been looking for places in the code where reschedule IPIs should
be sent in the case of 'overload' to redistribute RealTime tasks based
on priority. However, an even more basic question to ask might be: is
the use of reschedule IPIs reliable enough for this purpose? In the
code, there is the following comment:

        /*
         * this function sends a 'reschedule' IPI to another CPU.
         * it goes straight through and wastes no time serializing
         * anything. Worst case is that we lose a reschedule ...
         */

After a quick read of the code, it does appear that reschedules can
be lost if the IPI is sent at just the right time in schedule
processing. Can someone confirm this is actually the case?

The issue I see is that the 'rt_overload' mechanism depends on
reschedule IPIs for its RealTime scheduling semantics. If IPI delivery
is not reliable, those scheduling semantics can break down.

Are these accurate statements? I'll start working on a reliable delivery
mechanism for RealTime scheduling. But, I just want to make sure that
it is really necessary.

--
Mike

2007-10-09 02:46:46

by Steven Rostedt

Subject: [PATCH RT] fix rt-task scheduling issue

Mike,

Can you attach your Signed-off-by to this patch, please.


On Fri, Oct 05, 2007 at 07:15:48PM -0700, Mike Kravetz wrote:
> Hi Ingo,
>
> After applying the fix to try_to_wake_up() I was still seeing some large
> latencies for realtime tasks. Some debug code pointed out two additional
> causes of these latencies. I have put fixes into my 'old' kernel and the
> scheduler related latencies have gone away. I'm pretty confident that
> one of these bugs still exists in the latest RT patch set. Not so sure
> about the other. But, I wanted to describe them in detail so that you
> could address them in the latest version of the code if applicable.
>
> finish_task_switch() contains the following code:
>
> #if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
>         /*
>          * If we pushed an RT task off the runqueue,
>          * then kick other CPUs, they might run it:
>          */
>         if (unlikely(rt_task(current) && prev->se.on_rq && rt_task(prev))) {
>                 schedstat_inc(rq, rto_schedule);
>                 smp_send_reschedule_allbutself_cpumask(current->cpus_allowed);
>         }
> #endif
>
> My debug code found instances where more than one realtime task got
> put on the runqueue before __schedule() was invoked. So, current
> would be a realtime task, but prev was not realtime. And, there was
> another (lower priority, or last in) realtime task on the queue. I
> believe that in this case we would still want to send the IPIs. In my
> kernel I changed the test to be:
>
> if (unlikely(rt_task(current) && rq->rt_nr_running > 1)) {
>
> After this change, I definitely saw some long latencies go away.

I definitely agree with your analysis.

>
> The other place of concern is in the routine pull_task(). I was a
> little surprised to see realtime tasks moved around via normal load
> balancing. But, my debug code did point this out. In the code for
> my old kernel, the routine ends with:
>
>         /*
>          * Note that idle threads have a prio of MAX_PRIO, for this test
>          * to be always true for them.
>          */
>         if (TASK_PREEMPTS_CURR(p, this_rq))
>                 resched_task(this_rq->curr);
>
> This reminded me very much of the situation/code in try_to_wake_up().
> If pull_task() pulled in a realtime task, then I think it should also
> deal with the case where TASK_PREEMPTS_CURR(p, this_rq) is false. So
> I changed the code in my kernel to be:
>
>         /*
>          * Note that idle threads have a prio of MAX_PRIO, for this test
>          * to be always true for them.
>          */
>         if (TASK_PREEMPTS_CURR(p, this_rq)) {
>                 resched_task(this_rq->curr);
>
>         } else if (unlikely(rt_task(p))) {
>                 /* no appropriate rt_overload counter goes here */
>                 smp_send_reschedule_allbutself();
>         }

I'm thinking that the first change would actually make this one
obsolete. The check at the time of scheduling should cover most cases
where multiple rt tasks are being queued on the same CPU. When we see
that rt tasks are bunching up on a queue we should handle it then,
which I would think is at schedule time and at the time a task is
queued (try_to_wake_up). Hopefully this is enough.

>
> To be perfectly honest, I don't know if this change helped eliminate
> any of the large latencies I was seeing. I made this change first,
> and was still seeing some large latencies. I then made the modification
> to finish_task_switch() and all my scheduler related latencies went
> away. It is entirely possible this change had no impact. Also, the above

I'm thinking it may have had little to no effect. The first change seems
to be the culprit.

> code is replaced in the latest kernels with:
>
> check_preempt_curr(this_rq, p);
>
> What check_preempt_curr() does is not immediately obvious to me. So,
> this may not apply at all. Just something to think about.

I also don't want to send too many reschedule IPIs when we see that we
have more than one rt task on the queue. I can imagine an IPI scheduling
storm if we have one more rt task than CPUs. So sending the IPI when a
task switch actually occurs seems appropriate.

-- Steve

Signed-off-by: Steven Rostedt <[email protected]>


Index: linux-2.6.23-rc9-rt2/kernel/sched.c
===================================================================
--- linux-2.6.23-rc9-rt2.orig/kernel/sched.c
+++ linux-2.6.23-rc9-rt2/kernel/sched.c
@@ -2207,7 +2207,7 @@ static inline void finish_task_switch(st
          * If we pushed an RT task off the runqueue,
          * then kick other CPUs, they might run it:
          */
-        if (unlikely(rt_task(current) && prev->se.on_rq && rt_task(prev))) {
+        if (unlikely(rt_task(current) && rq->rt_nr_running > 1)) {
                 schedstat_inc(rq, rto_schedule);
                 smp_send_reschedule_allbutself_cpumask(current->cpus_allowed);
         }

2007-10-09 03:04:50

by Steven Rostedt

Subject: Re: -rt more realtime scheduling issues

On Mon, Oct 08, 2007 at 11:45:23AM -0700, Mike Kravetz wrote:
> On Fri, Oct 05, 2007 at 07:15:48PM -0700, Mike Kravetz wrote:
> > After applying the fix to try_to_wake_up() I was still seeing some large
> > latencies for realtime tasks.
>
> I've been looking for places in the code where reschedule IPIs should
> be sent in the case of 'overload' to redistribute RealTime tasks based
> on priority. However, an even more basic question to ask might be: is
> the use of reschedule IPIs reliable enough for this purpose? In the
> code, there is the following comment:
>
>         /*
>          * this function sends a 'reschedule' IPI to another CPU.
>          * it goes straight through and wastes no time serializing
>          * anything. Worst case is that we lose a reschedule ...
>          */
>
> After a quick read of the code, it does appear that reschedules can
> be lost if the IPI is sent at just the right time in schedule
> processing. Can someone confirm this is actually the case?
>
> The issue I see is that the 'rt_overload' mechanism depends on
> reschedule IPIs for its RealTime scheduling semantics. If IPI delivery
> is not reliable, those scheduling semantics can break down.
>
> Are these accurate statements? I'll start working on a reliable delivery
> mechanism for RealTime scheduling. But, I just want to make sure that
> it is really necessary.

For i386 I don't think so. It seems that the interrupt handler will set
the current task's "need_resched", and on exit of the interrupt handler
the schedule should take place. I don't see the race (that doesn't mean
there isn't one).

For x86_64 though, I don't think that we schedule. All the reschedule
vector does is return with a comment:

/*
 * Reschedule call back. Nothing to do,
 * all the work is done automatically when
 * we return from the interrupt.
 */
asmlinkage void smp_reschedule_interrupt(void)
{
        ack_APIC_irq();
}

I'm thinking that this was the case for i386 a while back, and we fixed
it for RT.

/me does a quick search...

http://lkml.org/lkml/2005/5/13/174

Yep! This is a bug in x86_64. I'll fix this up tomorrow and send out a
patch.
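
The fix should look much like what was done for i386 back then; as a
rough, untested sketch (not the actual patch):

asmlinkage void smp_reschedule_interrupt(void)
{
        ack_APIC_irq();
        /*
         * Actually mark the current task, so the interrupt return
         * path reschedules instead of silently dropping a broadcast
         * reschedule request.
         */
        set_tsk_need_resched(current);
}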

-- Steve

2007-10-09 04:21:36

by Gregory Haskins

Subject: Re: [PATCH RT] fix rt-task scheduling issue

Hi Guys,
Nice find! Comment inline..

(adding linux-rt-users)

for reference to

http://lkml.org/lkml/2007/10/8/252

On Mon, 2007-10-08 at 22:46 -0400, Steven Rostedt wrote:
> Index: linux-2.6.23-rc9-rt2/kernel/sched.c
> ===================================================================
> --- linux-2.6.23-rc9-rt2.orig/kernel/sched.c
> +++ linux-2.6.23-rc9-rt2/kernel/sched.c
> @@ -2207,7 +2207,7 @@ static inline void finish_task_switch(st
>          * If we pushed an RT task off the runqueue,
>          * then kick other CPUs, they might run it:
>          */
> -        if (unlikely(rt_task(current) && prev->se.on_rq && rt_task(prev))) {
> +        if (unlikely(rt_task(current) && rq->rt_nr_running > 1)) {
>                 schedstat_inc(rq, rto_schedule);
>                 smp_send_reschedule_allbutself_cpumask(current->cpus_allowed)

The current->cpus_allowed I think probably should have been
"prev->cpus_allowed" in the original code? However, in light of the new
findings with the bug Mike found, this should probably be sent to
allbutself() without the mask, since you don't know what could have been
queued behind you.

Unless I am missing something?

Regards,
-Greg


2007-10-09 08:17:18

by Peter Zijlstra

Subject: Re: -rt more realtime scheduling issues

On Mon, 2007-10-08 at 23:04 -0400, Steven Rostedt wrote:
> On Mon, Oct 08, 2007 at 11:45:23AM -0700, Mike Kravetz wrote:
> > On Fri, Oct 05, 2007 at 07:15:48PM -0700, Mike Kravetz wrote:
> > > After applying the fix to try_to_wake_up() I was still seeing some large
> > > latencies for realtime tasks.
> >
> > I've been looking for places in the code where reschedule IPIs should
> > be sent in the case of 'overload' to redistribute RealTime tasks based
> > on priority. However, an even more basic question to ask might be: is
> > the use of reschedule IPIs reliable enough for this purpose? In the
> > code, there is the following comment:
> >
> >         /*
> >          * this function sends a 'reschedule' IPI to another CPU.
> >          * it goes straight through and wastes no time serializing
> >          * anything. Worst case is that we lose a reschedule ...
> >          */
> >
> > After a quick read of the code, it does appear that reschedules can
> > be lost if the IPI is sent at just the right time in schedule
> > processing. Can someone confirm this is actually the case?
> >
> > The issue I see is that the 'rt_overload' mechanism depends on
> > reschedule IPIs for its RealTime scheduling semantics. If IPI delivery
> > is not reliable, those scheduling semantics can break down.
> >
> > Are these accurate statements? I'll start working on a reliable delivery
> > mechanism for RealTime scheduling. But, I just want to make sure that
> > it is really necessary.
>
> For i386 I don't think so. It seems that the interrupt handler will set
> the current task's "need_resched", and on exit of the interrupt handler
> the schedule should take place. I don't see the race (that doesn't mean
> there isn't one).
>
> For x86_64 though, I don't think that we schedule. All the reschedule
> vector does is return with a comment:
>
> /*
>  * Reschedule call back. Nothing to do,
>  * all the work is done automatically when
>  * we return from the interrupt.
>  */
> asmlinkage void smp_reschedule_interrupt(void)
> {
>         ack_APIC_irq();
> }
>
> I'm thinking that this was the case for i386 a while back, and we fixed
> it for RT.
>
> /me does a quick search...
>
> http://lkml.org/lkml/2005/5/13/174
>
> Yep! This is a bug in x86_64. I'll fix this up tomorrow and send out a
> patch.

Hmm, my understanding is that the IPI caller needs to set
TIF_NEED_RESCHED before issuing the IPI.
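
That is what resched_task() does for the targeted case; roughly, going
from memory of the mainline 2.6.23 sources (details may differ in -rt):

static void resched_task(struct task_struct *p)
{
        int cpu;

        assert_spin_locked(&task_rq(p)->lock);

        if (unlikely(test_tsk_thread_flag(p, TIF_NEED_RESCHED)))
                return;

        /* set the flag first; the IPI itself carries no information */
        set_tsk_thread_flag(p, TIF_NEED_RESCHED);

        cpu = task_cpu(p);
        if (cpu == smp_processor_id())
                return;

        /* NEED_RESCHED must be visible before we test polling */
        smp_mb();
        if (!tsk_is_polling(p))
                smp_send_reschedule(cpu);
}

The broadcast allbutself paths have no specific task to flag, which is
where a reschedule could get dropped.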

So I'm inclined to not like this 'fix'.

2007-10-09 18:49:55

by Mike Kravetz

Subject: Re: -rt more realtime scheduling issues

On Mon, Oct 08, 2007 at 11:04:12PM -0400, Steven Rostedt wrote:
> On Mon, Oct 08, 2007 at 11:45:23AM -0700, Mike Kravetz wrote:
> > Are these accurate statements? I'll start working on a reliable delivery
> > mechanism for RealTime scheduling. But, I just want to make sure that
> > it is really necessary.
>
> For i386 I don't think so. It seems that the interrupt handler will set
> the current task's "need_resched", and on exit of the interrupt handler
> the schedule should take place. I don't see the race (that doesn't mean
> there isn't one).

The more I try to understand the IPI handling, the more confused I get. :(
At first I was concerned about an IPI happening in the middle of the
__schedule routine. But, then it occurred to me that interrupts are
disabled in this routine (while holding the runqueue lock). So, IPIs
are not delivered during __schedule processing. Right?

But, if this is the case, then I don't understand the following code in
schedule():

local_irq_disable();

do {
        __schedule();
} while (unlikely(test_thread_flag(TIF_NEED_RESCHED) ||
                  test_thread_flag(TIF_NEED_RESCHED_DELAYED)));

local_irq_enable();

How can the reschedule flags possibly be set AFTER running __schedule,
especially when the call is explicitly surrounded by local_irq_disable()/
local_irq_enable()?

Can someone help me?
--
Mike

2007-10-09 18:51:14

by Mike Kravetz

Subject: Re: [PATCH RT] fix rt-task scheduling issue

On Mon, Oct 08, 2007 at 10:46:21PM -0400, Steven Rostedt wrote:
> Mike,
>
> Can you attach your Signed-off-by to this patch, please.
>
>
> On Fri, Oct 05, 2007 at 07:15:48PM -0700, Mike Kravetz wrote:
> > Hi Ingo,
> >
> > After applying the fix to try_to_wake_up() I was still seeing some large
> > latencies for realtime tasks. Some debug code pointed out two additional
> > causes of these latencies. I have put fixes into my 'old' kernel and the
> > scheduler related latencies have gone away. I'm pretty confident that
> > one of these bugs still exists in the latest RT patch set. Not so sure
> > about the other. But, I wanted to describe them in detail so that you
> > could address them in the latest version of the code if applicable.
> >
> > finish_task_switch() contains the following code:
> >
> > #if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
> >         /*
> >          * If we pushed an RT task off the runqueue,
> >          * then kick other CPUs, they might run it:
> >          */
> >         if (unlikely(rt_task(current) && prev->se.on_rq && rt_task(prev))) {
> >                 schedstat_inc(rq, rto_schedule);
> >                 smp_send_reschedule_allbutself_cpumask(current->cpus_allowed);
> >         }
> > #endif
> >
> > My debug code found instances where more than one realtime task got
> > put on the runqueue before __schedule() was invoked. So, current
> > would be a realtime task, but prev was not realtime. And, there was
> > another (lower priority, or last in) realtime task on the queue. I
> > believe that in this case we would still want to send the IPIs. In my
> > kernel I changed the test to be:
> >
> > if (unlikely(rt_task(current) && rq->rt_nr_running > 1)) {
> >
> > After this change, I definitely saw some long latencies go away.

OK, not really a patch but

Signed-off-by: Mike Kravetz <[email protected]>

--
Mike

2007-10-10 11:51:17

by Steven Rostedt

Subject: Re: -rt more realtime scheduling issues

On Tue, Oct 09, 2007 at 11:49:53AM -0700, Mike Kravetz wrote:
> The more I try to understand the IPI handling, the more confused I get. :(
> At first I was concerned about an IPI happening in the middle of the
> __schedule routine. But, then it occurred to me that interrupts are
> disabled in this routine (while holding the runqueue lock). So, IPIs
> are not delivered during __schedule processing. Right?
>
> But, if this is the case, then I don't understand the following code in
> schedule():
>
>         local_irq_disable();
>
>         do {
>                 __schedule();
>         } while (unlikely(test_thread_flag(TIF_NEED_RESCHED) ||
>                           test_thread_flag(TIF_NEED_RESCHED_DELAYED)));
>
>         local_irq_enable();
>
> How can the reschedule flags possibly be set AFTER running __schedule,
> especially when the call is explicitly surrounded by local_irq_disable()/
> local_irq_enable()?
>
> Can someone help me?
>

Sure, another CPU can set the task's NEED_RESCHED flag. In try_to_wake_up,
if the process being woken is placed on a runqueue on another CPU and it
is of higher priority than the task currently running there, the process
doing the waking will set the NEED_RESCHED flag for that running task.

So, to catch the race where a higher priority process gets woken onto
this runqueue just after we have switched to the new running task, we
check the flag again here.

Now, is this really needed? I don't know. It seems the intent is to
check here so we don't need to take the interrupt and then schedule
while coming back out of the interrupt handler as a preemption schedule.
This way we just schedule again and save a little of the overhead of
going through the interrupt.

But this brings up an interesting point. Since the IRQ handlers are run
as threads, and the interrupt is what will wake them, this seems to add
a bit of latency to interrupts.

For example:

We schedule in process A of prio 1

Before exiting __schedule, process B is woken up on that same rq
with a prio of 2 and sets A's NEED_RESCHED flag.

Also, an interrupt goes off and is sent to this CPU. But since interrupts
are disabled, we wait.

Leaving __schedule(), we see that A's NEED_RESCHED flag is set, so we
continue the do while loop and call __schedule again.

We schedule in B of prio 2.

We leave __schedule as well as the do while loop and then enable
interrupts.

The interrupt that was pending is now triggered.

It wakes up the handler of prio 90, and since that is higher in priority
than process B of prio 2, it sets B's NEED_RESCHED flag.

On return from the interrupt we call schedule again.

This seems strange. I can imagine that on a box with a large number of
CPUs this can happen quite often, leaving interrupts disabled for several
rounds through schedule.

I say we ax that while loop.

Ingo?

-- Steve

2007-10-11 02:37:00

by Mike Kravetz

Subject: Re: -rt more realtime scheduling issues

On Wed, Oct 10, 2007 at 07:50:52AM -0400, Steven Rostedt wrote:
> On Tue, Oct 09, 2007 at 11:49:53AM -0700, Mike Kravetz wrote:
> > The more I try to understand the IPI handling, the more confused I get. :(
> > At first I was concerned about an IPI happening in the middle of the
> > __schedule routine. But, then it occurred to me that interrupts are
> > disabled in this routine (while holding the runqueue lock). So, IPIs
> > are not delivered during __schedule processing. Right?
> >
> > But, if this is the case, then I don't understand the following code in
> > schedule():
> >
> >         local_irq_disable();
> >
> >         do {
> >                 __schedule();
> >         } while (unlikely(test_thread_flag(TIF_NEED_RESCHED) ||
> >                           test_thread_flag(TIF_NEED_RESCHED_DELAYED)));
> >
> >         local_irq_enable();
> >
> > How can the reschedule flags possibly be set AFTER running __schedule,
> > especially when the call is explicitly surrounded by local_irq_disable()/
> > local_irq_enable()?
> >
> > Can someone help me?
> >
>
> Sure, another CPU can set the task's NEED_RESCHED flag. In try_to_wake_up,
> if the process being woken is placed on a runqueue on another CPU and it
> is of higher priority than the task currently running there, the process
> doing the waking will set the NEED_RESCHED flag for that running task.

Yes, right. I guess I spent too much time thinking about the 'broadcast
IPI' case, where NEED_RESCHED is only set by the handler. In the case
where we 'reschedule' a specific CPU, the flag is set and the IPI sent.


> So, to catch the race where a higher priority process gets woken onto
> this runqueue just after we have switched to the new running task, we
> check the flag again here.
>
> Now, is this really needed? I don't know. It seems the intent is to
> check here so we don't need to take the interrupt and then schedule
> while coming back out of the interrupt handler as a preemption schedule.
> This way we just schedule again and save a little of the overhead of
> going through the interrupt.

One more thing. test_thread_flag() uses the thread info of the current
thread. But, if __schedule() did a context switch, then it is possible
the NEED_RESCHED flag was set in the previous task instead of current.
Does that make sense? The resched flags get cleared at the beginning
of __schedule (as they should be). But, if we really want that loop to
be valid, shouldn't the flags be copied to the current task? Something
like the following after the context switch:

if (test_and_clear_tsk_thread_flag(prev, TIF_NEED_RESCHED))
        set_tsk_need_resched(current);
if (test_and_clear_tsk_thread_flag(prev, TIF_NEED_RESCHED_DELAYED))
        set_tsk_need_resched_delayed(current);

Don't we also need to worry about the flags being left set in the
previous task? That is why I think a test_and_clear would make sense.

But, your argument about axing the loop altogether makes sense as well.

> But this brings up an interesting point. Since the IRQ handlers are run
> as threads, and the interrupt is what will wake them, this seems to add
> a bit of latency to interrupts.
>
> For example:
>
> We schedule in process A of prio 1
>
> Before exiting __schedule, process B is woken up on that same rq
> with a prio of 2 and sets A's NEED_RESCHED flag.
>
> Also, an interrupt goes off and is sent to this CPU. But since interrupts
> are disabled, we wait.
>
> Leaving __schedule(), we see that A's NEED_RESCHED flag is set, so we
> continue the do while loop and call __schedule again.
>
> We schedule in B of prio 2.
>
> We leave __schedule as well as the do while loop and then enable
> interrupts.
>
> The interrupt that was pending is now triggered.
>
> It wakes up the handler of prio 90, and since that is higher in priority
> than process B of prio 2, it sets B's NEED_RESCHED flag.
>
> On return from the interrupt we call schedule again.
>
> This seems strange. I can imagine that on a box with a large number of
> CPUs this can happen quite often, leaving interrupts disabled for several
> rounds through schedule.
>
> I say we ax that while loop.
>
> Ingo?
>
> -- Steve

--
Mike