2002-03-15 20:56:30

by jak

Subject: [PATCH] 2.4.18 scheduler bugs

Hi Marcelo et al.,
The following fixes some rather straightforward bugs in the old
(pre-O(1)) scheduler that I discovered while exercising it with
custom instrumentation written in. These may be worth fixing, given
that it may be a long time before the new O(1) scheduler officially
shows up in a production tree.

Joe

- ksoftirqd() - change daemon nice(2) value from 19 to -19.

SoftIRQ servicing was less important than the most lowly of batch
tasks. This patch makes it more important than all but the realtime
tasks.

- reschedule_idle() - smp_send_reschedule when setting idle's need_resched

Idle tasks nowadays don't spin waiting for need_resched to change,
they sleep on a halt insn instead. Therefore any setting of
need_resched on an idle task running on a remote CPU should be
accompanied by a cross-processor interrupt.

diff -Nur linux-2.4.18-base/kernel/sched.c linux/kernel/sched.c
--- linux-2.4.18-base/kernel/sched.c Fri Dec 21 12:42:04 2001
+++ linux/kernel/sched.c Fri Mar 15 14:57:21 2002
@@ -225,16 +225,9 @@
if (can_schedule(p, best_cpu)) {
tsk = idle_task(best_cpu);
if (cpu_curr(best_cpu) == tsk) {
- int need_resched;
send_now_idle:
- /*
- * If need_resched == -1 then we can skip sending
- * the IPI altogether, tsk->need_resched is
- * actively watched by the idle thread.
- */
- need_resched = tsk->need_resched;
tsk->need_resched = 1;
- if ((best_cpu != this_cpu) && !need_resched)
+ if (best_cpu != this_cpu)
smp_send_reschedule(best_cpu);
return;
}
diff -Nur linux-2.4.18-base/kernel/softirq.c linux/kernel/softirq.c
--- linux-2.4.18-base/kernel/softirq.c Wed Oct 31 13:26:02 2001
+++ linux/kernel/softirq.c Fri Mar 15 14:55:38 2002
@@ -365,7 +365,7 @@
int cpu = cpu_logical_map(bind_cpu);

daemonize();
- current->nice = 19;
+ current->nice = -19;
sigfillset(&current->blocked);

/* Migrate to the right CPU */


2002-03-15 21:00:30

by Ingo Molnar

Subject: Re: [PATCH] 2.4.18 scheduler bugs


On Fri, 15 Mar 2002, Joe Korty wrote:

> - ksoftirqd() - change daemon nice(2) value from 19 to -19.

this is broken. The goal is to reduce softirq load during overload
situations. The default policy should be "do not allow external network
load to make your system essentially unusable". Those who want to allow
this nevertheless can renice ksoftirqd manually.

> - reschedule_idle() - smp_send_reschedule when setting idle's need_resched
>
> Idle tasks nowadays don't spin waiting for need_resched to change,
> they sleep on a halt insn instead. Therefore any setting of
> need_resched on an idle task running on a remote CPU should be
> accompanied by a cross-processor interrupt.

this is broken as well. Check out the idle=poll feature i wrote some time
ago.

Ingo

2002-03-15 21:25:03

by Alan

Subject: Re: [PATCH] 2.4.18 scheduler bugs

> - ksoftirqd() - change daemon nice(2) value from 19 to -19.
>
> SoftIRQ servicing was less important than the most lowly of batch
> tasks. This patch makes it more important than all but the realtime
> tasks.

Bad idea - the right fix to this is to stop using ksoftirqd so readily
under load. If it bails after 20 iterations life is good. As shipped life
is bad.

Once ksoftirqd triggers it's because we are seriously overloaded (or,
without fixing its use, slightly randomly). In that case we want other
stuff to do work before we potentially unleash the next flood.

> - reschedule_idle() - smp_send_reschedule when setting idle's need_resched
> Idle tasks nowadays don't spin waiting for need_resched to change,
> they sleep on a halt insn instead. Therefore any setting of
> need_resched on an idle task running on a remote CPU should be
> accompanied by a cross-processor interrupt.

You can already tell the kernel to poll for idle tasks (and since newer Intel
CPUs seem to gain nothing from hlt, you might as well on those)

2002-03-15 21:27:03

by jak

Subject: Re: [PATCH] 2.4.18 scheduler bugs

>> - reschedule_idle() - smp_send_reschedule when setting idle's need_resched
>>
>> Idle tasks nowadays don't spin waiting for need_resched to change,
>> they sleep on a halt insn instead. Therefore any setting of
>> need_resched on an idle task running on a remote CPU should be
>> accompanied by a cross-processor interrupt.
>
> this is broken as well. Check out the idle=poll feature i wrote some time
> ago.

The idle=poll stuff is a hack. I'd like my idle cpus to sleep and still
have them wake up the moment work for them becomes available. I see no
reason why an idle cpu should be forced to remain idle until the next
tick, nor why fixing that should be considered `broken'.

Joe

2002-03-15 21:29:13

by jak

Subject: Re: [PATCH] 2.4.18 scheduler bugs

>> - ksoftirqd() - change daemon nice(2) value from 19 to -19.
>>
>> SoftIRQ servicing was less important than the most lowly of batch
>> tasks. This patch makes it more important than all but the realtime
>> tasks.
>
> Bad idea - the right fix to this is to stop using ksoftirqd so readily
> under load. If it bails after 20 iterations life is good. As shipped life
> is bad.
>
> Once ksoftirqd triggers it's because we are seriously overloaded (or,
> without fixing its use, slightly randomly). In that case we want other
> stuff to do work before we potentially unleash the next flood.

That certainly makes sense. Thanks.
Joe

2002-03-15 21:40:13

by Ingo Molnar

Subject: Re: [PATCH] 2.4.18 scheduler bugs


On Fri, 15 Mar 2002, Joe Korty wrote:

> >> Idle tasks nowadays don't spin waiting for need_resched to change,
> >> they sleep on a halt insn instead. Therefore any setting of
> >> need_resched on an idle task running on a remote CPU should be
> >> accompanied by a cross-processor interrupt.
> >
> > this is broken as well. Check out the idle=poll feature i wrote some time
> > ago.
>
> The idle=poll stuff is a hack. [...]

it's a feature.

> [...] I'd like my idle cpus to sleep and still have them wake up the
> moment work for them becomes available. I see no reason why an idle cpu
> should be forced to remain idle until the next tick, nor why fixing that
> should be considered `broken'.

performance. IPIs are expensive.

Ingo

2002-03-15 21:43:13

by Alan

Subject: Re: [PATCH] 2.4.18 scheduler bugs

> > moment work for them becomes available. I see no reason why an idle cpu
> > should be forced to remain idle until the next tick, nor why fixing that
> > should be considered `broken'.
>
> performance. IPIs are expensive.

On a PIII I can see this being the case, especially as they don't power save
on hlt nowadays. But on the Athlon the IPI isn't going down a little side
channel between cpus.

2002-03-15 21:49:03

by Ingo Molnar

Subject: Re: [PATCH] 2.4.18 scheduler bugs


On Fri, 15 Mar 2002, Alan Cox wrote:

> > > moment work for them becomes available. I see no reason why an idle cpu
> > > should be forced to remain idle until the next tick, nor why fixing that
> > > should be considered `broken'.
> >
> > performance. IPIs are expensive.
>
> On a PIII I can see this being the case, especially as they don't power
> save on hlt nowadays.

it's an option, and the default is to use the hlt instruction. The main
reason is to let Linux save power - and those who need that final
performance edge (and it's measurable) can enable it. HLT still uses less
power than the tight idle loop.

> [...] But on the Athlon the IPI isnt going down a little side channel
> between cpus.

but even in the Athlon case an IPI is still an IRQ entry, which will add
at least 200 cycles or more to the idle wakeup latency.

Ingo

2002-03-15 22:00:53

by jak

Subject: Re: [PATCH] 2.4.18 scheduler bugs

> > [...] But on the Athlon the IPI isnt going down a little side channel
> > between cpus.
>
> but even in the Athlon case an IPI is still an IRQ entry, which will add
> at least 200 cycles or more to the idle wakeup latency.

It is an idle cpu that is spending those 200 cycles.

Joe

2002-03-15 22:01:45

by Ingo Molnar

Subject: Re: [PATCH] 2.4.18 scheduler bugs


On Fri, 15 Mar 2002, Joe Korty wrote:

> > > [...] But on the Athlon the IPI isnt going down a little side channel
> > > between cpus.
> >
> > but even in the Athlon case an IPI is still an IRQ entry, which will add
> > at least 200 cycles or more to the idle wakeup latency.
>
> It is an idle cpu that is spending those 200 cycles.

wrong. When it's woken up it's *not* an idle CPU anymore, and it's the
freshly woken up task that is going to execute 200 cycles later...

Ingo

2002-03-15 22:15:49

by jak

Subject: Re: [PATCH] 2.4.18 scheduler bugs

>>> but even in the Athlon case an IPI is still an IRQ entry, which will add
>>> at least 200 cycles or more to the idle wakeup latency.
>>
>> It is an idle cpu that is spending those 200 cycles.
>
> wrong. When it's woken up it's *not* an idle CPU anymore, and it's the
> freshly woken up task that is going to execute 200 cycles later...

I have to disagree. It is the woken up task *running on the
otherwise idle CPU* that burns up 200 cycles at the tail.

A cpu is wasting, say, 5,000,000 cycles (1GHz/100/2, or 1/2 tick) in
hlt when it could have been doing work. Why worry about an
alternative wakeup path that burns up 200-400 cycles of that on the
otherwise idling cpu, even if it is at the tail?

Joe

2002-03-15 23:54:19

by Mike Kravetz

Subject: Re: [PATCH] 2.4.18 scheduler bugs

Joe: from your original mail,

On Fri, Mar 15, 2002 at 03:54:39PM -0500, Joe Korty wrote:
>
> - reschedule_idle() - smp_send_reschedule when setting idle's need_resched
>
> Idle tasks nowadays don't spin waiting for need_resched to change,
> they sleep on a halt insn instead. Therefore any setting of
> need_resched on an idle task running on a remote CPU should be
> accompanied by a cross-processor interrupt.
>
> diff -Nur linux-2.4.18-base/kernel/sched.c linux/kernel/sched.c
> --- linux-2.4.18-base/kernel/sched.c Fri Dec 21 12:42:04 2001
> +++ linux/kernel/sched.c Fri Mar 15 14:57:21 2002
> @@ -225,16 +225,9 @@
> if (can_schedule(p, best_cpu)) {
> tsk = idle_task(best_cpu);
> if (cpu_curr(best_cpu) == tsk) {
> - int need_resched;
> send_now_idle:
> - /*
> - * If need_resched == -1 then we can skip sending
> - * the IPI altogether, tsk->need_resched is
> - * actively watched by the idle thread.
> - */
> - need_resched = tsk->need_resched;
> tsk->need_resched = 1;
> - if ((best_cpu != this_cpu) && !need_resched)
> + if (best_cpu != this_cpu)
> smp_send_reschedule(best_cpu);
> return;
> }

Note that the original code always did send an IPI when setting
need_resched. It only skipped the IPI if need_resched was already
set. I may be wrong, but I always considered the check for
need_resched already being set to be an optimization. In other
words, if need_resched was already set then you know an IPI was
previously sent but schedule has not yet run on that CPU.

For your patch to make any difference, there would need to be code
that frequently set need_resched and did not send an IPI to the
target CPU. I don't think there is any code that does this (except
code that sets need_resched while in interrupt context which is
handled by other means).

I would agree that the comment in the above code is out of date
and misleading.

--
Mike

2002-03-16 10:27:57

by Ingo Molnar

Subject: Re: [PATCH] 2.4.18 scheduler bugs


On Fri, 15 Mar 2002, Joe Korty wrote:

> >> It is an idle cpu that is spending those 200 cycles.
> >
> > wrong. When it's woken up it's *not* an idle CPU anymore, and it's the
> > freshly woken up task that is going to execute 200 cycles later...
>
> I have to disagree. It is the woken up task *running on the otherwise
> idle CPU* that burns up 200 cycles at the tail.

what do you disagree with? It's a fact that any overhead added to the
idle-wakeup path is not 'idle time' but adds latency (overhead) to the
freshly woken up task's runtime.

> A cpu is wasting, say, 5,000,000 cycles (1GHz/100/2, or 1/2 tick) in hlt
> when it could have been doing work. Why worry about an alternative
> wakeup path that burns up 200-400 cycles of that on the otherwise idling
> cpu, even if it is at the tail?

it's *not* idle time; it's naive to think that "it's in the idle task, so
it must be idle time". Latency added to the idle-wakeup shows up as direct
overhead in the woken up task. Let's look at an example: CPU0 is waking up
bdflush that will run on CPU1, while CPU1 is currently idle:

CPU0                          CPU1
[wakeup bdflush]
[send IPI]
                              [... IPI delivery latency ...]
                              [IRQ entry/exit]
                              [idle thread context switches]
                              [bdflush runs on CPU1]

contrasted with the idle=poll situation:

CPU0                          CPU1
[wakeup bdflush]
[set need_resched]
                              [idle thread context switches]
                              [bdflush runs on CPU1]

as you can see, the overhead of 'send IPI', 'IPI delivery' and 'IRQ
entry/exit' delays bdflush. Even assuming that sending and receiving an
IPI is as fast as setting & detecting need_resched [which it theoretically
can be], the IPI variant still has the cost of IRQ entry (and exit), which
is 200 cycles only optimistically; it's more like thousands of cycles on a
GHz box.

[ as mentioned before, the default idle method has power saving advantages
(even if it's not HLT, some of the better methods do save considerable
amount of power), but idle=poll is clearly an option for the truly
performance-sensitive applications. ]

Ingo

2002-03-19 02:08:25

by Rob Landley

Subject: Re: [PATCH] 2.4.18 scheduler bugs

On Friday 15 March 2002 04:39 pm, Alan Cox wrote:
> > - ksoftirqd() - change daemon nice(2) value from 19 to -19.
> >
> > SoftIRQ servicing was less important than the most lowly of batch
> > tasks. This patch makes it more important than all but the realtime
> > tasks.
>
> Bad idea - the right fix to this is to stop using ksoftirqd so readily
> under load. If it bails after 20 iterations life is good. As shipped life
> is bad.
>
> Once ksoftirqd triggers it's because we are seriously overloaded (or,
> without fixing its use, slightly randomly). In that case we want other
> stuff to do work before we potentially unleash the next flood.

Also, I'm not sure if this is still the case, but in earlier versions of the
O(1) scheduler, nice(19) tasks tended to get their entire timeslice in one
big long uninterrupted gulp. (By the time they got a slice, everything with
a higher priority had already run.) This was great for long cpu-intensive
loads because it meant your cache stayed hot, so you wound up churning
through your CPU-bound load even faster than if you got little snippets of
time and had to keep reloading your cache. (Little snippets are great for
interactive latency, but they suck for number crunching.)

Higher priority tasks get first crack at the CPU, but that doesn't mean they
get to keep it for long. High priority is for latency reasons, not for
throughput reasons. If your router spills over to ksoftirqd, you're already
beyond optimizing for latency. If ksoftirqd is doing ping-pong scheduling
with cron, overall throughput is worse.

Rob