this patch (on top of the current -mm scheduler patchset, in particular
on top of Nick's idle-thread optimization patches) tweaks cpu_idle()
semantics a bit: it changes the idle loops (that do preemption) to call
the first schedule() without checking for need_resched().
the advantage is that as a result we dont have to set the idle thread's
NEED_RESCHED flag in init_idle(), which in turn makes cond_resched()
even more of an invariant: it can be called even from init code without
it having any effect. A cond resched in the init codepath hangs
otherwise.
this patch, while having no negative side-effects, enables wider use of
cond_resched()s. (which might happen in the stock kernel too, but it's
particularly important for voluntary-preempt)
(note that for now this patch only covers architectures that use
kernel/Kconfig.preempt, but all other architectures will work just fine
too.)
Signed-off-by: Ingo Molnar <[email protected]>
arch/i386/kernel/process.c | 2 +-
arch/ppc64/kernel/idle.c | 11 +++++------
arch/x86_64/kernel/process.c | 3 +--
kernel/sched.c | 12 +++++++++++-
4 files changed, 18 insertions(+), 10 deletions(-)
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -4163,6 +4163,17 @@ void show_state(void)
read_unlock(&tasklist_lock);
}
+/**
+ * init_idle - set up an idle thread for a given CPU
+ * @idle: task in question
+ * @cpu: cpu the idle task belongs to
+ *
+ * NOTE: this function does not set the idle thread's NEED_RESCHED
+ * flag, to make booting more robust. Architecture-level cpu_idle()
+ * functions should structure their idle loop to call the first
+ * schedule() unconditionally, and to check for need_resched() only
+ * afterwards.
+ */
void __devinit init_idle(task_t *idle, int cpu)
{
runqueue_t *rq = cpu_rq(cpu);
@@ -4180,7 +4191,6 @@ void __devinit init_idle(task_t *idle, i
#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
idle->oncpu = 1;
#endif
- set_tsk_need_resched(idle);
spin_unlock_irqrestore(&rq->lock, flags);
/* Set the preempt count _outside_ the spinlocks! */
--- linux/arch/x86_64/kernel/process.c.orig
+++ linux/arch/x86_64/kernel/process.c
@@ -164,6 +164,7 @@ void cpu_idle (void)
{
/* endless idle loop with no priority at all */
while (1) {
+ schedule();
while (!need_resched()) {
void (*idle)(void);
@@ -176,8 +177,6 @@ void cpu_idle (void)
idle = default_idle;
idle();
}
-
- schedule();
}
}
--- linux/arch/ppc64/kernel/idle.c.orig
+++ linux/arch/ppc64/kernel/idle.c
@@ -86,6 +86,7 @@ static int iSeries_idle(void)
lpaca = get_paca();
while (1) {
+ schedule();
if (lpaca->lppaca.shared_proc) {
if (ItLpQueue_isLpIntPending(lpaca->lpqueue_ptr))
process_iSeries_events();
@@ -110,8 +111,6 @@ static int iSeries_idle(void)
set_need_resched();
}
}
-
- schedule();
}
return 0;
@@ -125,6 +124,10 @@ static int default_idle(void)
unsigned int cpu = smp_processor_id();
while (1) {
+ schedule();
+ if (cpu_is_offline(cpu) && system_state == SYSTEM_RUNNING)
+ cpu_die();
+
oldval = test_and_clear_thread_flag(TIF_NEED_RESCHED);
if (!oldval) {
@@ -145,10 +148,6 @@ static int default_idle(void)
} else {
set_need_resched();
}
-
- schedule();
- if (cpu_is_offline(cpu) && system_state == SYSTEM_RUNNING)
- cpu_die();
}
return 0;
--- linux/arch/i386/kernel/process.c.orig
+++ linux/arch/i386/kernel/process.c
@@ -181,6 +181,7 @@ void cpu_idle(void)
/* endless idle loop with no priority at all */
while (1) {
+ schedule();
while (!need_resched()) {
void (*idle)(void);
@@ -199,7 +200,6 @@ void cpu_idle(void)
__get_cpu_var(irq_stat).idle_timestamp = jiffies;
idle();
}
- schedule();
}
}
this patch (on top of the current -mm scheduler patchset plus the
previous 2 patches from me) adds a new preemption model: 'Voluntary
Kernel Preemption'. The 3 models can be selected from a new menu:
(X) No Forced Preemption (Server)
( ) Voluntary Kernel Preemption (Desktop)
( ) Preemptible Kernel (Low-Latency Desktop)
we still default to the stock (Server) preemption model.
Voluntary preemption works by adding a cond_resched()
(reschedule-if-needed) call to every might_sleep() check. It is lighter
than CONFIG_PREEMPT - at the cost of latencies that are not as tight. It
represents a different latency/complexity/overhead tradeoff.
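schematically, the mechanism is just this (a condensed sketch of the
kernel.h hunk below - might_resched() is the only new piece):

	#ifdef CONFIG_PREEMPT_VOLUNTARY
	extern int cond_resched(void);
	/* every might_sleep() check becomes a voluntary preemption point: */
	# define might_resched() cond_resched()
	#else
	/* compiles away completely - zero runtime impact: */
	# define might_resched() do { } while (0)
	#endif

	/* with CONFIG_DEBUG_SPINLOCK_SLEEP; otherwise might_sleep() is
	 * just might_resched(): */
	#define might_sleep() \
		do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)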
it has no runtime impact at all if disabled. Here are size stats that
show how the various preemption models impact the kernel's size:
text data bss dec hex filename
3618774 547184 179896 4345854 424ffe vmlinux.stock
3626406 547184 179896 4353486 426dce vmlinux.voluntary +0.2%
3748414 548640 179896 4476950 445016 vmlinux.preempt +3.5%
voluntary-preempt is +0.2% of .text, preempt is +3.5%.
this feature has been tested for many months by lots of people (it's
also included in the RHEL4 distribution, and earlier variants shipped in
Fedora as well), and it's intended for users and distributions who dont
want to use full-blown CONFIG_PREEMPT for one reason or another.
the patched kernel builds/boots on my test systems (x86 and x64) but it
obviously needs more testing. It's simple and straightforward enough to
be considered for upstream inclusion as well, after it gets exposure in
-mm.
Signed-off-by: Ingo Molnar <[email protected]>
include/linux/kernel.h | 18 +++++++++++----
kernel/Kconfig.preempt | 57 ++++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 62 insertions(+), 13 deletions(-)
--- linux/kernel/Kconfig.preempt.orig
+++ linux/kernel/Kconfig.preempt
@@ -1,15 +1,56 @@
-config PREEMPT
- bool "Preemptible Kernel"
+choice
+ prompt "Preemption Model"
+ default PREEMPT_NONE
+
+config PREEMPT_NONE
+ bool "No Forced Preemption (Server)"
+ help
+ This is the traditional Linux preemption model, geared towards
+ throughput. It will still provide good latencies most of the
+ time, but there are no guarantees and occasional longer delays
+ are possible.
+
+ Select this option if you are building a kernel for a server or
+ scientific/computation system, or if you want to maximize the
+ raw processing power of the kernel, irrespective of scheduling
+ latencies.
+
+config PREEMPT_VOLUNTARY
+ bool "Voluntary Kernel Preemption (Desktop)"
help
- This option reduces the latency of the kernel when reacting to
- real-time or interactive events by allowing a low priority process to
- be preempted even if it is in kernel mode executing a system call.
- This allows applications to run more reliably even when the system is
+ This option reduces the latency of the kernel by adding more
+ "explicit preemption points" to the kernel code. These new
+ preemption points have been selected to reduce the maximum
+ latency of rescheduling, providing faster application reactions,
+ at the cost of slightly lower throughput.
+
+ This allows reaction to interactive events by allowing a
+ low priority process to voluntarily preempt itself even if it
+ is in kernel mode executing a system call. This allows
+ applications to run more 'smoothly' even when the system is
under load.
- Say Y here if you are building a kernel for a desktop, embedded
- or real-time system. Say N if you are unsure.
+ Select this if you are building a kernel for a desktop system.
+
+config PREEMPT
+ bool "Preemptible Kernel (Low-Latency Desktop)"
+ help
+ This option reduces the latency of the kernel by making
+ all kernel code (that is not executing in a critical section)
+ preemptible. This allows reaction to interactive events by
+ permitting a low priority process to be preempted involuntarily
+ even if it is in kernel mode executing a system call and would
+ otherwise not be about to reach a natural preemption point.
+ This allows applications to run more 'smoothly' even when the
+ system is under load, at the cost of slightly lower throughput
+ and a slight runtime overhead to kernel code.
+
+ Select this if you are building a kernel for a desktop or
+ embedded system with latency requirements in the milliseconds
+ range.
+
+endchoice
config PREEMPT_BKL
bool "Preempt The Big Kernel Lock"
--- linux/include/linux/kernel.h.orig
+++ linux/include/linux/kernel.h
@@ -58,15 +58,23 @@ struct completion;
* be bitten later when the calling function happens to sleep when it is not
* supposed to.
*/
+#ifdef CONFIG_PREEMPT_VOLUNTARY
+extern int cond_resched(void);
+# define might_resched() cond_resched()
+#else
+# define might_resched() do { } while (0)
+#endif
+
#ifdef CONFIG_DEBUG_SPINLOCK_SLEEP
-#define might_sleep() __might_sleep(__FILE__, __LINE__)
-#define might_sleep_if(cond) do { if (unlikely(cond)) might_sleep(); } while (0)
-void __might_sleep(char *file, int line);
+ void __might_sleep(char *file, int line);
+# define might_sleep() \
+ do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
#else
-#define might_sleep() do {} while(0)
-#define might_sleep_if(cond) do {} while (0)
+# define might_sleep() do { might_resched(); } while (0)
#endif
+#define might_sleep_if(cond) do { if (unlikely(cond)) might_sleep(); } while (0)
+
#define abs(x) ({ \
int __x = (x); \
(__x < 0) ? -__x : __x; \
* Ingo Molnar <[email protected]> wrote:
> this patch (on top of the current -mm scheduler patchset, in particular
> on top of Nick's idle-thread optimization patches) tweaks cpu_idle()
> semantics a bit: [...]
i just noticed that Nick's latest idle-optimizations patch is not in
-rc4-mm2, so this will cause some clashes.
Ingo
On Tue, May 24, 2005 at 03:21:05PM +0200, Ingo Molnar wrote:
>
> this patch (on top of the current -mm scheduler patchset plus the
> previous 2 patches from me) adds a new preemption model: 'Voluntary
> Kernel Preemption'. The 3 models can be selected from a new menu:
>
> (X) No Forced Preemption (Server)
> ( ) Voluntary Kernel Preemption (Desktop)
> ( ) Preemptible Kernel (Low-Latency Desktop)
>
> we still default to the stock (Server) preemption model.
>
> Voluntary preemption works by adding a cond_resched()
> (reschedule-if-needed) call to every might_sleep() check.
I still disagree with this one violently. If you want a cond_resched()
add it where necessary, but don't hide it behind might_sleep - there
could be quite a lot of might_sleeps in common codepaths and they should
stay purely a debug aid.
Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>
>>this patch (on top of the current -mm scheduler patchset, in particular
>>on top of Nick's idle-thread optimization patches) tweaks cpu_idle()
>>semantics a bit: [...]
>
>
> i just noticed that Nick's latest idle-optimizations patch is not in
> -rc4-mm2, so this will cause some clashes.
>
Yeah, hmm. I've just got lazy with porting to other architectures and
sending to Andrew. I'll get it into shape in the next day or so and
so this patch will go on top of it hopefully before Andrew is ready
to release the next kernel.
* Nick Piggin <[email protected]> wrote:
> > i just noticed that Nick's latest idle-optimizations patch is not in
> > -rc4-mm2, so this will cause some clashes.
>
> Yeah, hmm. I've just got lazy with porting to other architectures and
> sending to Andrew. I'll get it into shape in the next day or so and so
> this patch will go on top of it hopefully before Andrew is ready to
> release the next kernel.
we could do it in the other direction just as much - i only touched 3
architectures. Up to Andrew i guess.
Ingo
* Christoph Hellwig <[email protected]> wrote:
> I still disagree with this one violently. [...]
(then you must be disagreeing with CONFIG_PREEMPT too to a certain
degree i guess?)
> [...] If you want a cond_resched() add it where necessary, but don't
> hide it behind might_sleep - there could be quite a lot of might_sleeps
> in common codepaths and they should stay purely a debug aid.
The recent proliferation of might_sleep() points was a direct result of
the -VP patch. I _did_ measure and lay out the might_sleep()s so that
key latency paths get cut. If we did what you propose we'd end up
duplicating 95% of the current might_sleep() invocations. So instead of
sprinkling the source with cond_resched()s, we implicitly get them via
might_sleep().
there's another argument as well: if a function truly might sleep, it's
in most cases complex enough to not worry about one extra need_resched()
check. So might_sleep() and cond_resched() pair better than one would
think.
(it is also a debugging helper: by actually sleeping at might_sleep()
points we truly explore whether preemption at that point is safe.)
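for reference, cond_resched() itself is a tiny reschedule-if-needed
helper - schematically something like this (the real one wraps the
schedule() call with PREEMPT_ACTIVE accounting):

	int cond_resched(void)
	{
		/* reschedule only if someone asked for it: */
		if (need_resched()) {
			schedule();
			return 1;
		}
		return 0;
	}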
or if you think we can get away with using just a couple of
cond_resched()s then be my guest and prove me wrong: take the -RT
kernel (it has both -VP and the latency measurement tools integrated),
remove the cond_resched() from might_sleep() and try to find the points
that are necessary to cut down latencies so that they fall into the
1msec range on typical hw.
Ingo
Ingo Molnar wrote:
> * Christoph Hellwig <[email protected]> wrote:
>
>
>>I still disagree with this one violently. [...]
>
>
> (then you must be disagreeing with CONFIG_PREEMPT too to a certain
> degree i guess?)
>
CONFIG_PREEMPT is different in that it explicitly defines and
delimits preempt critical sections, and allows maximum possible
preemption (whether or not the critical sections themselves are
too big is not really a CONFIG_PREEMPT issue).
Jamming in cond_resched in as many places as possible seems to
work quite well pragmatically, but is just pretty ugly for the
reasons Christoph mentioned (IMO).
The other thing is - if the users don't care about some extra
overhead, why don't they just use CONFIG_PREEMPT? Surely the case
is not that they can tolerate .5% overhead but not 1.5% (pulling
numbers out my bum).
IIRC, the reason (when you wrote the code) was that you didn't
want to enable preempt either because of binary compatibility, or
because of bugs? Well I think the bug issue is no more since your
debug patches went in, and the compatibility reason may be a fine
one for a distro kernel, but not a kernel.org one.
>
>>[...] If you want a cond_resched() add it where necessary, but don't
>>hide it behind might_sleep - there could be quite a lot of might_sleeps
>>in common codepaths and they should stay purely a debug aid.
>
>
[...]
> or if you think we can get away with using just a couple of
> cond_resched()s then you are my guest to prove me wrong: take the -RT
[...]
How about using CONFIG_PREEMPT instead?
--
SUSE Labs, Novell Inc.
Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
>
>
>>>i just noticed that Nick's latest idle-optimizations patch is not in
>>>-rc4-mm2, so this will cause some clashes.
>>
>>Yeah, hmm. I've just got lazy with porting to other architectures and
>>sending to Andrew. I'll get it into shape in the next day or so and so
>>this patch will go on top of it hopefully before Andrew is ready to
>>release the next kernel.
>
>
> we could do it in the other direction just as much - i only touched 3
> architectures. Up to Andrew i guess.
>
How about just setting need_resched at the start of the cpu_idle
function instead? Rather than changing the structure of the idle
loops themselves. That would suit me best.
Thanks,
Nick
On Wed, 2005-05-25 at 01:21 +1000, Nick Piggin wrote:
> IIRC, the reason (when you wrote the code) was that you didn't
> want to enable preempt either because of binary compatibility, or
> because of bugs? Well I think the bug issue is no more since your
> debug patches went in, and the compatibility reason may be a fine
> one for a distro kernel, but not a kernel.org one.
I can't imagine binary compatibility having been a reason. At least for
the RH distros it really isn't kernel wise. At All.
PREEMPT was (and is?) a stability risk and so you'll see RHEL4 not
having it enabled. But it has nothing to do with in-kernel binary
compatibility; that just doesn't exist, kernel.org or distro alike.
Arjan van de Ven wrote:
> On Wed, 2005-05-25 at 01:21 +1000, Nick Piggin wrote:
>
>>IIRC, the reason (when you wrote the code) was that you didn't
>>want to enable preempt either because of binary compatibility, or
>>because of bugs? Well I think the bug issue is no more since your
>>debug patches went in, and the compatibility reason may be a fine
>>one for a distro kernel, but not a kernel.org one.
>
>
> I can't imagine binary compatibility having been a reason. At least for
> the RH distros it really isn't kernel wise. At All.
> PREEMPT was (and is?) a stability risk and so you'll see RHEL4 not
> having it enabled. But it has nothing to do with in-kernel binary
> compatibility; that just doesn't exist, kernel.org or distro alike.
>
Yeah that was the reason.
* Nick Piggin <[email protected]> wrote:
> > (then you must be disagreeing with CONFIG_PREEMPT too to a certain
> > degree i guess?)
>
> CONFIG_PREEMPT is different in that it explicitly defines and delimits
> preempt critical sections, and allows maximum possible preemption
> (whether or not the critical sections themselves are too big is not
> really a CONFIG_PREEMPT issue).
from a theoretical POV, this categorization into 'preempt critical' vs.
'preempt-noncritical' sections is pretty arbitrary too.
from a practical POV the amount of code that is non-preemptible is not
controllable under CONFIG_PREEMPT. So the end-result is that
CONFIG_PREEMPT is just as nondeterministic.
polling need_resched after reaching a zero preempt_count() is ugly
(doing cond_resched() in might_sleep() is ugly too) - pretty much the
only difference is overhead.
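to illustrate that polling: under CONFIG_PREEMPT every preempt_enable()
ends up doing, very roughly (a sketch, not the exact in-tree macros):

	#define preempt_enable() \
		do { \
			dec_preempt_count(); \
			/* count hit zero: involuntary preemption point */ \
			if (unlikely(!preempt_count() && need_resched())) \
				preempt_schedule(); \
		} while (0)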
> Jamming in cond_resched in as many places as possible seems to work
> quite well pragmatically, [...]
yes, and that's what matters. It's just a single #ifdef in kernel.h, and
at least one major distribution would make use of it because it
significantly improves soft-RT latencies at a minimal cost. We can
remove it if it's not being used, but right now the only choice that
distributions have is no preemption or full-blown CONFIG_PREEMPT. Ask
the kernel maintainers at SuSE why they havent enabled CONFIG_PREEMPT in
their kernels.
Ingo
* Nick Piggin <[email protected]> wrote:
> > we could do it in the other direction just as much - i only touched
> > 3 architectures. Up to Andrew i guess.
>
> How about just setting need_resched at the start of the cpu_idle
> function instead? Rather than changing the structure of the idle loops
> themselves. That would suit me best.
that's fine with me too.
Ingo
Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
>
>
>>>(then you must be disagreeing with CONFIG_PREEMPT too to a certain
>>>degree i guess?)
>>
>>CONFIG_PREEMPT is different in that it explicitly defines and delimits
>>preempt critical sections, and allows maximum possible preemption
>>(whether or not the critical sections themselves are too big is not
>>really a CONFIG_PREEMPT issue).
>
>
> from a theoretical POV, this categorization into 'preempt critical' vs.
> 'preempt-noncritical' sections is pretty arbitrary too.
>
Not really, is it? If so, then wouldn't that be a bug?
I guess some code might hold a critical section open for longer than
it absolutely needs to, in order to gain efficiency, but basically
you're bound pretty well by critical section size.
> from a practical POV the amount of code that is non-preemptible is not
> controllable under CONFIG_PREEMPT. So the end-result is that
> CONFIG_PREEMPT is just as nondeterministic.
>
It is determined by the amount of code that is not preemptible, rather
than the maximum amount of time between successive calls to cond_resched.
IMO the former is cleaner.
> polling need_resched after reaching a zero preempt_count() is ugly
> (doing cond_resched() in might_sleep() is ugly too) - pretty much the
> only difference is overhead.
>
>
>>Jamming in cond_resched in as many places as possible seems to work
>>quite well pragmatically, [...]
>
>
> yes, and that's what matters. It's just a single #ifdef in kernel.h, and
> at least one major distribution would make use of it because it
> significantly improves soft-RT latencies at a minimal cost. We can
Yeah definitely. And in practice distros sometimes have to just go with
what works at the time. So it might be a fine solution for them indeed.
> remove it if it's not being used, but right now the only choice that
> distributions have is no preemption or full-blown CONFIG_PREEMPT. Ask
> the kernel maintainers at SuSE why they havent enabled CONFIG_PREEMPT in
> their kernels.
>
I guess there are a number of reasons. Probably the main one had
traditionally been the chance of bugs. I guess the next big one is
return on overhead (ie. the scheduling latency soon runs into the
problem of long critical sections), although thanks to you and
others, I understand that is becoming less and less of an issue over
time too.
If a new SUSE kernel branch was started from 2.6.12 with VP turned on
rather than PREEMPT then I would probably argue against it a little
bit ;)
--
SUSE Labs, Novell Inc.
Ingo Molnar wrote:
> --
>
> this patch (on top of the current -mm scheduler patchset) tweaks
> cpu_idle() semantics a bit: it changes the idle loops (that do
> preemption) to call the first schedule() unconditionally.
>
> the advantage is that as a result we dont have to set the idle thread's
> NEED_RESCHED flag in init_idle(), which in turn makes cond_resched()
> even more of an invariant: it can be called even from init code without
> it having any effect. A cond resched in the init codepath hangs
> otherwise.
>
> this patch, while having no negative side-effects, enables wider use of
> cond_resched()s. (which might happen in the stock kernel too, but it's
> particularly important for voluntary-preempt) (note that for now this
> patch only covers architectures that use kernel/Kconfig.preempt, but all
> other architectures will work just fine too.)
>
> Signed-off-by: Ingo Molnar <[email protected]>
>
Looks fine.
Acked-by: Nick Piggin <[email protected]>
* Ingo Molnar <[email protected]> wrote:
>
> * Nick Piggin <[email protected]> wrote:
>
> > > we could do it in the other direction just as much - i only touched
> > > 3 architectures. Up to Andrew i guess.
> >
> > How about just setting need_resched at the start of the cpu_idle
> > function instead? Rather than changing the structure of the idle loops
> > themselves. That would suit me best.
>
> that's fine with me too.
updated patch below.
--
this patch (on top of the current -mm scheduler patchset) tweaks
cpu_idle() semantics a bit: it changes the idle loops (that do
preemption) to call the first schedule() unconditionally.
the advantage is that as a result we dont have to set the idle thread's
NEED_RESCHED flag in init_idle(), which in turn makes cond_resched()
even more of an invariant: it can be called even from init code without
it having any effect. A cond resched in the init codepath hangs
otherwise.
this patch, while having no negative side-effects, enables wider use of
cond_resched()s. (which might happen in the stock kernel too, but it's
particularly important for voluntary-preempt) (note that for now this
patch only covers architectures that use kernel/Kconfig.preempt, but all
other architectures will work just fine too.)
Signed-off-by: Ingo Molnar <[email protected]>
arch/i386/kernel/process.c | 2 ++
arch/ppc64/kernel/idle.c | 1 +
arch/x86_64/kernel/process.c | 2 ++
kernel/sched.c | 10 +++++++++-
4 files changed, 14 insertions(+), 1 deletion(-)
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -4163,6 +4163,15 @@ void show_state(void)
read_unlock(&tasklist_lock);
}
+/**
+ * init_idle - set up an idle thread for a given CPU
+ * @idle: task in question
+ * @cpu: cpu the idle task belongs to
+ *
+ * NOTE: this function does not set the idle thread's NEED_RESCHED
+ * flag, to make booting more robust. Architecture-level cpu_idle()
+ * functions should set it explicitly, before entering their idle-loop.
+ */
void __devinit init_idle(task_t *idle, int cpu)
{
runqueue_t *rq = cpu_rq(cpu);
@@ -4180,7 +4189,6 @@ void __devinit init_idle(task_t *idle, i
#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
idle->oncpu = 1;
#endif
- set_tsk_need_resched(idle);
spin_unlock_irqrestore(&rq->lock, flags);
/* Set the preempt count _outside_ the spinlocks! */
--- linux/arch/x86_64/kernel/process.c.orig
+++ linux/arch/x86_64/kernel/process.c
@@ -162,6 +162,8 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
*/
void cpu_idle (void)
{
+ set_tsk_need_resched(current);
+
/* endless idle loop with no priority at all */
while (1) {
while (!need_resched()) {
--- linux/arch/ppc64/kernel/idle.c.orig
+++ linux/arch/ppc64/kernel/idle.c
@@ -305,6 +305,7 @@ static int native_idle(void)
void cpu_idle(void)
{
+ set_tsk_need_resched(current);
idle_loop();
}
--- linux/arch/i386/kernel/process.c.orig
+++ linux/arch/i386/kernel/process.c
@@ -179,6 +179,8 @@ void cpu_idle(void)
{
int cpu = _smp_processor_id();
+ set_tsk_need_resched(current);
+
/* endless idle loop with no priority at all */
while (1) {
while (!need_resched()) {
* Nick Piggin <[email protected]> wrote:
> >remove it if it's not being used, but right now the only choice that
> >distributions have is no preemption or full-blown CONFIG_PREEMPT. Ask
> >the kernel maintainers at SuSE why they havent enabled CONFIG_PREEMPT in
> >their kernels.
> >
>
> I guess there are a number of reasons. Probably the main one had
> traditionally been the chance of bugs. I guess the next big one is
> return on overhead (ie. the scheduling latency soon runs into the
> problem of long critical sections), although thanks to you and others,
> I understand that is becoming less and less of an issue over time too.
>
> If a new SUSE kernel branch was started from 2.6.12 with VP turned on
> rather than PREEMPT then I would probably argue against it a little
> bit ;)
dont think of scheduling latencies as a binary thing, a.k.a. "do we
have good preemption latencies". It's a continuum, with an almost
continuous range of techniques. One thing is sure: close to one end of
the spectrum we have PREEMPT_NONE, and pretty close to the other end of
the spectrum we have PREEMPT_RT.
both PREEMPT_VOLUNTARY and CONFIG_PREEMPT are at arbitrary points within
that continuum, with different cost/benefit tradeoffs. Neither is
perfect, and both are 'ugly' in the theoretical sense.
now, i dont intend to populate our .config with a whole continuum of
preemption models ;) But clearly the past 4 years have shown that no
major distro was brave enough to go CONFIG_PREEMPT, so a solution in
between is needed. -VP is precisely such a (very low-impact) solution.
It has a ridiculously low impact:
include/linux/kernel.h | 18 +++++++++++----
we have already talked an order of magnitude more about this feature
than its size warrants (with help text included :). Let's go with it and
let people know that the water is fine. If it's unused it can be zapped
easily.
Ingo
Ingo Molnar <[email protected]> wrote:
>
> ..
>
> this patch (on top of the current -mm scheduler patchset) tweaks
> cpu_idle() semantics a bit: it changes the idle loops (that do
> preemption) to call the first schedule() unconditionally.
>
> the advantage is that as a result we dont have to set the idle thread's
> NEED_RESCHED flag in init_idle(), which in turn makes cond_resched()
> even more of an invariant: it can be called even from init code without
> it having any effect. A cond resched in the init codepath hangs
> otherwise.
>
> this patch, while having no negative side-effects, enables wider use of
> cond_resched()s. (which might happen in the stock kernel too, but it's
> particularly important for voluntary-preempt) (note that for now this
> patch only covers architectures that use kernel/Kconfig.preempt, but all
> other architectures will work just fine too.)
>
> --- linux/arch/x86_64/kernel/process.c.orig
> +++ linux/arch/x86_64/kernel/process.c
> @@ -162,6 +162,8 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
> */
> void cpu_idle (void)
> {
> + set_tsk_need_resched(current);
> +
> /* endless idle loop with no priority at all */
> while (1) {
> while (!need_resched()) {
ia64 needed this treatment also to fix a hang during boot. u o me 3 hrs.
Are all the other architectures busted as well?
* Andrew Morton <[email protected]> wrote:
> > --- linux/arch/x86_64/kernel/process.c.orig
> > +++ linux/arch/x86_64/kernel/process.c
> > @@ -162,6 +162,8 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
> > */
> > void cpu_idle (void)
> > {
> > + set_tsk_need_resched(current);
> > +
> > /* endless idle loop with no priority at all */
> > while (1) {
> > while (!need_resched()) {
>
> ia64 needed this treatment also to fix a hang during boot. u o me 3 hrs.
>
> Are all the other architectures busted as well?
oh damn, they are indeed, because they need to hit schedule() at least
once.
The patch below should address this problem for all architectures, by
doing an explicit schedule() in the init code before calling into
cpu_idle(). It's a replacement for the following patch:
sched-remove-set_tsk_need_resched-from-init_idle.patch
Ingo
--
This patch tweaks idle thread setup semantics a bit: instead of setting
NEED_RESCHED in init_idle(), we do an explicit schedule() before
calling into cpu_idle().
This patch, while having no negative side-effects, enables wider use of
cond_resched()s. (which might happen in the stock kernel too, but it's
particularly important for voluntary-preempt)
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Nick Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -4163,6 +4163,14 @@ void show_state(void)
read_unlock(&tasklist_lock);
}
+/**
+ * init_idle - set up an idle thread for a given CPU
+ * @idle: task in question
+ * @cpu: cpu the idle task belongs to
+ *
+ * NOTE: this function does not set the idle thread's NEED_RESCHED
+ * flag, to make booting more robust.
+ */
void __devinit init_idle(task_t *idle, int cpu)
{
runqueue_t *rq = cpu_rq(cpu);
@@ -4180,7 +4188,6 @@ void __devinit init_idle(task_t *idle, i
#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
idle->oncpu = 1;
#endif
- set_tsk_need_resched(idle);
spin_unlock_irqrestore(&rq->lock, flags);
/* Set the preempt count _outside_ the spinlocks! */
--- linux/init/main.c.orig
+++ linux/init/main.c
@@ -383,6 +383,13 @@ static void noinline rest_init(void)
numa_default_policy();
unlock_kernel();
preempt_enable_no_resched();
+
+ /*
+ * The boot idle thread must execute schedule()
+ * at least once to get things moving:
+ */
+ schedule();
+
cpu_idle();
}
* Ingo Molnar <[email protected]> wrote:
> The patch below should address this problem for all architectures, by
> doing an explicit schedule() in the init code before calling into
> cpu_idle(). It's a replacement for the following patch:
>
> sched-remove-set_tsk_need_resched-from-init_idle.patch
builds/boots fine on x86.
Ingo
On Tue, May 24, 2005 at 05:09:50PM +0200, Ingo Molnar wrote:
>
> * Christoph Hellwig <[email protected]> wrote:
>
> > I still disagree with this one violently. [...]
>
> (then you must be disagreeing with CONFIG_PREEMPT too to a certain
> degree i guess?)
Yes.
On Wed, May 25, 2005 at 03:51:30PM +0200, Ingo Molnar wrote:
> * Andrew Morton <[email protected]> wrote:
> > Are all the other architectures busted as well?
>
> oh damn, they are indeed, because they need to hit schedule() at least
> once.
>
> The patch below should address this problem for all architectures, by
> doing an explicit schedule() in the init code before calling into
> cpu_idle().
Yuck - wouldn't it be better just to fix all the architectures instead
of applying band aid?
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
* Russell King <[email protected]> wrote:
> > The patch below should address this problem for all architectures, by
> > doing an explicit schedule() in the init code before calling into
> > cpu_idle().
>
> Yuck - wouldn't it be better just to fix all the architectures instead
> of applying band aid?
it's not really a bug in any architecture - it's a scheduler setup
detail that i changed, and which i initially thought would be best
handled in cpu_idle(), but which is easier to do in rest_init().
Ingo
[re-added Russell to the CC list]
Ingo Molnar wrote:
> * Russell King <[email protected]> wrote:
>
>
>>>The patch below should address this problem for all architectures, by
>>>doing an explicit schedule() in the init code before calling into
>>>cpu_idle().
>>
>>Yuck - wouldn't it be better just to fix all the architectures instead
>>of applying band aid?
>
>
> it's not really a bug in any architecture - it's a scheduler setup
> detail that i changed, and which i initially thought would be best
> handled in cpu_idle(), but which is easier to do in rest_init().
>
Hmm, what has changed is that secondary CPUs haven't got
need_resched set in their idle routines. Whether or not it
is possible to put a task on their runqueue at that stage, I
didn't bother looking - I assume you did.
However, Ingo - instead of calling schedule() at the end of
rest_init(), why not just set need_resched instead?
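i.e., roughly, the tail of rest_init() would then become:

	preempt_enable_no_resched();
	/* instead of an explicit schedule(), mark the boot idle
	 * thread for rescheduling: */
	set_tsk_need_resched(current);
	cpu_idle();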
--
SUSE Labs, Novell Inc.
* Nick Piggin <[email protected]> wrote:
> >it's not really a bug in any architecture - it's a scheduler setup
> >detail that i changed, and which i initially thought would be best
> >handled in cpu_idle(), but which is easier to do in rest_init().
>
> Hmm, what has changed is that secondary CPUs haven't got need_resched
> set in their idle routines. Whether or not it is possible to put a task on
> their runqueue at that stage, I didn't bother looking - I assume you
> did.
the idle thread is the first thing that gets on the runqueue of
secondary CPUs. It's really just the boot process that has this
inevitable 'we do some stuff before the scheduler has been set up'
sequence. Secondary CPUs _must not be active_ before their idle thread
has been set up. (which would be the source of other bugs as well.)
> However, Ingo - instead of calling schedule() at the end of
> rest_init(), why not just set need_resched instead?
it's slightly smaller code.
Ingo