The 2.4.18 kernel was behaving incorrectly for SCHED_FIFO and SCHED_RR
scheduling.
The correct behaviour for SCHED_FIFO is priority preemption:
run to completion, or system call, or preemption by higher priority
process. The correct behaviour for SCHED_RR is the same as SCHED_FIFO for
the preemption case, or run for a time slice, and go to the back of the
run queue for that priority.
More details can be found at:
http://www.opengroup.org/onlinepubs/7908799/xsh/realtime.html
As a side note, SCHED_RR is completely broken in the 2.2 series kernels.
This is a small patch, but fixes the behaviour for SCHED_FIFO and SCHED_RR
scheduling in the 2.4.18 kernel. It also improves the efficiency of the
kernel by NOT calling schedule() for every tick for a SCHED_FIFO process.
--
Bhavesh P. Davda
Avaya Inc
[email protected]
On Wed, 12 Jun 2002, Bhavesh P. Davda wrote:
> The 2.4.18 kernel was behaving incorrectly for SCHED_FIFO and SCHED_RR
> scheduling.
>
> The correct behaviour for SCHED_FIFO is priority preemption: run to
> completion, or system call, or preemption by higher priority process.
> The correct behaviour for SCHED_RR is the same as SCHED_FIFO for the
> preemption case, or run for a time slice, and go to the back of the run
> queue for that priority.
>
> More details can be found at:
>
> http://www.opengroup.org/onlinepubs/7908799/xsh/realtime.html
>
> As a side note, SCHED_RR is completely broken in the 2.2 series kernels.
>
> This is a small patch, but fixes the behaviour for SCHED_FIFO and
> SCHED_RR scheduling in the 2.4.18 kernel. It also improves the
> efficiency of the kernel by NOT calling schedule() for every tick for a
> SCHED_FIFO process.
good catch, your observations are correct.
btw., have you checked the 2.5 kernel's scheduler? It does all these
things correctly: it queues freshly woken up tasks to the tail of the
queue, it does not reschedule SCHED_FIFO tasks every timer tick and does
not move RT tasks to the head of the queue in sys_setscheduler().
in terms of 2.4.18, the timer and the setscheduler() change is OK, but i
dont think we want the add_to_runqueue() change. It changes wakeup
characteristics for non-RT tasks, it could affect any many-threads or
many-processes application adversely. And we've been doing FIFO wakeups
like this for ages and nobody complained, so it's not that we are in a big
hurry. Fundamental changes like this are fair game for the 2.5 kernel.
[and we dont even know the full performance impact of this change even in
2.5, although it's been in since 2.5.3 or so. The full effect of things
like this will show up during beta-testing of 2.6 i suspect.] Plus this
change does not make *that* much of a difference - not many people use
SCHED_FIFO tasks with the same priority, the typical usage is to sort the
tasks by priority - this is one reason why there's a push to increase the
number of RT priority levels to something like 1000 in the 2.5 kernel.
And if multiple SCHED_FIFO tasks have the same priority then exact
scheduling is more like the matter of luck anyway.
I've attached a minimal patch for 2.4.18 which does the other two things
only - the timer optimization has no semantical impact, and the
setscheduler() change only affects tasks that switch between RT and non-RT
scheduling modes.
Ingo
--- linux/kernel/sched.c.orig Thu Jun 13 20:14:31 2002
+++ linux/kernel/sched.c Thu Jun 13 20:23:51 2002
@@ -334,12 +334,6 @@
list_add_tail(&p->run_list, &runqueue_head);
}
-static inline void move_first_runqueue(struct task_struct * p)
-{
- list_del(&p->run_list);
- list_add(&p->run_list, &runqueue_head);
-}
-
/*
* Wake up a process. Put it on the run-queue if it's not
* already there. The "current" process is always on the
@@ -955,9 +949,6 @@
retval = 0;
p->policy = policy;
p->rt_priority = lp.sched_priority;
- if (task_on_runqueue(p))
- move_first_runqueue(p);
-
current->need_resched = 1;
out_unlock:
--- linux/kernel/timer.c.orig Thu Jun 13 20:17:04 2002
+++ linux/kernel/timer.c Thu Jun 13 20:23:15 2002
@@ -585,7 +585,8 @@
if (p->pid) {
if (--p->counter <= 0) {
p->counter = 0;
- p->need_resched = 1;
+ if (p->policy != SCHED_FIFO)
+ p->need_resched = 1;
}
if (p->nice > 0)
kstat.per_cpu_nice[cpu] += user_tick;
Ingo,
Ingo Molnar wrote:
> good catch, your observations are correct.
Thank you.
> btw., have you checked the 2.5 kernel's scheduler? It does all these
> things correctly: it queues freshly woken up tasks to the tail of the
> queue, it does not reschedule SCHED_FIFO tasks every timer tick and does
> not move RT tasks to the head of the queue in sys_setscheduler().
No I haven't. What prompted me to go with the kernel.org 2.4.18 kernel
is the fact that the RedHat 7.3 2.4.18-3 kernel, with your O(1)
scheduler patches besides hundreds of other patches any of which might
also have changed the scheduler, doesn't honour SCHED_FIFO or SCHED_RR
real-time priorities at all.
> in terms of 2.4.18, the timer and the setscheduler() change is OK, but i
> dont think we want the add_to_runqueue() change. It changes wakeup
> characteristics for non-RT tasks, it could affect any many-threads or
> many-processes application adversely. And we've been doing FIFO wakeups
I would think that the logical place to add any process to the runqueue
would be the back of the runqueue. If all processes are ALWAYS added to
the back of the runqueue, then every process is GUARANTEED to eventually
be scheduled. No process will be starved indefinitely.
> like this for ages and nobody complained, so it's not that we are in a big
> hurry. Fundamental changes like this are fair game for the 2.5 kernel.
> [and we dont even know the full performance impact of this change even in
> 2.5, although it's been in since 2.5.3 or so. The full effect of things
> like this will show up during beta-testing of 2.6 i suspect.] Plus this
> change does not make *that* much of a difference - not many people use
> SCHED_FIFO tasks with the same priority, the typical usage is to sort the
> tasks by priority - this is one reason why there's a push to increase the
> number of RT priority levels to something like 1000 in the 2.5 kernel.
> And if multiple SCHED_FIFO tasks have the same priority then exact
> scheduling is more like the matter of luck anyway.
The application that I am dealing with is a communications application
with 86 SCHED_FIFO processes, crammed between priority levels 7-23, that
depend on priority preemption using System V semaphores. The 2.2 kernel
SCHED_FIFO behaviour was correct as far as a preempted SCHED_FIFO
process being put in the back of the runqueue is concerned. But the 2.4
kernel SCHED_FIFO behaviour was broken because of the add_to_runqueue()
bug. That lead to our application grossly misbehaving under the 2.4.18
scheduler.
As far as performance is concerned, putting the "if" test in
update_process_times for SCHED_FIFO actually improved the performance of
our application by 15%, as it would for any SCHED_FIFO centric
application that relies on priority preemption where the average
preemption time is > a timer tick.
Therefore, since my guess is that several applications out there depend
on correct SCHED_FIFO and SCHED_RR behaviour as per the POSIX
definition, I would like to request that my patch be applied to the
2.4.19 kernel for people and companies who are reluctant to move to the
2.5 series kernel for stability reasons.
Thank you.
- Bhavesh
--
Bhavesh P. Davda
Avaya Inc
Room B3-B03 E-mail : [email protected]
1300 West 120th Avenue Phone : (303) 538-4438
Westminster, CO 80234 Fax : (303) 538-3155
On Thu, 2002-06-13 at 14:14, Bhavesh P. Davda wrote:
> No I haven't. What prompted me to go with the kernel.org 2.4.18 kernel
> is the fact that the RedHat 7.3 2.4.18-3 kernel, with your O(1)
> scheduler patches besides hundreds of other patches any of which might
> also have changed the scheduler, doesn't honour SCHED_FIFO or SCHED_RR
> real-time priorities at all.
Huh? It certainly should...
> I would think that the logical place to add any process to the runqueue
> would be the back of the runqueue. If all processes are ALWAYS added to
> the back of the runqueue, then every process is GUARANTEED to eventually
> be scheduled. No process will be starved indefinitely.
You are entirely right, but Ingo's point is very valid: changing wakeup
behavior is risky and is not ideal in the middle of 2.4.
Fixing major bugs is fine for 2.4, but changing behavior to suit an
ideal is not. Now is your chance to do so for 2.5, however.
Robert Love
On 13 Jun 2002, Robert Love wrote:
> You are entirely right, but Ingo's point is very valid: changing wakeup
> behavior is risky and is not ideal in the middle of 2.4.
>
> Fixing major bugs is fine for 2.4, but changing behavior to suit an
> ideal is not. Now is your chance to do so for 2.5, however.
i'd like to point out again that the current 2.5 scheduler has all the
functionality the patch adds for 2.4, so there's no argument about whether
it's correct technically.
Ingo
On Thu, 13 Jun 2002, Bhavesh P. Davda wrote:
> > in terms of 2.4.18, the timer and the setscheduler() change is OK, but i
> > dont think we want the add_to_runqueue() change. It changes wakeup
> > characteristics for non-RT tasks, it could affect any many-threads or
> > many-processes application adversely. And we've been doing FIFO wakeups
>
> I would think that the logical place to add any process to the runqueue
> would be the back of the runqueue. If all processes are ALWAYS added to
> the back of the runqueue, then every process is GUARANTEED to eventually
> be scheduled. No process will be starved indefinitely.
in the case of the Linux scheduler non-RT processes wont be starved
indefinitely, because timeslices will expire after some time so even
'backlogged' processes will get a chance to run. The LIFO wakeup method
can be argued to improve performance as well: if N equal priority
processes are to be considered then the one with the most recent activity
(the most cache-hot) process should be run. Fairness is enforced via the
timeslice distribution scheme.
for the case of SCHED_FIFO processes there is no timeslice-driven
fairness. For them it's clearly better to do FIFO wakeups.
> The application that I am dealing with is a communications application
> with 86 SCHED_FIFO processes, crammed between priority levels 7-23, that
> depend on priority preemption using System V semaphores. The 2.2 kernel
> SCHED_FIFO behaviour was correct as far as a preempted SCHED_FIFO
> process being put in the back of the runqueue is concerned. But the 2.4
> kernel SCHED_FIFO behaviour was broken because of the add_to_runqueue()
> bug. That lead to our application grossly misbehaving under the 2.4.18
> scheduler.
okay, you've certainly convinced me.
we can do this without affecting SCHED_OTHER behavior - it adds one more
branch to a hotpath, but correctness comes first. Patch against vanilla
2.4.18 attached, it does the FIFO wakeup if it's a RT task, otherwise the
wakeup is still LIFO. Clearly a FIFO wakeup is broken wrt. RT tasks.
(any recent merge of the O(1) scheduler to 2.4 should have this behavior
'automatically'.)
> As far as performance is concerned, putting the "if" test in
> update_process_times for SCHED_FIFO actually improved the performance of
> our application by 15%, as it would for any SCHED_FIFO centric
> application that relies on priority preemption where the average
> preemption time is > a timer tick.
(strange, calling schedule() every 10 msecs should not cost 1.5 msecs.)
Ingo
--- linux/kernel/sched.c.orig Thu Jun 13 20:14:31 2002
+++ linux/kernel/sched.c Thu Jun 13 23:33:41 2002
@@ -324,7 +324,10 @@
*/
static inline void add_to_runqueue(struct task_struct * p)
{
- list_add(&p->run_list, &runqueue_head);
+ if (p->policy == SCHED_OTHER)
+ list_add(&p->run_list, &runqueue_head);
+ else
+ list_add_tail(&p->run_list, &runqueue_head);
nr_running++;
}
@@ -334,12 +337,6 @@
list_add_tail(&p->run_list, &runqueue_head);
}
-static inline void move_first_runqueue(struct task_struct * p)
-{
- list_del(&p->run_list);
- list_add(&p->run_list, &runqueue_head);
-}
-
/*
* Wake up a process. Put it on the run-queue if it's not
* already there. The "current" process is always on the
@@ -955,9 +952,6 @@
retval = 0;
p->policy = policy;
p->rt_priority = lp.sched_priority;
- if (task_on_runqueue(p))
- move_first_runqueue(p);
-
current->need_resched = 1;
out_unlock:
--- linux/kernel/timer.c.orig Thu Jun 13 20:17:04 2002
+++ linux/kernel/timer.c Thu Jun 13 20:23:15 2002
@@ -585,7 +585,8 @@
if (p->pid) {
if (--p->counter <= 0) {
p->counter = 0;
- p->need_resched = 1;
+ if (p->policy != SCHED_FIFO)
+ p->need_resched = 1;
}
if (p->nice > 0)
kstat.per_cpu_nice[cpu] += user_tick;
On Thu, Jun 13, 2002 at 03:14:53PM -0600, Bhavesh P. Davda wrote:
> I would think that the logical place to add any process to the runqueue
> would be the back of the runqueue. If all processes are ALWAYS added to
> the back of the runqueue, then every process is GUARANTEED to eventually
> be scheduled. No process will be starved indefinitely.
FYI, from SuSv3:
"Under the SCHED_FIFO policy, the modification of the definitional
thread lists is as follows:
1. When a running thread becomes a preempted thread, it becomes
the head of the thread list for its priority.
2. When a blocked thread becomes a runnable thread, it becomes
the tail of the thread list for its priority.
....
7. If a thread whose policy or priority has been modified other
than by pthread_setschedprio() is a running thread or is runnable,
it then becomes the tail of the thread list for its new priority.
8. If a thread whose policy or priority has been modified by
pthread_setschedprio() is a running thread or is runnable, the
effect on its position in the thread list depends on the direction
of the modification, as follows:
1. If the priority is raised, the thread becomes the tail of
the thread list.
2. If the priority is unchanged, the thread does not change
position in the thread list.
3. If the priority is lowered, the thread becomes the head
of the thread list.
9. When a running thread issues the sched_yield() function, the
thread becomes the tail of the thread list for its priority.
...."
Also, regarding SCHED_RR:
"...This policy shall be identical to the SCHED_FIFO policy with the
additional condition that when the implementation detects that a
running thread has been executing as a running thread for a time
period of the length returned by the sched_rr_get_interval() function
or longer, the thread shall become the tail of its thread list and
the head of that thread list shall be removed and made a running
thread......"
I'm not suggesting Linux HAS to comply with these requirements,
but its worth consideration.
--
Richard Seaman, Jr. email: [email protected]
5182 N. Maple Lane phone: 262-367-5450
Nashotah WI 53058 fax: 262-367-5852
And my patch *MAKES* it compliant with these definitions. The scheduler
was *NOT* compliant with these definitions.
You've quoted me out of context below. My statement that you quote
applies to SCHED_OTHER processes.
Please see my original post with the patch. And thanks for reinforcing
what I was saying!
- Bhavesh
Richard Seaman, Jr. wrote:
> On Thu, Jun 13, 2002 at 03:14:53PM -0600, Bhavesh P. Davda wrote:
>
>
>>I would think that the logical place to add any process to the runqueue
>>would be the back of the runqueue. If all processes are ALWAYS added to
>>the back of the runqueue, then every process is GUARANTEED to eventually
>>be scheduled. No process will be starved indefinitely.
>
>
> FYI, from SuSv3:
>
> "Under the SCHED_FIFO policy, the modification of the definitional
> thread lists is as follows:
>
> 1. When a running thread becomes a preempted thread, it becomes
> the head of the thread list for its priority.
>
> 2. When a blocked thread becomes a runnable thread, it becomes
> the tail of the thread list for its priority.
>
> ....
>
> 7. If a thread whose policy or priority has been modified other
> than by pthread_setschedprio() is a running thread or is runnable,
> it then becomes the tail of the thread list for its new priority.
>
> 8. If a thread whose policy or priority has been modified by
> pthread_setschedprio() is a running thread or is runnable, the
> effect on its position in the thread list depends on the direction
> of the modification, as follows:
>
> 1. If the priority is raised, the thread becomes the tail of
> the thread list.
> 2. If the priority is unchanged, the thread does not change
> position in the thread list.
> 3. If the priority is lowered, the thread becomes the head
> of the thread list.
>
> 9. When a running thread issues the sched_yield() function, the
> thread becomes the tail of the thread list for its priority.
>
> ...."
>
> Also, regarding SCHED_RR:
>
> "...This policy shall be identical to the SCHED_FIFO policy with the
> additional condition that when the implementation detects that a
> running thread has been executing as a running thread for a time
> period of the length returned by the sched_rr_get_interval() function
> or longer, the thread shall become the tail of its thread list and
> the head of that thread list shall be removed and made a running
> thread......"
>
> I'm not suggesting Linux HAS to comply with these requirements,
> but its worth consideration.
>
--
Bhavesh P. Davda
Avaya Inc
Room B3-B03 E-mail : [email protected]
1300 West 120th Avenue Phone : (303) 538-4438
Westminster, CO 80234 Fax : (303) 538-3155