the sched-2.5.24-D3 patch is my current scheduler tree against 2.5.24,
which also includes the latest version of SCHED_BATCH support:
http://redhat.com/~mingo/O(1)-scheduler/sched-2.5.24-D3
the setbatch utility can be found at:
http://redhat.com/~mingo/O(1)-scheduler/setbatch.c
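(for reference, a setbatch-style utility boils down to a single
sched_setscheduler() call. The sketch below is an illustration, not the
actual setbatch.c; in particular the numeric SCHED_BATCH value is whatever
the patched headers define - the fallback used here is only an assumption.)

#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

#ifndef SCHED_BATCH
#define SCHED_BATCH 3	/* assumed value - check the patched <linux/sched.h> */
#endif

int main(int argc, char **argv)
{
	struct sched_param param = { .sched_priority = 0 };
	pid_t pid;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	pid = atoi(argv[1]);

	/* move an already-running process into the batch class */
	if (sched_setscheduler(pid, SCHED_BATCH, &param) < 0) {
		perror("sched_setscheduler");
		return 1;
	}
	return 0;
}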
Changes relative to the previous SCHED_BATCH patch:
- fix signal delivery - call the 'kick batch processes' code on UP as
well.
- simplify and speed up the batch queue handling code: the expired/active
  queues are merged into a single queue. If a SCHED_BATCH process uses up
  all its timeslices then it is queued to the tail of the batch-queue -
  otherwise it's queued to the head of the batch-queue. This simplifies
  the load-balancer as well (see the sketch after this list).
- add 'default context-switch locking' if prepare_arch_schedule() is not
defined. The majority of architectures thus do not have to define the
context-switch locking macros.
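a sketch of the single-queue requeue rule from the second item above
(helper and field names are illustrative, not the patch's exact code):

/* illustrative sketch only */
static void requeue_batch_task(task_t *p, runqueue_t *rq)
{
	if (!p->time_slice)
		/* used up its timeslices: go to the back of the line */
		list_add_tail(&p->run_list, &rq->batch_queue);
	else
		/* timeslice left: run again from the head */
		list_add(&p->run_list, &rq->batch_queue);
}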
bug reports, success reports, comments, suggestions are welcome,
Ingo
Hello.
I believe this patch against entry.S should be sufficient:
--- entry.S~ Sat Jul 6 21:01:16 2002
+++ entry.S Sat Jul 6 21:06:14 2002
@@ -255,7 +255,7 @@
testb $_TIF_NEED_RESCHED, %cl
jz work_notifysig
work_resched:
- call schedule
+ call schedule_userspace
cli			# make sure we don't miss an interrupt
			# setting need_resched or sigpending
			# between sampling and the iret
Both calls to schedule() at resume_kernel: and work_pending:
have clear kernel/user return paths.
And users of __KERNEL_SYSCALLS__ and kernel_thread() should not
have policy == SCHED_BATCH.
No?
On Sat, 6 Jul 2002 [email protected] wrote:
> Hello.
>
> I believe this patch against entry.S should be sufficient:
>
> --- entry.S~ Sat Jul 6 21:01:16 2002
> +++ entry.S Sat Jul 6 21:06:14 2002
> @@ -255,7 +255,7 @@
> testb $_TIF_NEED_RESCHED, %cl
> jz work_notifysig
> work_resched:
> - call schedule
> + call schedule_userspace
> cli			# make sure we don't miss an interrupt
> 			# setting need_resched or sigpending
> 			# between sampling and the iret
>
> Both calls to schedule() at resume_kernel: and work_pending:
> have clear kernel/user return paths.
agreed, good catch. This greatly simplifies things.
> And users of __KERNEL_SYSCALLS__ and kernel_thread() should not
> have policy == SCHED_BATCH.
yep. And even if they do they should be aware of the consequences.
Ingo
> > And users of __KERNEL_SYSCALLS__ and kernel_thread() should not
> > have policy == SCHED_BATCH.
>
> yep. And even if they do they should be aware of the consequences.
well, there's one security consequence here - module loading
(request_module()) spawns a kernel thread, and that thread must not run as
SCHED_BATCH. I think the right solution for that path is to set the policy
to SCHED_OTHER upon entry, and restore it to the previous one afterwards -
this way the helper thread has SCHED_OTHER priority.
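something along these lines, purely as a sketch of the suggestion (the
do_request_module() helper is a stand-in for the existing request_module()
body, not a real function):

int request_module(const char *name)
{
	unsigned long saved_policy = current->policy;
	int ret;

	current->policy = SCHED_OTHER;	/* helper thread inherits this  */
	ret = do_request_module(name);	/* spawn + waitpid() the helper */
	current->policy = saved_policy;	/* restore the caller's policy  */

	return ret;
}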
Ingo
Hi Ingo,
I looked through your sched-2.5.25-A5 patch, and I'm confused by the
idle_count array. It calculates the idle average of the last 9 seconds -
but why not just use a weighted average? A weighted average is going to be
very close to the true average, and where it differs, the weighted average
should be preferable.
Incremental patch (untested):
--- kernel/sched.c Tue Jul 9 22:06:38 2002
+++ ../linux-2.5.25-a/kernel/sched.c Tue Jul 9 22:23:41 2002
@@ -171,7 +171,7 @@
#define IDLE_TICKS (HZ)
int idle_ticks_left;
- int idle_count[IDLE_SLOTS];
+ int idle_count;
int idle_avg;
} ____cacheline_aligned;
@@ -886,17 +886,6 @@
#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
-static inline int recalc_idle_avg(runqueue_t *rq)
-{
- int i, count = 0, avg;
-
- for (i = 1; i < IDLE_SLOTS; i++)
- count += rq->idle_count[i];
-
- avg = count / (IDLE_SLOTS - 1);
- return avg;
-}
-
static inline void idle_tick(runqueue_t *rq)
{
if (jiffies % IDLE_REBALANCE_TICK)
@@ -938,17 +927,13 @@
* This code is rare, triggered only once per second:
*/
if (--rq->idle_ticks_left <= 0) {
- int i;
-
- rq->idle_ticks_left = IDLE_TICKS;
- for (i = IDLE_SLOTS-1; i > 0; i--)
- rq->idle_count[i] = rq->idle_count[i-1];
- rq->idle_count[0] = 0;
- rq->idle_avg = recalc_idle_avg(rq);
+ rq->idle_avg = (rq->idle_avg * (IDLE_SLOTS - 1)
+ + rq->idle_count) / IDLE_SLOTS;
+ rq->idle_count = 0;
}
}
if (p == rq->idle || p->policy == SCHED_BATCH)
- rq->idle_count[0]++;
+ rq->idle_count++;
#endif
if (p == rq->idle) {
if (local_bh_count(cpu) || local_irq_count(cpu) > 1)
--
------------------------------------------------------------------------
| Kevin O'Connor "BTW, IMHO we need a FAQ for |
| [email protected] 'IMHO', 'FAQ', 'BTW', etc. !" |
------------------------------------------------------------------------
On Tue, 9 Jul 2002, Kevin O'Connor wrote:
> I looked through your sched-2.5.25-A5 patch, and I'm confused by the
> idle_count array. It calculates the idle average of the last 9 seconds -
> but why not just use a weighted average? A weighted average is going to
> be very close to the true average, and where it differs, the weighted
> average should be preferable.
i agree, the hybrid weighted average you suggest is the right solution
here, because the sampling in that case has a fixed frequency which is
HZ-independent. I've applied your patch to my tree.
the problem with a pure weighted average (ie. no ->idle_count, just a
weighted average calculated in the scheduler tick) is that with HZ=1000
and a 32-bit word length the sampling gets too inaccurate. For the average
to be meaningful it needs to be at least 'a few seconds worth' - which is
'a few thousands of events' - the rounding errors are pretty severe in
that case.
(a good example of where a running average has a fundamental accuracy
problem is the ->sleep_avg sampling. The frequency of wakeup/sleep events
can be almost arbitrarily high, destroying the accuracy of a weighted
average.)
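to make the rounding point concrete, a small standalone userspace demo
(illustrative only, not kernel code; the 5000-tick window and the 0..1000
idle-sample scale are assumptions for the example). A CPU that is idle 1
tick in 10 never registers as idle at all, because the per-tick division
rounds the sparse samples away:

#include <stdio.h>

int main(void)
{
	const unsigned int N = 5000;		/* "a few seconds worth" of ticks at HZ=1000 */
	const unsigned int SCALE = 1000;	/* value of one idle-tick sample */
	unsigned int avg = 0;
	unsigned int tick;

	/* a CPU that is idle 1 tick in 10 - the true average is SCALE/10 = 100 */
	for (tick = 0; tick < 100000; tick++) {
		unsigned int sample = (tick % 10 == 0) ? SCALE : 0;

		avg = (avg * (N - 1) + sample) / N;	/* 32-bit integer math */
	}
	printf("per-tick running average: %u (true value ~100)\n", avg);
	return 0;
}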
Ingo
On Tue, 9 Jul 2002, Kevin O'Connor wrote:
> - rq->idle_ticks_left = IDLE_TICKS;
> - for (i = IDLE_SLOTS-1; i > 0; i--)
> - rq->idle_count[i] = rq->idle_count[i-1];
> - rq->idle_count[0] = 0;
> - rq->idle_avg = recalc_idle_avg(rq);
> + rq->idle_avg = (rq->idle_avg * (IDLE_SLOTS - 1)
> + + rq->idle_count) / IDLE_SLOTS;
> + rq->idle_count = 0;
this part is buggy: ->idle_ticks_left needs to be reset in this branch.
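i.e. roughly this (a sketch of the fix, not the exact hunk):

	if (--rq->idle_ticks_left <= 0) {
		rq->idle_ticks_left = IDLE_TICKS;
		rq->idle_avg = (rq->idle_avg * (IDLE_SLOTS - 1)
				+ rq->idle_count) / IDLE_SLOTS;
		rq->idle_count = 0;
	}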
Ingo
you can find my latest scheduler tree, against 2.5.25-vanilla, at:
http://redhat.com/~mingo/O(1)-scheduler/sched-2.5.25-A7
note that in addition to the bugfix i've further simplified the
idle-average calculation - a simple, weightless running average is
more than enough.
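(the 'weightless' form is just the once-per-second fold, visible in the
removed lines of Kevin's follow-up patch below:)

	/* once per second: fold in the last second's idle-tick count,
	 * giving the newest second the same weight as all prior history */
	rq->idle_avg += rq->idle_count;
	rq->idle_avg >>= 1;
	rq->idle_count = 0;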
Ingo
> > > And users of __KERNEL_SYSCALLS__ and kernel_thread() should not
> > > have policy == SCHED_BATCH.
>
> well, there's one security consequence here - module loading
> (request_module()) spawns a kernel thread, and that thread must not run
> as SCHED_BATCH. I think the right solution for that path is to set the
> policy to SCHED_OTHER upon entry, and restore it to the previous one
> afterwards - this way the helper thread has SCHED_OTHER priority.
i've solved this problem by making kernel_thread() spawned threads drop
back to SCHED_NORMAL:
http://redhat.com/~mingo/O(1)-scheduler/sched-2.5.25-A7
I believe this is the secure way of doing it - independently of
SCHED_BATCH - an RT task should not spawn an RT kernel thread 'unwillingly'.
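(the idea, as a sketch only and not the exact A7 code: in the child side
of kernel_thread(), drop any inherited RT or batch policy back to the
default class:)

	if (current->policy != SCHED_NORMAL) {
		current->policy = SCHED_NORMAL;
		current->rt_priority = 0;
	}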
Ingo
Hello.
> > > > And users of __KERNEL_SYSCALLS__ and kernel_thread() should not
> > > > have policy == SCHED_BATCH.
> >
> > well, there's one security consequence here - module loading
> > (request_module()) spawns a kernel thread, and that thread must not run
> > as SCHED_BATCH. I think the right solution for that path is to set the
> > policy to SCHED_OTHER upon entry, and restore it to the previous one
> > afterwards - this way the helper thread has SCHED_OTHER priority.
>
> i've solved this problem by making kernel_thread() spawned threads drop
> back to SCHED_NORMAL:
Note that request_module() also does waitpid(). So it's better to change
the policy upon entry, as you suggested.
> I believe this is the secure way of doing it - independently of
> SCHED_BATCH - an RT task should not spawn an RT kernel thread 'unwillingly'.
Yes, but this semantic change should be ported to all archs, independently
of the low-level micro-optimizations, for consistency. Rename all
definitions to arch_kernel_thread()?
Btw, how about this tiny bit of cleanup:
asmlinkage void schedule_userspace(void)
{
	/*
	 * Only handle batch tasks that are runnable.
	 */
	if (current->policy == SCHED_BATCH &&
	    current->state == TASK_RUNNING) {
		runqueue_t *rq = this_rq_lock();
		deactivate_batch_task(current, rq);
		// we can keep irqs disabled:
		spin_unlock(&rq->lock);
	}
	schedule();
}
Oleg.
On Thu, Jul 11, 2002 at 08:27:13AM +0200, Ingo Molnar wrote:
>
> the problem with a pure weighted average (ie. no ->idle_count, just a
> weighted average calculated in the scheduler tick) is that with HZ=1000
> and a 32-bit word length the sampling gets too inaccurate. For the average
> to be meaningful it needs to be at least 'a few seconds worth' - which is
> 'a few thousands of events' - the rounding errors are pretty severe in
> that case.
Yes, I see where truncation could be a problem. However, one should be
able to offset this with a larger base unit and with appropriate rounding.
On Thu, Jul 11, 2002 at 09:05:35AM +0200, Ingo Molnar wrote:
>
> you can find my latest scheduler tree, against 2.5.25-vanilla, at:
>
> http://redhat.com/~mingo/O(1)-scheduler/sched-2.5.25-A7
>
> note that in addition to the bugfix i've further simplified the
> idle-average calculation - a simple, weightless running average is
> more than enough.
I looked through the A8 code, and I see that it is using a weighted average
with the last second having as much "weight" as all previous seconds. This
is understandable (as a second is an eternity to a computer), but I wonder
how this interacts with the idle_ticks_left calculation.
Since idle_avg is only updated once a second, it will typically be about
half a second out of date whenever it is queried. And with the short
effective window of the idle_avg calculation, idle_avg only describes the
last 2 to 3 seconds of CPU activity. So the statistic covers roughly two
seconds and is half a second stale - I wonder how useful it is.
Just for kicks, I wrote up a patch with a pure weighted average
calculation. Again, I wasn't able to test this on a real kernel (I don't
have a development box I can risk 2.5 on yet). However, I did run a bunch
of data through the two different algorithms. The "hybrid" code matches
the pure weighted average code right after the hybrid code updates
idle_avg, but the pure weighted average responds more quickly to a
changing environment. YMMV.
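(as an illustration of that last point - not my actual test data - the
following standalone program simulates a CPU that jumps from 0% to 100%
idle at t=5s and prints how the once-per-second "hybrid" fold and the
per-tick weighted average of the patch below track the change; HZ=1000 and
the 0..100000 scale follow the patch, everything else is assumed:)

#include <stdio.h>

#define HZ		1000
#define IDLE_TICKS	HZ

int main(void)
{
	/* A8-style hybrid: count idle ticks, fold them in once per second */
	unsigned int hybrid_avg = 0, hybrid_count = 0, ticks_left = IDLE_TICKS;
	/* per-tick weighted average on the patch's 0..100000 scale */
	unsigned int pure_avg = 0;
	unsigned int tick;

	for (tick = 1; tick <= 10 * HZ; tick++) {
		unsigned int idle = (tick > 5 * HZ);	/* 0% idle, then 100% idle */

		hybrid_count += idle;
		if (--ticks_left == 0) {
			hybrid_avg += hybrid_count;
			hybrid_avg >>= 1;
			hybrid_count = 0;
			ticks_left = IDLE_TICKS;
		}

		pure_avg = (pure_avg * (IDLE_TICKS - 1)
			    + (idle ? 100000 : 0) + IDLE_TICKS/2) / IDLE_TICKS;

		if (tick % HZ == 0)
			printf("t=%2us  hybrid=%3u%%  pure=%3u%%\n", tick / HZ,
			       hybrid_avg * 100 / HZ, pure_avg / 1000);
	}
	return 0;
}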
-Kevin
--- linux-2.5.25-b/kernel/sched.c Wed Jul 10 23:55:47 2002
+++ linux-2.5.25-a/kernel/sched.c Thu Jul 11 00:33:22 2002
@@ -166,14 +166,10 @@
/*
* Per-CPU idle CPU time tracking:
- *
- * - idle_ticks_left counts back from HZ to 0.
- * - idle_count is the number of idle ticks in the last second.
- * - once it reaches 0, a new idle_avg is calculated.
*/
#define IDLE_TICKS (HZ)
- unsigned int idle_ticks_left, idle_count, idle_avg;
+ unsigned int idle_avg;
} ____cacheline_aligned;
@@ -926,21 +922,14 @@
#if CONFIG_SMP
if (user_ticks || sys_ticks) {
/*
- * This code is rare, triggered only once per second:
+ * Maintain a weighted average between 0 and 100000:
*/
- if (--rq->idle_ticks_left <= 0) {
- /*
- * Maintain a simple running average:
- */
- rq->idle_avg += rq->idle_count;
- rq->idle_avg >>= 1;
-
- rq->idle_ticks_left = IDLE_TICKS;
- rq->idle_count = 0;
- }
+ int idle_count = 0;
+ if (p == rq->idle || p->policy == SCHED_BATCH)
+ idle_count = 100000;
+ rq->idle_avg = (rq->idle_avg * (IDLE_TICKS - 1)
+ + idle_count + IDLE_TICKS/2) / IDLE_TICKS;
}
- if (p == rq->idle || p->policy == SCHED_BATCH)
- rq->idle_count++;
#endif
if (p == rq->idle) {
if (local_bh_count(cpu) || local_irq_count(cpu) > 1)
@@ -2085,7 +2074,6 @@
spin_lock_init(&rq->lock);
INIT_LIST_HEAD(&rq->migration_queue);
INIT_LIST_HEAD(&rq->batch_queue);
- rq->idle_ticks_left = IDLE_TICKS;
for (j = 0; j < 2; j++) {
array = rq->arrays + j;
--
------------------------------------------------------------------------
| Kevin O'Connor "BTW, IMHO we need a FAQ for |
| [email protected] 'IMHO', 'FAQ', 'BTW', etc. !" |
------------------------------------------------------------------------
Hi!
> > > > And users of __KERNEL_SYSCALLS__ and kernel_thread() should not
> > > > have policy == SCHED_BATCH.
> >
> > well, there's one security consequence here - module loading
> > (request_module()) spawns a kernel thread, and that thread must not run
> > as SCHED_BATCH. I think the right solution for that path is to set the
> > policy to SCHED_OTHER upon entry, and restore it to the previous one
> > afterwards - this way the helper thread has SCHED_OTHER priority.
>
> i've solved this problem by making kernel_thread() spawned threads drop
> back to SCHED_NORMAL:
Does it mean that we now have a working scheduler class that only
schedules jobs when no other thread wants to run (as long as the
SCHED_BATCH task does not enter the kernel)?
Pavel
--
Worst form of spam? Adding advertisement signatures a la sourceforge.net.
What goes next? Inserting advertisements *into* email?
On Sun, 2002-07-14 at 05:29, Pavel Machek wrote:
> Does it mean that we now have a working scheduler class that only
> schedules jobs when no other thread wants to run (as long as the
> SCHED_BATCH task does not enter the kernel)?
Yep.
Now, is the implementation OK?
Robert Love
On Sun, 14 Jul 2002, Pavel Machek wrote:
> Does it mean that we now have a working scheduler class that only
> schedules jobs when no other thread wants to run (as long as the
> SCHED_BATCH task does not enter the kernel)?
yes, that's exactly the point of SCHED_BATCH. Furthermore, it also gives
correct timeslicing of SCHED_BATCH tasks among each other - so it's not
equivalent to an 'infinitely low SCHED_NORMAL priority', rather it's an
additional level in the hierarchy of scheduling classes. Plus it uses
longer timeslices.
Ingo