2002-07-01 09:45:50

by Ingo Molnar

Subject: [patch] sched-2.5.24-D3, batch/idle priority scheduling, SCHED_BATCH


the sched-2.5.24-D3 patch is my current scheduler tree against 2.5.24,
which also includes the latest version of SCHED_BATCH support:

http://redhat.com/~mingo/O(1)-scheduler/sched-2.5.24-D3

the setbatch utility can be found at:

http://redhat.com/~mingo/O(1)-scheduler/setbatch.c
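
For reference, here is a minimal sketch of what such a utility would
roughly do - this is not the actual setbatch.c above, it merely assumes
that the patch exports SCHED_BATCH as a policy value accepted by
sched_setscheduler(); the numeric fallback value below is a placeholder:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sched.h>

#ifndef SCHED_BATCH
#define SCHED_BATCH 3   /* placeholder - the real value comes from the patched headers */
#endif

int main(int argc, char **argv)
{
        struct sched_param param = { 0 };       /* batch tasks carry no RT priority */
        pid_t pid;

        if (argc != 2) {
                fprintf(stderr, "usage: setbatch <pid>\n");
                return 1;
        }
        pid = atoi(argv[1]);

        if (sched_setscheduler(pid, SCHED_BATCH, &param) < 0) {
                perror("sched_setscheduler");
                return 1;
        }
        return 0;
}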

Changes relative to the previous SCHED_BATCH patch:

- fix signal delivery - call the 'kick batch processes' code on UP as
well.

- simplify and speed up the batch queue handling code: the expired/active
queues are merged into a single queue. If a SCHED_BATCH process uses up
all its timeslices then it is queued to the tail of the batch-queue -
otherwise it's queued to the head of the batch-queue. This simplifies
the load-balancer as well (see the sketch after this list).

- add 'default context-switch locking' if prepare_arch_schedule() is not
defined. The majority of architectures thus do not have to define the
context-switch locking macros.
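
As a rough illustration of the enqueueing rule above (this is only a
sketch, not the code from the patch; it assumes the O(1)-scheduler names
task_t, ->time_slice, ->run_list and the rq->batch_queue list):

static void enqueue_batch_task(task_t *p, runqueue_t *rq)
{
        if (!p->time_slice)
                /* used up all timeslices: wait behind every other batch task */
                list_add_tail(&p->run_list, &rq->batch_queue);
        else
                /* timeslices left: requeue at the head of the batch-queue */
                list_add(&p->run_list, &rq->batch_queue);
}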

bug reports, success reports, comments, suggestions are welcome,

Ingo


2002-07-06 18:13:35

by Oleg Nesterov

Subject: Re: [patch] sched-2.5.24-D3, batch/idle priority scheduling, SCHED_BATCH

Hello.

I believe this patch against entry.S should be sufficient:

--- entry.S~ Sat Jul 6 21:01:16 2002
+++ entry.S Sat Jul 6 21:06:14 2002
@@ -255,7 +255,7 @@
testb $_TIF_NEED_RESCHED, %cl
jz work_notifysig
work_resched:
- call schedule
+ call schedule_userspace
cli                     # make sure we don't miss an interrupt
                        # setting need_resched or sigpending
                        # between sampling and the iret

Both calls to schedule() at resume_kernel: and work_pending:
have a clear kernel/user return path.

And users of __KERNEL_SYSCALLS__ and kernel_thread() should not
have policy == SCHED_BATCH.

No?

2002-07-09 19:37:35

by Ingo Molnar

Subject: Re: [patch] sched-2.5.24-D3, batch/idle priority scheduling, SCHED_BATCH


On Sat, 6 Jul 2002, Oleg Nesterov wrote:

> Hello.
>
> I believe this patch against entry.S should be sufficient:
>
> --- entry.S~ Sat Jul 6 21:01:16 2002
> +++ entry.S Sat Jul 6 21:06:14 2002
> @@ -255,7 +255,7 @@
> testb $_TIF_NEED_RESCHED, %cl
> jz work_notifysig
> work_resched:
> - call schedule
> + call schedule_userspace
> cli                     # make sure we don't miss an interrupt
>                         # setting need_resched or sigpending
>                         # between sampling and the iret
>
> Both calls to schedule() at resume_kernel: and work_pending:
> have a clear kernel/user return path.

agreed, good catch. This greatly simplifies things.

> And users of __KERNEL_SYSCALLS__ and kernel_thread() should not
> have policy == SCHED_BATCH.

yep. And even if they do, they should be aware of the consequences.

Ingo

2002-07-09 19:47:05

by Ingo Molnar

Subject: Re: [patch] sched-2.5.24-D3, batch/idle priority scheduling, SCHED_BATCH


> > And users of __KERNEL_SYSCALLS__ and kernel_thread() should not
> > have policy == SCHED_BATCH.
>
> yep. And even if they do, they should be aware of the consequences.

well, there's one security consequence here - module loading
(request_module()), which spawns a kernel thread, must not run as
SCHED_BATCH. I think the right solution for that path is to set the policy
to SCHED_OTHER upon entry, and restore it to the previous one afterwards -
this way the helper thread has SCHED_OTHER priority.
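
A rough sketch of that approach (illustrative only, not the actual kmod.c
change; do_request_module() here stands in for the existing
exec_modprobe()/waitpid() path, and poking current->policy directly is a
simplification of what a real patch would do):

int request_module(const char *name)
{
        unsigned long saved_policy = current->policy;
        int ret;

        /* the modprobe helper must not inherit SCHED_BATCH from the caller */
        current->policy = SCHED_OTHER;

        ret = do_request_module(name);  /* spawn the helper thread and wait */

        /* restore whatever policy the caller had */
        current->policy = saved_policy;
        return ret;
}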

Ingo

2002-07-10 02:27:23

by Kevin O'Connor

Subject: O(1) batch scheduler

Hi Ingo,

I looked through your sched-2.5.25-A5 patch, and I'm confused by the
idle_count array. It calculates the idle average of the last 9 seconds -
but why not just use a weighted average? A weighted average is going to be
very close to the true average, and where it differs the weighted average
should be preferable.

Incremental patch (untested):

--- kernel/sched.c Tue Jul 9 22:06:38 2002
+++ ../linux-2.5.25-a/kernel/sched.c Tue Jul 9 22:23:41 2002
@@ -171,7 +171,7 @@
#define IDLE_TICKS (HZ)

int idle_ticks_left;
- int idle_count[IDLE_SLOTS];
+ int idle_count;
int idle_avg;

} ____cacheline_aligned;
@@ -886,17 +886,6 @@
#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)

-static inline int recalc_idle_avg(runqueue_t *rq)
-{
- int i, count = 0, avg;
-
- for (i = 1; i < IDLE_SLOTS; i++)
- count += rq->idle_count[i];
-
- avg = count / (IDLE_SLOTS - 1);
- return avg;
-}
-
static inline void idle_tick(runqueue_t *rq)
{
if (jiffies % IDLE_REBALANCE_TICK)
@@ -938,17 +927,13 @@
* This code is rare, triggered only once per second:
*/
if (--rq->idle_ticks_left <= 0) {
- int i;
-
- rq->idle_ticks_left = IDLE_TICKS;
- for (i = IDLE_SLOTS-1; i > 0; i--)
- rq->idle_count[i] = rq->idle_count[i-1];
- rq->idle_count[0] = 0;
- rq->idle_avg = recalc_idle_avg(rq);
+ rq->idle_avg = (rq->idle_avg * (IDLE_SLOTS - 1)
+ + rq->idle_count) / IDLE_SLOTS;
+ rq->idle_count = 0;
}
}
if (p == rq->idle || p->policy == SCHED_BATCH)
- rq->idle_count[0]++;
+ rq->idle_count++;
#endif
if (p == rq->idle) {
if (local_bh_count(cpu) || local_irq_count(cpu) > 1)

--
------------------------------------------------------------------------
| Kevin O'Connor "BTW, IMHO we need a FAQ for |
| [email protected] 'IMHO', 'FAQ', 'BTW', etc. !" |
------------------------------------------------------------------------

2002-07-10 06:25:50

by Ingo Molnar

Subject: Re: O(1) batch scheduler


On Tue, 9 Jul 2002, Kevin O'Connor wrote:

> I looked through your sched-2.5.25-A5 patch, and I'm confused by the
> idle_count array. It calculates the idle average of the last 9 seconds -
> but why not just use a weighted average? A weighted average is going to
> be very close to the true average, and where it differs the weighted
> average should be preferable.

i agree, the hybrid weighted average you suggest is the right solution
here, because the sampling in that case has a fixed frequency which is
HZ-independent. I've applied your patch to my tree.

the problem with a pure weighted average (ie. no ->idle_count, just a
weighted average calculated in the scheduler tick) is that with HZ=1000
and a 32-bit word length the sampling gets too inaccurate. For the average
to be meaningful it needs to be at least 'a few seconds worth' - which is
'a few thousands of events' - the rounding errors are pretty severe in
that case.

(a good example where a running average has a fundamental accuracy problem
is the ->sleep_avg sampling. The frequency of wakeups/sleep events can be
almost arbitrarily high, destroying the accuracy of a weighted average.)
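
To see the rounding problem in isolation, here is a small stand-alone
user-space demo (not kernel code): with 0/1 per-tick samples the unscaled
weighted average never moves due to integer truncation, while scaling the
sample up and rounding - the kind of larger base unit that comes up later
in this thread - behaves as expected:

#include <stdio.h>

#define TICKS 1000      /* roughly HZ */

int main(void)
{
        unsigned int avg = 0, scaled_avg = 0;
        int t;

        /* 10 "seconds" of a fully idle CPU, sampled once per tick */
        for (t = 0; t < 10 * TICKS; t++) {
                int idle = 1;

                /* unscaled: (0 * 999 + 1) / 1000 truncates back to 0 forever */
                avg = (avg * (TICKS - 1) + idle) / TICKS;

                /* scaled to 0..100000 with rounding: converges towards 100000 */
                scaled_avg = (scaled_avg * (TICKS - 1)
                              + idle * 100000 + TICKS / 2) / TICKS;
        }
        printf("unscaled: %u  scaled: %u\n", avg, scaled_avg);
        return 0;
}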

Ingo

2002-07-10 06:31:06

by Ingo Molnar

Subject: Re: O(1) batch scheduler


On Tue, 9 Jul 2002, Kevin O'Connor wrote:

> - rq->idle_ticks_left = IDLE_TICKS;
> - for (i = IDLE_SLOTS-1; i > 0; i--)
> - rq->idle_count[i] = rq->idle_count[i-1];
> - rq->idle_count[0] = 0;
> - rq->idle_avg = recalc_idle_avg(rq);
> + rq->idle_avg = (rq->idle_avg * (IDLE_SLOTS - 1)
> + + rq->idle_count) / IDLE_SLOTS;
> + rq->idle_count = 0;

this part is buggy: ->idle_ticks_left needs to be reset in this branch.

Ingo

2002-07-10 07:04:17

by Ingo Molnar

Subject: [patch] Re: O(1) batch scheduler


you can find my latest scheduler tree, against 2.5.25-vanilla, at:

http://redhat.com/~mingo/O(1)-scheduler/sched-2.5.25-A7

note that in addition to the bugfix i've further simplified the
idle-average calculation - a simple, weightless running average is
more than enough.

Ingo

2002-07-10 07:14:11

by Ingo Molnar

Subject: Re: [patch] sched-2.5.24-D3, batch/idle priority scheduling, SCHED_BATCH


> > > And users of __KERNEL_SYSCALLS__ and kernel_thread() should not
> > > have policy == SCHED_BATCH.
>
> well, there's one security consequence here - module loading
> (request_module()), which spawns a kernel thread must not run as
> SCHED_BATCH. I think the right solution for that path is to set the
> policy to SCHED_OTHER upon entry, and restore it to the previous one
> afterwards - this way the helper thread has SCHED_OTHER priority.

i've solved this problem by making kernel_thread() spawned threads drop
back to SCHED_NORMAL:

http://redhat.com/~mingo/O(1)-scheduler/sched-2.5.25-A7

I believe this is the secure way of doing it - independently of
SCHED_BATCH - an RT task should not spawn an RT kernel thread 'unwillingly'.
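
A rough sketch of the idea (illustrative only - not necessarily how the
sched-2.5.25-A7 patch does it, and the helper name is made up; the fields
are the ones the O(1) scheduler is assumed to use):

static void reset_spawned_thread_policy(task_t *p)
{
        p->policy = SCHED_NORMAL;       /* no RT or batch inheritance */
        p->rt_priority = 0;
        p->prio = p->static_prio;       /* back to the static, nice-based priority */
}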

Ingo

2002-07-10 22:28:51

by Oleg Nesterov

Subject: Re: [patch] sched-2.5.24-D3, batch/idle priority scheduling, SCHED_BATCH

Hello.

> > > > And users of __KERNEL_SYSCALLS__ and kernel_thread() should not
> > > > have policy == SCHED_BATCH.
> >
> > well, there's one security consequence here - module loading
> > (request_module()), which spawns a kernel thread must not run as
> > SCHED_BATCH. I think the right solution for that path is to set the
> > policy to SCHED_OTHER upon entry, and restore it to the previous one
> > afterwards - this way the helper thread has SCHED_OTHER priority.
>
> i've solved this problem by making kernel_thread() spawned threads drop
> back to SCHED_NORMAL:

Note that request_module() also does waitpid(). So it's better to change
the policy upon entry, as you suggested.

> I believe this is the secure way of doing it - independently of
> SCHED_BATCH - a RT task should not spawn a RT kernel thread 'unwillingly'.

Yes, but this semantic change should be ported to all archs, independently
of low-level micro-optimizations, for consistency. Rename all definitions
to arch_kernel_thread()?

Btw, how about this tiny bit of cleanup:

asmlinkage void schedule_userspace(void)
{
        /*
         * Only handle batch tasks that are runnable.
         */
        if (current->policy == SCHED_BATCH &&
            current->state == TASK_RUNNING) {
                runqueue_t *rq = this_rq_lock();
                deactivate_batch_task(current, rq);

                // we can keep irqs disabled:
                spin_unlock(&rq->lock);
        }

        schedule();
}

Oleg.

2002-07-11 04:38:51

by Kevin O'Connor

Subject: Re: [patch] Re: O(1) batch scheduler

On Thu, Jul 11, 2002 at 08:27:13AM +0200, Ingo Molnar wrote:
>
> the problem with a pure weighted average (ie. no ->idle_count, just a
> weighted average calculated in the scheduler tick) is that with HZ=1000
> and a 32-bit word length the sampling gets too inaccurate. For the average
> to be meaningful it needs to be at least 'a few seconds worth' - which is
> 'a few thousands of events' - the rounding errors are pretty severe in
> that case.

Yes, I see where truncation could be a problem. However, one should be
able to offset this with a larger base unit and with appropriate rounding.

On Thu, Jul 11, 2002 at 09:05:35AM +0200, Ingo Molnar wrote:
>
> you can find my latest scheduler tree, against 2.5.25-vanilla, at:
>
> http://redhat.com/~mingo/O(1)-scheduler/sched-2.5.25-A7
>
> note that in addition to the bugfix i've further simplified the
> idle-average calculation - simple, weightless running average is
> more than enough.

I looked through the A8 code, and I see that it is using a weighted average
with the last second having as much "weight" as all previous seconds. This
is understandable (as a second is an eternity to a computer), but I wonder
how this interacts with the idle_ticks_left calculation.

Since idle_avg is only updated once a second, it will typically be about
half a second out of date whenever it is queried. And because the last
second carries as much weight as all earlier history, idle_avg effectively
describes only the last 2 to 3 seconds of CPU activity. So the statistic
covers roughly two seconds and lags by half a second - I wonder how useful
it really is.

Just for kicks, I wrote up a patch with a pure weighted average
calculation. Again, I wasn't able to test this on a real kernel (I don't
have a development box I can risk 2.5 on yet). However, I did run a bunch
of data through the two different algorithms. The "hybrid" code matches
the pure weighted average code right after the hybrid code updates
idle_avg, but the pure weighted average code responds more quickly to a
changing environment. YMMV.

-Kevin


--- linux-2.5.25-b/kernel/sched.c Wed Jul 10 23:55:47 2002
+++ linux-2.5.25-a/kernel/sched.c Thu Jul 11 00:33:22 2002
@@ -166,14 +166,10 @@

/*
* Per-CPU idle CPU time tracking:
- *
- * - idle_ticks_left counts back from HZ to 0.
- * - idle_count is the number of idle ticks in the last second.
- * - once it reaches 0, a new idle_avg is calculated.
*/
#define IDLE_TICKS (HZ)

- unsigned int idle_ticks_left, idle_count, idle_avg;
+ unsigned int idle_avg;

} ____cacheline_aligned;

@@ -926,21 +922,14 @@
#if CONFIG_SMP
if (user_ticks || sys_ticks) {
/*
- * This code is rare, triggered only once per second:
+ * Maintain a weighted average between 0 and 100000:
*/
- if (--rq->idle_ticks_left <= 0) {
- /*
- * Maintain a simple running average:
- */
- rq->idle_avg += rq->idle_count;
- rq->idle_avg >>= 1;
-
- rq->idle_ticks_left = IDLE_TICKS;
- rq->idle_count = 0;
- }
+ int idle_count = 0;
+ if (p == rq->idle || p->policy == SCHED_BATCH)
+ idle_count = 100000;
+ rq->idle_avg = (rq->idle_avg * (IDLE_TICKS - 1)
+ + idle_count + IDLE_TICKS/2) / IDLE_TICKS;
}
- if (p == rq->idle || p->policy == SCHED_BATCH)
- rq->idle_count++;
#endif
if (p == rq->idle) {
if (local_bh_count(cpu) || local_irq_count(cpu) > 1)
@@ -2085,7 +2074,6 @@
spin_lock_init(&rq->lock);
INIT_LIST_HEAD(&rq->migration_queue);
INIT_LIST_HEAD(&rq->batch_queue);
- rq->idle_ticks_left = IDLE_TICKS;

for (j = 0; j < 2; j++) {
array = rq->arrays + j;



--
------------------------------------------------------------------------
| Kevin O'Connor "BTW, IMHO we need a FAQ for |
| [email protected] 'IMHO', 'FAQ', 'BTW', etc. !" |
------------------------------------------------------------------------

2002-07-15 10:53:16

by Pavel Machek

Subject: Re: [patch] sched-2.5.24-D3, batch/idle priority scheduling, SCHED_BATCH

Hi!

> > > > And users of __KERNEL_SYSCALLS__ and kernel_thread() should not
> > > > have policy == SCHED_BATCH.
> >
> > well, there's one security consequence here - module loading
> > (request_module()), which spawns a kernel thread must not run as
> > SCHED_BATCH. I think the right solution for that path is to set the
> > policy to SCHED_OTHER upon entry, and restore it to the previous one
> > afterwards - this way the helper thread has SCHED_OTHER priority.
>
> i've solved this problem by making kernel_thread() spawned threads drop
> back to SCHED_NORMAL:

Does it mean that we now have a working scheduler class that only
schedules jobs when no other thread wants to run (as long as the
SCHED_BATCH task does not enter the kernel)?
Pavel
--
Worst form of spam? Adding advertisement signatures a la sourceforge.net.
What goes next? Inserting advertisement *into* email?

2002-07-15 16:28:34

by Robert Love

Subject: Re: [patch] sched-2.5.24-D3, batch/idle priority scheduling, SCHED_BATCH

On Sun, 2002-07-14 at 05:29, Pavel Machek wrote:

> Does it mean that we now have a working scheduler class that only
> schedules jobs when no other thread wants to run (as long as the
> SCHED_BATCH task does not enter the kernel)?

Yep.

Now, is the implementation OK?

Robert Love

2002-07-17 06:44:31

by Ingo Molnar

Subject: Re: [patch] sched-2.5.24-D3, batch/idle priority scheduling, SCHED_BATCH


On Sun, 14 Jul 2002, Pavel Machek wrote:

> Does it mean that we now have a working scheduler class that only
> schedules jobs when no other thread wants to run (as long as the
> SCHED_BATCH task does not enter the kernel)?

yes, that's exactly the point of SCHED_BATCH. Furthermore, it also provides
correct timeslicing of SCHED_BATCH tasks among each other - so it's not
equivalent to 'infinitely low SCHED_NORMAL priority', but is rather an
additional level in the hierarchy of scheduling classes. Plus it uses
longer timeslices.

Ingo