2003-07-25 19:44:49

by Ingo Molnar

Subject: [patch] sched-2.6.0-test1-G3, interactivity changes, audio latency


my current "interactivity changes" scheduler patchset can be found at:

redhat.com/~mingo/O(1)-scheduler/sched-2.6.0-test1-G3

(this patch is mostly orthogonal to Con's patchset, but obviously collides
patch-wise. The patch should also cleanly apply to 2.6.0-test1-bk2.)

Changes:

- cycle accuracy (nanosec resolution) timekeeping within the scheduler.
This fixes a number of audio artifacts (skipping) I've reproduced. I
don't think we can get away without going cycle accurate - reading the
cycle counter adds some overhead, but it's acceptable. The first
nanosec-accuracy patch was done by Mike Galbraith - this patch is
different but similar in nature. I went further in also changing the
sleep_avg to be of nanosec resolution. (A user-space sketch of the
underlying cycles->ns scaling follows this list.)

- more finegrained timeslices: there's now a timeslice 'sub unit' of 50
msecs (TIMESLICE_GRANULARITY) - CPU hogs on the same priority level
will roundrobin with this unit. This change is intended to make gaming
latencies shorter.

- include scheduling latency in sleep bonus calculation. This change
extends the sleep-average calculation to the period of time a task
spends on the runqueue but doesn't get scheduled yet, right after
wakeup. Note that tasks that were preempted (i.e. not woken up) and are
still on the runqueue do not get this benefit. This change closes one
of the last holes in the dynamic priority estimation; it should result
in interactive tasks getting more priority under heavy load. This
change also fixes the test-starve.c testcase from David Mosberger.

- (some other, smaller changes.)
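
For the curious, here is a minimal user-space sketch of the cycles -> ns
scaling this builds on. The fixed-point conversion mirrors cycles_2_ns()
from arch/i386/kernel/timers/timer_tsc.c, which the sched_clock() in the
patch calls; the 2.4 GHz cpu_khz value is just an example:

/*
 * Sketch of TSC cycles -> nanoseconds scaling.  The scale factor is
 * precomputed once from cpu_khz with 10 fractional bits, so the hot
 * path is a multiply and a shift.
 */
#include <stdio.h>

#define CYC2NS_SCALE_FACTOR 10	/* 2^10, ~1 ns resolution */

static unsigned long cyc2ns_scale;

static void set_cyc2ns_scale(unsigned long cpu_khz)
{
	/* ns per cycle = 10^6/cpu_khz, kept with 10 fractional bits */
	cyc2ns_scale = (1000000UL << CYC2NS_SCALE_FACTOR) / cpu_khz;
}

static unsigned long long cycles_2_ns(unsigned long long cyc)
{
	return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
}

int main(void)
{
	set_cyc2ns_scale(2400000UL);		/* 2.4 GHz CPU */
	/* 10^9 cycles at 2.4 GHz is roughly 416666666 ns */
	printf("%llu ns\n", cycles_2_ns(1000000000ULL));
	return 0;
}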

if you've experienced audio skipping in 2.6.0-test1 (and later) kernels
then please give this patch a go. Reports, testing feedback and comments
are welcome,

Ingo


2003-07-25 21:01:44

by Rudolf Thomas

Subject: Re: [patch] sched-2.6.0-test1-G3

> [snip] The patch should also cleanly apply to 2.6.0-test1-bk2.

Actually, it does not, due to the cpumask_t -> unsigned long change. This one
applies.

--- linux/arch/i386/kernel/timers/timer_tsc.c.orig
+++ linux/arch/i386/kernel/timers/timer_tsc.c
@@ -116,6 +117,24 @@ static unsigned long long monotonic_cloc
return base + cycles_2_ns(this_offset - last_offset);
}

+/*
+ * Scheduler clock - returns current time in nanosec units.
+ */
+unsigned long long sched_clock(void)
+{
+ unsigned long long this_offset;
+
+ if (unlikely(!cpu_has_tsc))
+ return (unsigned long long)jiffies * (1000000000 / HZ);
+
+ /* Read the Time Stamp Counter */
+ rdtscll(this_offset);
+
+ /* return the value in ns */
+ return cycles_2_ns(this_offset);
+}
+
+
static void mark_offset_tsc(void)
{
unsigned long lost,delay;
--- linux/arch/i386/kernel/smpboot.c.orig
+++ linux/arch/i386/kernel/smpboot.c
@@ -915,13 +915,13 @@ static void smp_tune_scheduling (void)
cacheflush_time = (cpu_khz>>10) * (cachesize<<10) / bandwidth;
}

- cache_decay_ticks = (long)cacheflush_time/cpu_khz * HZ / 1000;
+ cache_decay_ticks = (long)cacheflush_time/cpu_khz + 1;

printk("per-CPU timeslice cutoff: %ld.%02ld usecs.\n",
(long)cacheflush_time/(cpu_khz/1000),
((long)cacheflush_time*100/(cpu_khz/1000)) % 100);
printk("task migration cache decay timeout: %ld msecs.\n",
- (cache_decay_ticks + 1) * 1000 / HZ);
+ cache_decay_ticks);
}

/*
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -341,7 +341,8 @@ struct task_struct {
prio_array_t *array;

unsigned long sleep_avg;
- unsigned long last_run;
+ unsigned long long timestamp;
+ int activated;

unsigned long policy;
unsigned long cpus_allowed;
@@ -511,6 +512,8 @@ static inline int set_cpus_allowed(task_
}
#endif

+extern unsigned long long sched_clock(void);
+
#ifdef CONFIG_NUMA
extern void sched_balance_exec(void);
extern void node_nr_running_init(void);
--- linux/kernel/fork.c.orig
+++ linux/kernel/fork.c
@@ -924,7 +924,7 @@ struct task_struct *copy_process(unsigne
*/
p->first_time_slice = 1;
current->time_slice >>= 1;
- p->last_run = jiffies;
+ p->timestamp = sched_clock();
if (!current->time_slice) {
/*
* This case is rare, it happens when the parent has only
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -68,12 +68,13 @@
*/
#define MIN_TIMESLICE ( 10 * HZ / 1000)
#define MAX_TIMESLICE (200 * HZ / 1000)
+#define TIMESLICE_GRANULARITY (HZ/20 ?: 1)
#define CHILD_PENALTY 50
#define PARENT_PENALTY 100
#define EXIT_WEIGHT 3
#define PRIO_BONUS_RATIO 25
#define INTERACTIVE_DELTA 2
-#define MAX_SLEEP_AVG (10*HZ)
+#define MAX_SLEEP_AVG (2*1000000000)
#define STARVATION_LIMIT (10*HZ)
#define NODE_THRESHOLD 125

@@ -115,6 +116,11 @@
#define TASK_INTERACTIVE(p) \
((p)->prio <= (p)->static_prio - DELTA(p))

+#define TASK_PREEMPTS_CURR(p, rq) \
+ ((p)->prio < (rq)->curr->prio || \
+ ((p)->prio == (rq)->curr->prio && \
+ (p)->time_slice > (rq)->curr->time_slice * 2))
+
/*
* BASE_TIMESLICE scales user-nice values [ -20 ... 19 ]
* to time slice values.
@@ -319,8 +325,8 @@ static int effective_prio(task_t *p)
if (rt_task(p))
return p->prio;

- bonus = MAX_USER_PRIO*PRIO_BONUS_RATIO*p->sleep_avg/MAX_SLEEP_AVG/100 -
- MAX_USER_PRIO*PRIO_BONUS_RATIO/100/2;
+ bonus = MAX_USER_PRIO*PRIO_BONUS_RATIO*(p->sleep_avg/1024)/(MAX_SLEEP_AVG/1024)/100;
+ bonus -= MAX_USER_PRIO*PRIO_BONUS_RATIO/100/2;

prio = p->static_prio - bonus;
if (prio < MAX_RT_PRIO)
@@ -339,24 +345,24 @@ static inline void __activate_task(task_
nr_running_inc(rq);
}

-/*
- * activate_task - move a task to the runqueue and do priority recalculation
- *
- * Update all the scheduling statistics stuff. (sleep average
- * calculation, priority modifiers, etc.)
- */
-static inline void activate_task(task_t *p, runqueue_t *rq)
+static void recalc_task_prio(task_t *p, unsigned long long now)
{
- long sleep_time = jiffies - p->last_run - 1;
+ unsigned long long __sleep_time = now - p->timestamp;
+ unsigned long sleep_time;
+
+ if (__sleep_time > MAX_SLEEP_AVG)
+ sleep_time = MAX_SLEEP_AVG;
+ else
+ sleep_time = (unsigned long)__sleep_time;

if (sleep_time > 0) {
- int sleep_avg;
+ unsigned long long sleep_avg;

/*
* This code gives a bonus to interactive tasks.
*
* The boost works by updating the 'average sleep time'
- * value here, based on ->last_run. The more time a task
+ * value here, based on ->timestamp. The more time a task
* spends sleeping, the higher the average gets - and the
* higher the priority boost gets as well.
*/
@@ -375,6 +381,23 @@ static inline void activate_task(task_t
p->prio = effective_prio(p);
}
}
+}
+
+/*
+ * activate_task - move a task to the runqueue and do priority recalculation
+ *
+ * Update all the scheduling statistics stuff. (sleep average
+ * calculation, priority modifiers, etc.)
+ */
+static inline void activate_task(task_t *p, runqueue_t *rq)
+{
+ unsigned long long now = sched_clock();
+
+ recalc_task_prio(p, now);
+
+ p->activated = 1;
+ p->timestamp = now;
+
__activate_task(p, rq);
}

@@ -501,7 +524,7 @@ repeat_lock_task:
__activate_task(p, rq);
else {
activate_task(p, rq);
- if (p->prio < rq->curr->prio)
+ if (TASK_PREEMPTS_CURR(p, rq))
resched_task(rq->curr);
}
success = 1;
@@ -550,8 +573,8 @@ void wake_up_forked_process(task_t * p)
* and children as well, to keep max-interactive tasks
* from forking tasks that are max-interactive.
*/
- current->sleep_avg = current->sleep_avg * PARENT_PENALTY / 100;
- p->sleep_avg = p->sleep_avg * CHILD_PENALTY / 100;
+ current->sleep_avg = current->sleep_avg / 100 * PARENT_PENALTY;
+ p->sleep_avg = p->sleep_avg / 100 * CHILD_PENALTY;
p->prio = effective_prio(p);
set_task_cpu(p, smp_processor_id());

@@ -592,8 +615,7 @@ void sched_exit(task_t * p)
* the sleep_avg of the parent as well.
*/
if (p->sleep_avg < p->parent->sleep_avg)
- p->parent->sleep_avg = (p->parent->sleep_avg * EXIT_WEIGHT +
- p->sleep_avg) / (EXIT_WEIGHT + 1);
+ p->parent->sleep_avg = p->parent->sleep_avg / (EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg / (EXIT_WEIGHT + 1);
}

/**
@@ -978,13 +1000,8 @@ static inline void pull_task(runqueue_t
* Note that idle threads have a prio of MAX_PRIO, for this test
* to be always true for them.
*/
- if (p->prio < this_rq->curr->prio)
+ if (TASK_PREEMPTS_CURR(p, this_rq))
set_need_resched();
- else {
- if (p->prio == this_rq->curr->prio &&
- p->time_slice > this_rq->curr->time_slice)
- set_need_resched();
- }
}

/*
@@ -1001,12 +1018,14 @@ static void load_balance(runqueue_t *thi
runqueue_t *busiest;
prio_array_t *array;
struct list_head *head, *curr;
+ unsigned long long now;
task_t *tmp;

busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask);
if (!busiest)
goto out;

+ now = sched_clock();
/*
* We first consider expired tasks. Those will likely not be
* executed in the near future, and they are most likely to
@@ -1052,8 +1071,9 @@ skip_queue:
* 3) are cache-hot on their current CPU.
*/

+#warning fixme
#define CAN_MIGRATE_TASK(p,rq,this_cpu) \
- ((!idle || (jiffies - (p)->last_run > cache_decay_ticks)) && \
+ ((!idle || (((now - (p)->timestamp)>>10) > cache_decay_ticks)) &&\
!task_running(rq, p) && \
((p)->cpus_allowed & (1UL << (this_cpu))))

@@ -1213,14 +1233,11 @@ void scheduler_tick(int user_ticks, int
spin_lock(&rq->lock);
/*
* The task was running during this tick - update the
- * time slice counter and the sleep average. Note: we
- * do not update a thread's priority until it either
- * goes to sleep or uses up its timeslice. This makes
- * it possible for interactive tasks to use up their
- * timeslices at their highest priority levels.
+ * time slice counter. Note: we do not update a thread's
+ * priority until it either goes to sleep or uses up its
+ * timeslice. This makes it possible for interactive tasks
+ * to use up their timeslices at their highest priority levels.
*/
- if (p->sleep_avg)
- p->sleep_avg--;
if (unlikely(rt_task(p))) {
/*
* RR tasks need a special form of timeslice management.
@@ -1244,12 +1261,33 @@ void scheduler_tick(int user_ticks, int
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;

- if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
+ if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq) || 0) {
if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
enqueue_task(p, rq->expired);
} else
enqueue_task(p, rq->active);
+ } else {
+ /*
+ * Prevent a too long timeslice allowing a task to monopolize
+ * the CPU. We do this by splitting up the timeslice into
+ * smaller pieces.
+ *
+ * Note: this does not mean the task's timeslices expire or
+ * get lost in any way, they just might be preempted by
+ * another task of equal priority. (one with higher
+ * priority would have preempted this task already.) We
+ * requeue this task to the end of the list on this priority
+ * level, which is in essence a round-robin of tasks with
+ * equal priority.
+ */
+ if (!(p->time_slice % TIMESLICE_GRANULARITY) &&
+ (p->array == rq->active)) {
+ dequeue_task(p, rq->active);
+ set_tsk_need_resched(p);
+ p->prio = effective_prio(p);
+ enqueue_task(p, rq->active);
+ }
}
out_unlock:
spin_unlock(&rq->lock);
@@ -1268,6 +1306,8 @@ asmlinkage void schedule(void)
runqueue_t *rq;
prio_array_t *array;
struct list_head *queue;
+ unsigned long long now;
+ unsigned long run_time;
int idx;

/*
@@ -1288,7 +1328,11 @@ need_resched:
rq = this_rq();

release_kernel_lock(prev);
- prev->last_run = jiffies;
+ now = sched_clock();
+ if (likely(now - prev->timestamp < MAX_SLEEP_AVG))
+ run_time = now - prev->timestamp;
+ else
+ run_time = MAX_SLEEP_AVG;
spin_lock_irq(&rq->lock);

/*
@@ -1336,12 +1380,25 @@ pick_next_task:
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);

+ if (next->activated) {
+ next->activated = 0;
+ array = next->array;
+ dequeue_task(next, array);
+ recalc_task_prio(next, now);
+ enqueue_task(next, array);
+ }
switch_tasks:
prefetch(next);
clear_tsk_need_resched(prev);
RCU_qsctr(task_cpu(prev))++;

+ prev->sleep_avg -= run_time;
+ if ((long)prev->sleep_avg < 0)
+ prev->sleep_avg = 0;
+ prev->timestamp = now;
+
if (likely(prev != next)) {
+ next->timestamp = now;
rq->nr_switches++;
rq->curr = next;

@@ -2362,6 +2419,12 @@ static void move_task_away(struct task_s
local_irq_restore(flags);
}

+typedef struct {
+ int cpu;
+ struct completion startup_done;
+ task_t *task;
+} migration_startup_t;
+
/*
* migration_thread - this is a highprio system thread that performs
* thread migration by bumping thread off CPU then 'pushing' onto
@@ -2371,20 +2434,21 @@ static int migration_thread(void * data)
{
/* Marking "param" __user is ok, since we do a set_fs(KERNEL_DS); */
struct sched_param __user param = { .sched_priority = MAX_RT_PRIO-1 };
- int cpu = (long) data;
+ migration_startup_t *startup = data;
+ int cpu = startup->cpu;
runqueue_t *rq;
int ret;

+ startup->task = current;
+ complete(&startup->startup_done);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ schedule();
+
+ BUG_ON(smp_processor_id() != cpu);
+
daemonize("migration/%d", cpu);
set_fs(KERNEL_DS);

- /*
- * Either we are running on the right CPU, or there's a a
- * migration thread on this CPU, guaranteed (we're started
- * serially).
- */
- set_cpus_allowed(current, 1UL << cpu);
-
ret = setscheduler(0, SCHED_FIFO, &param);

rq = this_rq();
@@ -2420,13 +2484,30 @@ static int migration_call(struct notifie
unsigned long action,
void *hcpu)
{
+ long cpu = (long) hcpu;
+ migration_startup_t startup;
+
switch (action) {
case CPU_ONLINE:
- printk("Starting migration thread for cpu %li\n",
- (long)hcpu);
- kernel_thread(migration_thread, hcpu, CLONE_KERNEL);
- while (!cpu_rq((long)hcpu)->migration_thread)
+
+ printk("Starting migration thread for cpu %li\n", cpu);
+
+ startup.cpu = cpu;
+ startup.task = NULL;
+ init_completion(&startup.startup_done);
+
+ kernel_thread(migration_thread, &startup, CLONE_KERNEL);
+ wait_for_completion(&startup.startup_done);
+ wait_task_inactive(startup.task);
+
+ startup.task->thread_info->cpu = cpu;
+ startup.task->cpus_allowed = cpumask_of_cpu(cpu);
+
+ wake_up_process(startup.task);
+
+ while (!cpu_rq(cpu)->migration_thread)
yield();
+
break;
}
return NOTIFY_OK;



2003-07-25 22:39:11

by Diego Calleja

Subject: Re: [patch] sched-2.6.0-test1-G3

On Fri, 25 Jul 2003 23:13:14 +0200, Rudo Thomas <[email protected]> wrote:

> > [snip] The patch should also cleanly apply to 2.6.0-test1-bk2.
>
> Actually, it does not, due to the cpumask_t -> unsigned long change. This one
> applies.

For me, it applies with offsets :-/

I get this:
kernel/built-in.o(.text+0x414e): In function `migration_call':
: undefined reference to `cpumask_of_cpu'
make: *** [.tmp_vmlinux1] Error 1
diego@estel:~/kernel/unsta$

cpumask_of_cpu is not declared/used anywhere in the kernel (apart
from kernel/sched.c:2498, after applying bk2 + your patch).

2003-07-25 22:43:37

by Felipe Alfaro Solana

Subject: Re: [patch] sched-2.6.0-test1-G3, interactivity changes, audio latency

On Fri, 2003-07-25 at 21:59, Ingo Molnar wrote:
> my current "interactivity changes" scheduler patchset can be found at:
>
> redhat.com/~mingo/O(1)-scheduler/sched-2.6.0-test1-G3
>
> (this patch is mostly orthogonal to Con's patchset, but obviously collides
> patch-wise. The patch should also cleanly apply to 2.6.0-test1-bk2.)

For my workloads, this patch behaves just a little bit better than -G2,
but overall, X still feels jumpy, even under no load. I haven't been
able to reproduce any XMMS sound skips, due to the raising of
CHILD_PENALTY.

I've seen much better response times and less jerkiness of the system by
setting MAX_SLEEP_AVG to 1000000000 and STARVATION_LIMIT to HZ. With
these settings, under heavy load, no process starvation or slowdown can
be perceived, and opening new terminals or launching processes is nearly
as fast as when the system is under no load. This is great, as the
vanilla scheduler is unable to handle this correctly.
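
In sched.c terms that tuning is just (a two-line sketch against the
G3-patched kernel/sched.c; the G3 values are shown in the comments):

#define MAX_SLEEP_AVG (1000000000)	/* G3: (2*1000000000) */
#define STARVATION_LIMIT (HZ)		/* G3: (10*HZ) */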

Additionally, renicing X to -20 gives pretty good results for X
applications, like Evolution, but there are still some remnants of
jerkiness from time to time.


2003-07-25 23:10:59

by Con Kolivas

Subject: Re: [patch] sched-2.6.0-test1-G3, interactivity changes, audio latency

On Sat, 26 Jul 2003 05:59, Ingo Molnar wrote:
> my current "interactivity changes" scheduler patchset can be found at:
>
> redhat.com/~mingo/O(1)-scheduler/sched-2.6.0-test1-G3
>
> (this patch is mostly orthogonal to Con's patchset, but obviously collides
> patch-wise. The patch should also cleanly apply to 2.6.0-test1-bk2.)

If the engineer is back to tune his own engine, the mechanic shall stop
playing with it.

Con

2003-07-25 23:16:24

by Rudolf Thomas

Subject: Re: [patch] sched-2.6.0-test1-G3

> cpumask_of_cpu is not declared/used anywhere in the kernel (apart
> of kernel/sched.c:2498, after applying bk2 + your patch)

Oh, well. I'd better go to sleep.

I suppose you figured out that you have to change that to 1ULL << cpu . Just in
case you didn't, get the right one at
http://www.ms.mff.cuni.cz/~thomr9am/sched-2.6.0-test1-bk2-G3 .

Rudo.

2003-07-26 01:37:15

by Con Kolivas

Subject: Re: [patch] sched-2.6.0-test1-G3

Ingo

With this section of your patch:

+ if (!(p->time_slice % TIMESLICE_GRANULARITY) &&

that would requeue active tasks a variable duration into their running time
dependent on their task timeslice rather than every 50ms. Shouldn't it be:

+ if (!((task_timeslice(p) - p->time_slice) % TIMESLICE_GRANULARITY) &&

to ensure it is TIMESLICE_GRANULARITY into the timeslice of a process?
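
A toy user-space model (not from the patch; it assumes HZ=1000, so
TIMESLICE_GRANULARITY is 50 ticks, and a made-up task with a 102-tick
full timeslice; time_slice counts down by one each tick) makes the
difference visible:

#include <stdio.h>

#define TIMESLICE_GRANULARITY 50	/* HZ/20 at HZ=1000, in ticks */

int main(void)
{
	int full_slice = 102;	/* stand-in for task_timeslice(p) */
	int ticks_run;

	for (ticks_run = 1; ticks_run < full_slice; ticks_run++) {
		int time_slice = full_slice - ticks_run;

		/* the G3 test: remaining slice modulo the granularity */
		if (!(time_slice % TIMESLICE_GRANULARITY))
			printf("current test fires %d ticks in\n", ticks_run);
		/* the suggested test: consumed slice modulo the granularity */
		if (!((full_slice - time_slice) % TIMESLICE_GRANULARITY))
			printf("modified test fires %d ticks in\n", ticks_run);
	}
	return 0;
}

With a 102-tick slice the current test fires 2 and 52 ticks in, while the
modified test fires at exactly 50 and 100 ticks of actual runtime.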

Con

2003-07-26 15:54:46

by Guillaume Chazarain

Subject: Re: [patch] sched-2.6.0-test1-G3, interactivity changes, audio latency

Hi,

> - cycle accuracy (nanosec resolution) timekeeping within the scheduler.
> This fixes a number of audio artifacts (skipping) I've reproduced. I
> don't think we can get away without going cycle accurate - reading the
> cycle counter adds some overhead, but it's acceptable. The first
> nanosec-accuracy patch was done by Mike Galbraith - this patch is
> different but similar in nature. I went further in also changing the
> sleep_avg to be of nanosec resolution.

I'd like to have your opinion on this. In 2.6.0-test1 we have:

static inline void activate_task(task_t *p, runqueue_t *rq)
{
	long sleep_time = jiffies - p->last_run - 1;

I'd like to have it back to jiffies - p->last_run. Here is why: suppose
jiffies - p->last_run == 1; these are the extreme timelines possible for
this (S: sleep, W: wake up):

              p->last_run      jiffies
time:     ----|---------------|---------------->
longest:      S                              W   => sleep_time = 2
shortest:                    S W                 => sleep_time = 0

That is, when jiffies - p->last_run == 1, all we know is that sleep_time
is between 0 and 2. And currently we take 0 for sleep_time in this case.
Wouldn't it be more accurate to take 1 as a better approximation for something
between 0 and 2? This seems sensible to me, because it makes mplayer
move from max CPU hog to max interactive. Given that mplayer uses only
10% of the CPU, it makes some sense.
BTW, with your nanosecond resolution, mplayer is also max interactive.
Don't you find it strange to need a nanosecond resolution to evaluate a simple
integer in a [-5, 5] range?

> - more finegrained timeslices: there's now a timeslice 'sub unit' of 50
> msecs (TIMESLICE_GRANULARITY) - CPU hogs on the same priority level
> will roundrobin with this unit. This change is intended to make gaming
> latencies shorter.

Strange that no one complained about the throughput. Anyway, I don't care either.

> - include scheduling latency in sleep bonus calculation. This change
>
> extends the sleep-average calculation to the period of time a task
> spends on the runqueue but doesn't get scheduled yet, right after
> wakeup. Note that tasks that were preempted (i.e. not woken up) and are
> still on the runqueue do not get this benefit. This change closes one
> of the last holes in the dynamic priority estimation; it should result
> in interactive tasks getting more priority under heavy load. This
> change also fixes the test-starve.c testcase from David Mosberger.

Right, it solves test-starve.c, but not irman2.c. With sched-G4, when irman2
is launched, a kernel compile can take ages; I tried it. After 3 hours it
was still on the first file (scripts/fixdep.c) and had produced no .o file.
With the patch at the end, a kernel compile takes one hour (with both -j1 and
-j16) versus five minutes when nothing else runs (config: allnoconfig).

The idea in the patch is to keep a list of the tasks on the runqueue, without
the one currently running, and sorted by insertion date. Before reinserting an
interactive task in the active array, we check that no task has waited too
long on this list. Davide, does it implement the interactivity throttle you had
in mind?

It's very similar to EXPIRED_STARVING(), but it has the advantage of considering
all tasks, not only the expired. It seems that with irman2, tasks don't even
have the time to expire.


Regards,
Guillaume




--- linux-2.6.0-test1/include/linux/sched.h
+++ linux-2.6.0-test1-gfc/include/linux/sched.h
@@ -335,6 +335,8 @@

int prio, static_prio;
struct list_head run_list;
+ unsigned long activated_timestamp;
+ struct list_head activated_list;
prio_array_t *array;

unsigned long sleep_avg;
--- linux-2.6.0-test1/kernel/fork.c
+++ linux-2.6.0-test1-gfc/kernel/fork.c
@@ -839,6 +839,7 @@
p->proc_dentry = NULL;

INIT_LIST_HEAD(&p->run_list);
+ INIT_LIST_HEAD(&p->activated_list);

INIT_LIST_HEAD(&p->children);
INIT_LIST_HEAD(&p->sibling);
--- linux-2.6.0-test1/kernel/sched.c
+++ linux-2.6.0-test1-gfc/kernel/sched.c
@@ -74,14 +74,14 @@
#define PRIO_BONUS_RATIO 25
#define INTERACTIVE_DELTA 2
#define MAX_SLEEP_AVG (10*HZ)
-#define STARVATION_LIMIT (10*HZ)
+#define STARVATION_LIMIT MAX_TIMESLICE
#define NODE_THRESHOLD 125

/*
- * If a task is 'interactive' then we reinsert it in the active
- * array after it has expired its current timeslice. (it will not
- * continue to run immediately, it will still roundrobin with
- * other interactive tasks.)
+ * If a task is 'interactive' and there is no starvation, then we
+ * reinsert it in the active array after it has expired its current
+ * timeslice. (it will not continue to run immediately, it will still
+ * roundrobin with other interactive tasks.)
*
* This part scales the interactivity limit depending on niceness.
*
@@ -157,11 +157,11 @@
*/
struct runqueue {
spinlock_t lock;
- unsigned long nr_running, nr_switches, expired_timestamp,
- nr_uninterruptible;
+ unsigned long nr_running, nr_switches, nr_uninterruptible;
task_t *curr, *idle;
struct mm_struct *prev_mm;
prio_array_t *active, *expired, arrays[2];
+ struct list_head activated_list;
int prev_cpu_load[NR_CPUS];
#ifdef CONFIG_NUMA
atomic_t *node_nr_running;
@@ -330,6 +330,17 @@
return prio;
}

+static inline void activated_task(task_t *p, runqueue_t *rq)
+{
+ if (likely(p != rq->idle) &&
+ likely(!rt_task(p)) &&
+ likely(!p->activated_list.next)) {
+
+ list_add_tail(&p->activated_list, &rq->activated_list);
+ p->activated_timestamp = jiffies;
+ }
+}
+
/*
* __activate_task - move a task to the runqueue.
*/
@@ -337,6 +348,7 @@
{
enqueue_task(p, rq->active);
nr_running_inc(rq);
+ activated_task(p, rq);
}

/*
@@ -563,6 +575,7 @@
p->array = current->array;
p->array->nr_active++;
nr_running_inc(rq);
+ activated_task(p, rq);
}
task_rq_unlock(rq, &flags);
}
@@ -1155,15 +1168,34 @@
* We place interactive tasks back into the active array, if possible.
*
* To guarantee that this does not starve expired tasks we ignore the
- * interactivity of a task if the first expired task had to wait more
- * than a 'reasonable' amount of time. This deadline timeout is
- * load-dependent, as the frequency of array switched decreases with
- * increasing number of running tasks:
- */
-#define EXPIRED_STARVING(rq) \
- (STARVATION_LIMIT && ((rq)->expired_timestamp && \
- (jiffies - (rq)->expired_timestamp >= \
- STARVATION_LIMIT * ((rq)->nr_running) + 1)))
+ * interactivity of a task if a task had to wait more than a 'reasonable'
+ * amount of time. This deadline timeout is load-dependent, as the frequency
+ * of array switched decreases with increasing number of running tasks:
+ */
+static inline int runqueue_starvation(runqueue_t *rq)
+{
+ struct task_struct *first;
+ unsigned long delay;
+
+ if (list_empty(&rq->activated_list))
+ return 0;
+
+ first = list_entry(rq->activated_list.next, task_t, activated_list);
+ delay = jiffies - first->activated_timestamp;
+
+ if (unlikely(delay > rq->nr_running * STARVATION_LIMIT)) {
+ list_del(&first->run_list);
+ list_add(&first->run_list, first->array->queue + first->prio);
+
+ /* DEBUG */
+ printk("%d (%s) is starved (%lu ms) ", first->pid,
+ first->comm, delay - rq->nr_running * MAX_TIMESLICE);
+ printk("punishing %d (%s)\n", current->pid, current->comm);
+ return 1;
+ }
+
+ return 0;
+}

/*
* This function gets called by the timer code, with HZ frequency.
@@ -1238,11 +1270,9 @@
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;

- if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
- if (!rq->expired_timestamp)
- rq->expired_timestamp = jiffies;
+ if (!TASK_INTERACTIVE(p) || unlikely(runqueue_starvation(rq)))
enqueue_task(p, rq->expired);
- } else
+ else
enqueue_task(p, rq->active);
}
out_unlock:
@@ -1251,6 +1281,28 @@
rebalance_tick(rq, 0);
}

+static inline void prepare_switch(runqueue_t *rq, task_t *prev, task_t *next)
+{
+ /* prev has been preempted. */
+ if (prev->state == TASK_RUNNING ||
+ unlikely(preempt_count() & PREEMPT_ACTIVE))
+ activated_task(prev, rq);
+
+ /* next is no more waiting for being scheduled. */
+ if (likely(next != rq->idle) &&
+ likely(!rt_task(next))) {
+ if (likely(next->activated_list.next != NULL)) {
+ list_del(&next->activated_list);
+ next->activated_list.next = NULL;
+ } else {
+ /* DEBUG */
+ dump_stack();
+ printk("not activated: %d (%s)\n", next->pid,
+ next->comm);
+ }
+ }
+}
+
void scheduling_functions_start_here(void) { }

/*
@@ -1311,7 +1363,6 @@
goto pick_next_task;
#endif
next = rq->idle;
- rq->expired_timestamp = 0;
goto switch_tasks;
}

@@ -1323,7 +1374,6 @@
rq->active = rq->expired;
rq->expired = array;
array = rq->active;
- rq->expired_timestamp = 0;
}

idx = sched_find_first_bit(array->bitmap);
@@ -1336,6 +1386,8 @@
RCU_qsctr(task_cpu(prev))++;

if (likely(prev != next)) {
+ prepare_switch(rq, prev, next);
+
rq->nr_switches++;
rq->curr = next;

@@ -2497,6 +2549,7 @@
rq = cpu_rq(i);
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
+ INIT_LIST_HEAD(&rq->activated_list);
spin_lock_init(&rq->lock);
INIT_LIST_HEAD(&rq->migration_queue);
atomic_set(&rq->nr_iowait, 0);




2003-07-27 09:12:18

by Mike Galbraith

Subject: Re: [patch] sched-2.6.0-test1-G3, interactivity changes, audio latency

Hi,

At 06:09 PM 7/26/2003 +0200, Guillaume Chazarain wrote:

> > - include scheduling latency in sleep bonus calculation. This change
> >
> > extends the sleep-average calculation to the period of time a task
> > spends on the runqueue but doesn't get scheduled yet, right after
> > wakeup. Note that tasks that were preempted (i.e. not woken up) and are
> > still on the runqueue do not get this benefit. This change closes one
> > of the last holes in the dynamic priority estimation; it should result
> > in interactive tasks getting more priority under heavy load. This
> > change also fixes the test-starve.c testcase from David Mosberger.
>
> Right, it solves test-starve.c, but not irman2.c. With sched-G4, when irman2
> is launched, a kernel compile can take ages; I tried it. After 3 hours it
> was still on the first file (scripts/fixdep.c) and had produced no .o file.
> With the patch at the end, a kernel compile takes one hour (with both -j1
> and -j16) versus five minutes when nothing else runs (config: allnoconfig).
>
> The idea in the patch is to keep a list of the tasks on the runqueue,
> without the one currently running, and sorted by insertion date. Before
> reinserting an interactive task in the active array, we check that no task
> has waited too long on this list. Davide, does it implement the
> interactivity throttle you had in mind?
>
> It's very similar to EXPIRED_STARVING(), but it has the advantage of
> considering all tasks, not only the expired. It seems that with irman2,
> tasks don't even have the time to expire.

True, irman2 is a very nasty little bugger.

Your method works well here. With your patch in test1+G4, I can run irman2
and still have a quite usable system. It could possibly use a little
refinement though. If xmms happens to run out of slice at a time when you
determine that starvation is in progress, it'll nail the innocent xmms (or
other very light task).

I went a different route in tackling irman2 to avoid the pain of expiring X
(glitch), but really there's no difference between X and xmms... they
should both be run RR if you want 100% glitch-free operation (and thus they
would be exempted from punishment under your method, and mine as well).

-Mike

2003-07-27 10:19:09

by Ingo Molnar

Subject: Re: [patch] sched-2.6.0-test1-G3, interactivity changes, audio latency


On Sat, 26 Jul 2003, Guillaume Chazarain wrote:

> BTW, with your nanosecond resolution, mplayer is also max interactive.

good.

> Don't you find it strange to need a nanosecond resolution to evaluate a
> simple integer in a [-5, 5] range?

no, I don't find it strange, as the -5..5 integer range is only the end
result. What we really want is an accurate sleep average: the integral of
sleep times taken over the last N seconds. Given that tasks can schedule at
very high frequencies these days, this needs accurate measurements. I used
to hope that a 1 msec sampling frequency would be enough for this, but
when we fix the statistics cornercase you mention, another cornercase pops
up. And it's not _that_ hard to use the cycle counter and still have a
reasonably fast algorithm, anyway.
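
A minimal user-space model of that bookkeeping (the clamping follows the
G3 patch, where recalc_task_prio() caps the credited sleep at
MAX_SLEEP_AVG and schedule() debits the runtime, flooring at zero; the
credit/debit helper names here are made up):

#include <stdio.h>

#define MAX_SLEEP_AVG (2ULL * 1000000000ULL)	/* 2 seconds, in ns */

static unsigned long long sleep_avg;

/* wakeup path: credit time spent sleeping (and waiting on the runqueue) */
static void credit_sleep(unsigned long long slept_ns)
{
	if (slept_ns > MAX_SLEEP_AVG - sleep_avg)
		sleep_avg = MAX_SLEEP_AVG;
	else
		sleep_avg += slept_ns;
}

/* schedule() path: debit time spent on the CPU */
static void debit_run(unsigned long long ran_ns)
{
	if (ran_ns > sleep_avg)
		sleep_avg = 0;
	else
		sleep_avg -= ran_ns;
}

int main(void)
{
	credit_sleep(1500000000ULL);	/* slept 1.5 s */
	debit_run(100000000ULL);	/* ran 100 ms */
	printf("sleep_avg = %llu ns\n", sleep_avg);	/* 1400000000 */
	return 0;
}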

Ingo

2003-07-27 12:19:14

by Ingo Molnar

Subject: Re: [patch] sched-2.6.0-test1-G3


On Sat, 26 Jul 2003, Con Kolivas wrote:

> With this section of your patch:
>
> + if (!(p->time_slice % TIMESLICE_GRANULARITY) &&
>
> that would requeue active tasks a variable duration into their running
> time dependent on their task timeslice rather than every 50ms. Shouldn't
> it be:
>
> + if (!((task_timeslice(p) - p->time_slice) % TIMESLICE_GRANULARITY) &&
>
> to ensure it is TIMESLICE_GRANULARITY into the timeslice of a process?

yeah, agreed.

Ingo