Greetings,
Below, please find the latest version of my task throttling attempt.
What the patch addresses:
The current interactivity heuristics are so heavily slanted in favor of
tasks which sleep at all that any task which sleeps for 5% of the time
can use 95% cpu forever. Even if this slant were removed and the boost
became linear again, any task which sleeps even a tiny bit longer than
it runs will eventually attain maximum dynamic priority.
That said, the current heuristics work very well in the general case.
When they fail, however, it can get pretty darn ugly. That is the
problem space this patch attempts to address, and it does pretty well at
it. It tries to solve the nasty corner cases.
How it works is by no means rocket science, and is described in the
comments of the first hunk of sched.c changes.
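For readers who want the gist before digging into the diff, here is a
minimal userspace sketch of the idea. It is not the patch code; the
structure, the averaging weight and the grace value are all made up for
illustration, but the shape is the same: each slice we record how idle a
task really was, and once the grace period runs out a task whose claimed
idleness (sleep_avg) exceeds its observed idleness (slice_avg) only gets
the bonus its behavior justifies.

/* toy_throttle.c: userspace toy model, not the patch code. */
#include <stdio.h>

#define MAX_BONUS 10
#define GRACE     10	/* slices of grace; stands in for sched_g1/g2 */

struct task {
	const char *name;
	int sleep_avg;	/* claimed idleness, percent */
	int slice_avg;	/* observed idleness, percent */
	long stamp;	/* throttle stamp, in slice units */
};

static int effective_bonus(struct task *t, long now)
{
	int idle = t->sleep_avg;

	/* grace expired: trust observed behavior over the claim */
	if (now > t->stamp + GRACE && t->slice_avg < idle)
		idle = t->slice_avg;
	return idle * MAX_BONUS / 100;
}

static void end_of_slice(struct task *t, long now, int idle_pct)
{
	/* fold this slice's observation into slice_avg (weight arbitrary) */
	t->slice_avg = (3 * t->slice_avg + idle_pct) / 4;

	/* behaving as claimed?  then push the stamp ahead of 'now' */
	if (idle_pct >= t->sleep_avg)
		t->stamp = (t->stamp > now ? t->stamp : now) + 1;
}

int main(void)
{
	struct task honest = { "honest", 80, 80, 0 };
	struct task hog    = { "hog",    80, 80, 0 };
	long now;

	for (now = 1; now <= 40; now++) {
		end_of_slice(&honest, now, 80);	/* sleeps as much as it claims */
		end_of_slice(&hog,    now,  5);	/* claims 80%, sleeps 5% */
		if (now % 10 == 0)
			printf("slice %2ld: honest bonus %d, hog bonus %d\n",
			       now, effective_bonus(&honest, now),
			       effective_bonus(&hog, now));
	}
	return 0;
}

The hog keeps its boost only until the grace runs out; the honest task
never notices the throttle at all.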
In addition to the throttling, this simple patch does three things
(more?) which may be controversial.
1. It removes the barrier between kernel threads and user tasks wrt the
handling of dynamic priority.
2. It changes the meaning of INTERACTIVE_SLEEP() a little.
3. It removes TASK_NONINTERACTIVE from fs/pipe.c
My argument for #1: Either dynamic priority is fully applicable or not
at all. If any kernel thread really needs to be exempt from any aspect
of dynamic priority, it should be in a different scheduling class.
My argument for #2: The boundary established by TASK_INTERACTIVE() is a
portal, and should be treated as such. This patch does not establish
the boundary, it only enforces it. It also adds a very modest level of
caution to the promotion of tasks beyond the interactive task boundary.
One single 10ms sleep will no longer take a pure cpu hog to the very
top.
My argument for #3: Amarok is an mp3 player which uses pipes to
communicate with its various components. I can see no reason that it
should not receive the same boost as any other task. It is, after all, a
genuine interactive task. Besides, the flag is no longer necessary.
Oops, there's a #4.
4. It adds knobs to the scheduler.
My argument for #4: These knobs put policy where it belongs. If the
user wants a maximum interactive environment, and is willing to accept
that this means some starvation comes with the deal, it's his choice.
Those who have little to no tolerance for starvation can set both knobs
to zero, and be happy. Both can have their cake with whatever icing
they are partial to.
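As a usage sketch (assuming this patch is applied, so the two knobs
below exist under /proc/sys/kernel), setting both grace periods to zero
from a small C helper, or simply echoing the values in by hand, turns the
throttle into a zero-tolerance starvation guard:

#include <stdio.h>

/* write one integer to a sysctl file; paths come from the patch below */
static int set_knob(const char *path, int val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", val);
	return fclose(f);
}

int main(void)
{
	if (set_knob("/proc/sys/kernel/sched_g1", 0) ||
	    set_knob("/proc/sys/kernel/sched_g2", 0))
		perror("sched throttle knobs");
	return 0;
}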
There may be a #5, #6... who knows, you tell me ;-)
Comments? Suggestions? Fla^H^H^H (nah... beware what you ask)
-Mike
P.S. Any interactive feel improvement inflicted upon the user by the
throttling of a hyperactive cpu hog is absolutely free... as is the
strangulation of their favorite mp3 player's GL visualization ;-)
Not to be taken as a sign of misguided confidence, I'm just taking blame
for this in all of its ugly-is-in-the-eye-of-the-beholder glory.
Signed-off-by: Mike Galbraith <[email protected]>
--- linux-2.6.16-rc3-mm1x/include/linux/sched.h.org 2006-02-15 08:40:13.000000000 +0100
+++ linux-2.6.16-rc3-mm1x/include/linux/sched.h 2006-02-15 08:40:37.000000000 +0100
@@ -719,14 +719,14 @@
unsigned short ioprio;
unsigned int btrace_seq;
- unsigned long sleep_avg;
+ unsigned long sleep_avg, last_slice, throttle_stamp;
unsigned long long timestamp, last_ran;
unsigned long long sched_time; /* sched_clock time spent running */
enum sleep_type sleep_type;
unsigned long policy;
cpumask_t cpus_allowed;
- unsigned int time_slice, first_time_slice;
+ unsigned int time_slice, slice_info;
#ifdef CONFIG_SCHEDSTATS
struct sched_info sched_info;
--- linux-2.6.16-rc3-mm1x/include/linux/sysctl.h.org 2006-02-15 08:40:25.000000000 +0100
+++ linux-2.6.16-rc3-mm1x/include/linux/sysctl.h 2006-02-15 08:40:37.000000000 +0100
@@ -147,6 +147,8 @@
KERN_SETUID_DUMPABLE=69, /* int: behaviour of dumps for setuid core */
KERN_SPIN_RETRY=70, /* int: number of spinlock retries */
KERN_ACPI_VIDEO_FLAGS=71, /* int: flags for setting up video after ACPI sleep */
+ KERN_SCHED_THROTTLE1=72, /* int: throttling grace period 1 in secs */
+ KERN_SCHED_THROTTLE2=73, /* int: throttling grace period 2 in secs */
};
--- linux-2.6.16-rc3-mm1x/kernel/sched.c.org 2006-02-15 08:32:15.000000000 +0100
+++ linux-2.6.16-rc3-mm1x/kernel/sched.c 2006-02-17 12:13:48.000000000 +0100
@@ -158,9 +158,192 @@
#define TASK_INTERACTIVE(p) \
((p)->prio <= (p)->static_prio - DELTA(p))
-#define INTERACTIVE_SLEEP(p) \
- (JIFFIES_TO_NS(MAX_SLEEP_AVG * \
- (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
+/*
+ * Interactive boost can lead to serious starvation problems if the
+ * task being boosted turns out to be a cpu hog. To combat this, we
+ * compute the sane upper limit for cpu usage 'slice_avg' based on a
+ * task's sleep_avg, and use this information combined with a timer
+ * to determine when a task is in need of throttling.
+ *
+ * Strategy:
+ *
+ * All tasks begin life with a throttle timer which is expired, but a
+ * slice_avg which allows brief interactive status. This status will
+ * expire at the end of their very first timeslice if they don't sleep,
+ * thus preventing them from doing harm right from the beginning. The
+ * only ways to reset the throttle timer, and thus escape its effects,
+ * are to either change behavior to match what sleep_avg indicates, or
+ * to become insignificant. We push the time stamp forward in time when
+ * a task's behavior matches expectations; once it's in the future, we
+ * reset the timer, which will allow the task to run unencumbered
+ * for a maximum amount of time determined by the user. The less cpu a
+ * task uses, the faster it can reset its timer, so lightweight tasks
+ * very rapidly attain interactive status, as do things like X which
+ * change behavior radically. Tasks with cpu usage which is constantly
+ * above what their sleep_avg indicates can't escape.
+ *
+ * /proc/sys/kernel tunables.
+ *
+ * sched_g1: Grace period in seconds that a task is allowed to run free.
+ * sched_g2: seconds thereafter, to force a priority adjustment.
+ */
+
+int sched_g1 = 20;
+int sched_g2 = 10;
+
+#define G1 (sched_g1 * HZ)
+#define G2 (sched_g2 * HZ + G1)
+
+/*
+ * Throttle distance maximum.
+ */
+#define GRACE_MAX (G2 << 1)
+
+#define grace_expired(p, grace) ((p)->throttle_stamp && \
+ time_after_eq(jiffies, (p)->throttle_stamp + (grace)))
+
+#define NS_MAX_BONUS (NS_MAX_SLEEP_AVG / MAX_BONUS)
+#define NS_SLEEP_AVG_PCNT (NS_MAX_SLEEP_AVG / 100)
+
+/*
+ * Masks for p->slice_info, formerly p->first_time_slice.
+ * SLICE_FTS: 0x80000000 Task is in it's first ever timeslice.
+ * SLICE_NEW: 0x40000000 Slice refreshed.
+ * SLICE_SPA: 0x3FFF8000 Spare bits.
+ * SLICE_LTS: 0x00007F80 Last time slice
+ * SLICE_AVG: 0x0000007F Task slice_avg stored as percentage.
+ */
+#define SLICE_AVG_BITS 7
+#define SLICE_LTS_BITS 10
+#define SLICE_SPA_BITS 13
+#define SLICE_NEW_BITS 1
+#define SLICE_FTS_BITS 1
+
+#define SLICE_AVG_SHIFT 0
+#define SLICE_LTS_SHIFT (SLICE_AVG_SHIFT + SLICE_AVG_BITS)
+#define SLICE_SPA_SHIFT (SLICE_LTS_SHIFT + SLICE_LTS_BITS)
+#define SLICE_NEW_SHIFT (SLICE_SPA_SHIFT + SLICE_SPA_BITS)
+#define SLICE_FTS_SHIFT (SLICE_NEW_SHIFT + SLICE_NEW_BITS)
+
+#define INFO_MASK(x) ((1U << (x))-1)
+#define SLICE_AVG_MASK (INFO_MASK(SLICE_AVG_BITS) << SLICE_AVG_SHIFT)
+#define SLICE_LTS_MASK (INFO_MASK(SLICE_LTS_BITS) << SLICE_LTS_SHIFT)
+#define SLICE_SPA_MASK (INFO_MASK(SLICE_SPA_BITS) << SLICE_SPA_SHIFT)
+#define SLICE_NEW_MASK (INFO_MASK(SLICE_NEW_BITS) << SLICE_NEW_SHIFT)
+#define SLICE_FTS_MASK (INFO_MASK(SLICE_FTS_BITS) << SLICE_FTS_SHIFT)
+
+/*
+ * p->slice_info access macros.
+ */
+#define first_time_slice(p) ((p)->slice_info & SLICE_FTS_MASK)
+#define set_first_time_slice(p) ((p)->slice_info |= SLICE_FTS_MASK)
+#define clr_first_time_slice(p) ((p)->slice_info &= ~SLICE_FTS_MASK)
+
+#define slice_is_new(p) ((p)->slice_info & SLICE_NEW_MASK)
+#define set_slice_is_new(p) ((p)->slice_info |= SLICE_NEW_MASK)
+#define clr_slice_is_new(p) ((p)->slice_info &= ~SLICE_NEW_MASK)
+
+#define last_slice(p) \
+ ((((p)->slice_info & SLICE_LTS_MASK) >> SLICE_LTS_SHIFT) ? : \
+ DEF_TIMESLICE)
+#define set_last_slice(p, n) ((p)->slice_info = (((p)->slice_info & \
+ ~SLICE_LTS_MASK) | (((n) << SLICE_LTS_SHIFT) & SLICE_LTS_MASK)))
+
+#define slice_avg(p) ((typeof((p)->sleep_avg)) \
+ ((((p)->slice_info & SLICE_AVG_MASK) >> SLICE_AVG_SHIFT) * \
+ NS_SLEEP_AVG_PCNT))
+#define set_slice_avg(p, n) ((p)->slice_info = (((p)->slice_info & \
+ ~SLICE_AVG_MASK) | ((((n) / NS_SLEEP_AVG_PCNT) \
+ << SLICE_AVG_SHIFT) & SLICE_AVG_MASK)))
+#define slice_avg_raw(p) \
+ (((p)->slice_info & SLICE_AVG_MASK) >> SLICE_AVG_SHIFT)
+#define set_slice_avg_raw(p, n) ((p)->slice_info = (((p)->slice_info & \
+ ~SLICE_AVG_MASK) | (((n) << SLICE_AVG_SHIFT) & SLICE_AVG_MASK)))
+
+/*
+ * cpu usage macros.
+ */
+#define cpu_avg(p) \
+ (100 - slice_avg_raw(p))
+
+#define slice_time_avg(p) \
+ (100 * last_slice(p) / max((unsigned) cpu_avg(p), 1U))
+
+#define cpu_max(p) \
+ (100 - ((p)->sleep_avg / NS_SLEEP_AVG_PCNT))
+
+#define slice_time_min(p) \
+ (100 * last_slice(p) / max((unsigned) cpu_max(p), 1U))
+
+#define time_this_slice(p) \
+ (jiffies - (p)->last_slice)
+
+#define cpu_this_slice(p) \
+ (100 * last_slice(p) / max((unsigned) time_this_slice(p), \
+ (unsigned) last_slice(p)))
+
+#define this_slice_avg(p) ((typeof((p)->sleep_avg)) \
+ ((100 - cpu_this_slice(p)) * NS_SLEEP_AVG_PCNT))
+
+/*
+ * Those who use the least cpu receive the most encouragement.
+ */
+#define SLICE_AVG_MULTIPLIER(p) \
+ (1 + NS_TO_JIFFIES(this_slice_avg(p)) * MAX_BONUS / MAX_SLEEP_AVG)
+
+#define CPU_MINIMAL (100 / MAX_BONUS / 2)
+
+static void throttle_cond_reset(task_t *p)
+{
+ int delay;
+
+ if (cpu_avg(p) > cpu_max(p) && cpu_this_slice(p) > CPU_MINIMAL)
+ return;
+
+ delay = slice_time_avg(p) - last_slice(p);
+
+ if (delay > 0) {
+ delay *= SLICE_AVG_MULTIPLIER(p);
+ p->throttle_stamp += delay;
+ }
+ if (time_before(jiffies, p->throttle_stamp)) {
+ p->throttle_stamp = jiffies;
+ if (!(p->state & TASK_NONINTERACTIVE))
+ p->sleep_type = SLEEP_NORMAL;
+ }
+}
+
+/*
+ * CURRENT_BONUS(p) adjusted to match slice_avg after grace expiration.
+ */
+#define ADJUSTED_BONUS(p, grace) \
+({ \
+ unsigned long sleep_avg = (p)->sleep_avg; \
+ if (grace_expired(p, (grace))) \
+ sleep_avg = min((p)->sleep_avg, slice_avg(p)); \
+ NS_TO_JIFFIES(sleep_avg) * MAX_BONUS / MAX_SLEEP_AVG; \
+})
+
+#define BONUS_MULTIPLIER(p) \
+ (grace_expired(p, G1) ? : SLICE_AVG_MULTIPLIER(p))
+
+#define BONUS_DIVISOR(p) \
+ (grace_expired(p, G2) ? : (1 + ADJUSTED_BONUS(p, G1)))
+
+#define INTERACTIVE_SLEEP_AVG(p) \
+ (min(JIFFIES_TO_NS(MAX_SLEEP_AVG * (MAX_BONUS / 2 + DELTA(p)) / \
+ MAX_BONUS), NS_MAX_SLEEP_AVG))
+
+/*
+ * Quantity of sleep guaranteed to elevate a task to interactive status,
+ * or once there, to elevate it to the next priority or beyond.
+ */
+#define INTERACTIVE_SLEEP_NS(p, ns) \
+ (BONUS_MULTIPLIER(p) * (ns) >= INTERACTIVE_SLEEP_AVG(p) || \
+ ((p)->sleep_avg < INTERACTIVE_SLEEP_AVG(p) && BONUS_MULTIPLIER(p) * \
+ (ns) + (p)->sleep_avg >= INTERACTIVE_SLEEP_AVG(p)) || \
+ ((p)->sleep_avg >= INTERACTIVE_SLEEP_AVG(p) && BONUS_MULTIPLIER(p) * \
+ (ns) + ((p)->sleep_avg % NS_MAX_BONUS) >= NS_MAX_BONUS))
#define TASK_PREEMPTS_CURR(p, rq) \
((p)->prio < (rq)->curr->prio)
@@ -668,7 +851,7 @@
if (rt_task(p))
return p->prio;
- bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
+ bonus = ADJUSTED_BONUS(p, G2) - MAX_BONUS / 2;
prio = p->static_prio - bonus;
if (prio < MAX_RT_PRIO)
@@ -792,21 +975,41 @@
sleep_time = (unsigned long)__sleep_time;
}
+ throttle_cond_reset(p);
+
if (likely(sleep_time > 0)) {
/*
- * User tasks that sleep a long time are categorised as
- * idle. They will only have their sleep_avg increased to a
+ * Tasks that sleep a long time are categorised as idle.
+ * They will only have their sleep_avg increased to a
* level that makes them just interactive priority to stay
* active yet prevent them suddenly becoming cpu hogs and
- * starving other processes.
+ * starving other processes. All tasks must stop at each
+ * TASK_INTERACTIVE boundary before moving on so that no
+ * single sleep slams them straight into NS_MAX_SLEEP_AVG.
+ * Tasks which have exceeded their authorized cpu usage
+ * will not be promoted beyond minimally interactive.
*/
- if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) {
- unsigned long ceiling;
+ if (INTERACTIVE_SLEEP_NS(p, sleep_time)) {
+ unsigned long ceiling = INTERACTIVE_SLEEP_AVG(p);
+ unsigned int slice = p->time_slice / BONUS_DIVISOR(p);
+ int throttle = grace_expired(p, G1);
+
+ /*
+ * Promote previously interactive tasks.
+ */
+ if (!throttle && p->sleep_avg >= ceiling) {
+ ceiling = p->sleep_avg / NS_MAX_BONUS;
+ if (ceiling < MAX_BONUS)
+ ceiling++;
+ ceiling *= NS_MAX_BONUS;
+ }
+
+ ceiling += JIFFIES_TO_NS(slice);
+ if (ceiling > NS_MAX_SLEEP_AVG)
+ ceiling = NS_MAX_SLEEP_AVG;
+ if (p->sleep_avg < ceiling)
+ p->sleep_avg = ceiling;
- ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
- DEF_TIMESLICE);
- if (p->sleep_avg < ceiling)
- p->sleep_avg = ceiling;
} else {
/*
@@ -816,9 +1019,8 @@
* If a task was sleeping with the noninteractive
* label do not apply this non-linear boost
*/
- if (p->sleep_type != SLEEP_NONINTERACTIVE || !p->mm)
- sleep_time *=
- (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
+ if (p->sleep_type != SLEEP_NONINTERACTIVE)
+ sleep_time *= BONUS_MULTIPLIER(p);
/*
* This code gives a bonus to interactive tasks.
@@ -1362,7 +1564,8 @@
out_activate:
#endif /* CONFIG_SMP */
- if (old_state == TASK_UNINTERRUPTIBLE) {
+
+ if (old_state & TASK_UNINTERRUPTIBLE) {
rq->nr_uninterruptible--;
/*
* Tasks waking from uninterruptible sleep are likely
@@ -1460,9 +1663,27 @@
* The remainder of the first timeslice might be recovered by
* the parent if the child exits early enough.
*/
- p->first_time_slice = 1;
+ set_first_time_slice(p);
current->time_slice >>= 1;
p->timestamp = sched_clock();
+
+ /*
+ * Set up slice_info for the child.
+ *
+ * Note: All new tasks receive the benefit of the doubt in that
+ * they begin life with a slice_avg right at the interactive
+ * boundary. They are also born with a maximally expired throttle
+ * stamp, and will lose interactive status after their first slice
+ * if they don't sleep before it expires.
+ */
+ set_slice_avg(p, INTERACTIVE_SLEEP_AVG(p) + p->time_slice);
+ if (unlikely(slice_avg(p) > NS_MAX_SLEEP_AVG))
+ set_slice_avg(p, NS_MAX_SLEEP_AVG);
+ set_last_slice(p, p->time_slice);
+ set_slice_is_new(p);
+ p->last_slice = jiffies;
+ p->throttle_stamp = jiffies - GRACE_MAX;
+
if (unlikely(!current->time_slice)) {
/*
* This case is rare, it happens when the parent has only
@@ -1576,7 +1797,7 @@
* the sleep_avg of the parent as well.
*/
rq = task_rq_lock(p->parent, &flags);
- if (p->first_time_slice && task_cpu(p) == task_cpu(p->parent)) {
+ if (first_time_slice(p) && task_cpu(p) == task_cpu(p->parent)) {
p->parent->time_slice += p->time_slice;
if (unlikely(p->parent->time_slice > task_timeslice(p)))
p->parent->time_slice = task_timeslice(p);
@@ -2657,6 +2878,49 @@
}
/*
+ * Calculate a task's average cpu usage in terms of sleep_avg, and store
+ * it along with the task's last timeslice in slice_info. Must be called
+ * after refreshing the task's time slice.
+ * @p: task for which usage should be calculated.
+ */
+static void recalc_task_slice_avg(task_t *p)
+{
+ unsigned int slice = last_slice(p);
+ unsigned int prev = slice_avg_raw(p);
+ unsigned int this = 100 - cpu_this_slice(p);
+ int delta = max(prev, this) - min(prev, this);
+ int w = MAX_BONUS;
+
+ /*
+ * Weigh by behavior delta magnitude.
+ */
+ w -= delta / w;
+ if (!w)
+ w = 1;
+ this = (w * (prev ? : 1) + this) / (w + 1);
+
+ /*
+ * Update slice_info.
+ */
+ set_slice_avg_raw(p, this);
+ if (slice != p->time_slice)
+ set_last_slice(p, p->time_slice);
+
+ /*
+ * Stamp and tag the new slice.
+ */
+ set_slice_is_new(p);
+ p->last_slice = jiffies;
+
+ /*
+ * And finally, ensure that our hero doesn't ride off
+ * into the sunset.
+ */
+ if (p->throttle_stamp && grace_expired(p, GRACE_MAX))
+ p->throttle_stamp = jiffies - GRACE_MAX;
+}
+
+/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*
@@ -2701,7 +2965,8 @@
*/
if ((p->policy == SCHED_RR) && !--p->time_slice) {
p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;
+ recalc_task_slice_avg(p);
+ clr_first_time_slice(p);
set_tsk_need_resched(p);
/* put it at the end of the queue: */
@@ -2712,9 +2977,10 @@
if (!--p->time_slice) {
dequeue_task(p, rq->active);
set_tsk_need_resched(p);
- p->prio = effective_prio(p);
p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;
+ recalc_task_slice_avg(p);
+ p->prio = effective_prio(p);
+ clr_first_time_slice(p);
if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
@@ -3025,7 +3291,7 @@
* Tasks charged proportionately less run_time at high sleep_avg to
* delay them losing their interactive status
*/
- run_time /= (CURRENT_BONUS(prev) ? : 1);
+ run_time /= BONUS_DIVISOR(prev);
spin_lock_irq(&rq->lock);
@@ -3039,7 +3305,7 @@
unlikely(signal_pending(prev))))
prev->state = TASK_RUNNING;
else {
- if (prev->state == TASK_UNINTERRUPTIBLE)
+ if (prev->state & TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
deactivate_task(prev, rq);
}
@@ -3088,6 +3354,7 @@
rq->best_expired_prio = MAX_PRIO;
}
+repeat_selection:
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);
@@ -3107,8 +3374,14 @@
dequeue_task(next, array);
next->prio = new_prio;
enqueue_task(next, array);
- } else
- requeue_task(next, array);
+
+ /*
+ * We may have just been demoted below other
+ * runnable tasks in our previous queue.
+ */
+ next->sleep_type = SLEEP_NORMAL;
+ goto repeat_selection;
+ }
}
next->sleep_type = SLEEP_NORMAL;
switch_tasks:
@@ -3126,6 +3399,14 @@
prev->sleep_avg = 0;
prev->timestamp = prev->last_ran = now;
+ /*
+ * Tag start of execution of a new timeslice.
+ */
+ if (unlikely(slice_is_new(next))) {
+ next->last_slice = jiffies;
+ clr_slice_is_new(next);
+ }
+
sched_info_switch(prev, next);
if (likely(prev != next)) {
next->timestamp = now;
--- linux-2.6.16-rc3-mm1x/kernel/sysctl.c.org 2006-02-15 08:32:15.000000000 +0100
+++ linux-2.6.16-rc3-mm1x/kernel/sysctl.c 2006-02-15 08:40:37.000000000 +0100
@@ -69,6 +69,8 @@
extern int pid_max_min, pid_max_max;
extern int sysctl_drop_caches;
extern int percpu_pagelist_fraction;
+extern int sched_g1;
+extern int sched_g2;
#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
int unknown_nmi_panic;
@@ -222,6 +224,11 @@
{ .ctl_name = 0 }
};
+/* Constants for minimum and maximum testing in vm_table and
+ * kern_table. We use these as one-element integer vectors. */
+static int zero;
+static int one_hundred = 100;
+
static ctl_table kern_table[] = {
{
.ctl_name = KERN_OSTYPE,
@@ -664,15 +671,29 @@
.proc_handler = &proc_dointvec,
},
#endif
+ {
+ .ctl_name = KERN_SCHED_THROTTLE1,
+ .procname = "sched_g1",
+ .data = &sched_g1,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
+ {
+ .ctl_name = KERN_SCHED_THROTTLE2,
+ .procname = "sched_g2",
+ .data = &sched_g2,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
{ .ctl_name = 0 }
};
-/* Constants for minimum and maximum testing in vm_table.
- We use these as one-element integer vectors. */
-static int zero;
-static int one_hundred = 100;
-
-
static ctl_table vm_table[] = {
{
.ctl_name = VM_OVERCOMMIT_MEMORY,
--- linux-2.6.16-rc3-mm1x/fs/pipe.c.org 2006-02-15 08:32:12.000000000 +0100
+++ linux-2.6.16-rc3-mm1x/fs/pipe.c 2006-02-15 08:40:37.000000000 +0100
@@ -39,11 +39,7 @@
{
DEFINE_WAIT(wait);
- /*
- * Pipes are system-local resources, so sleeping on them
- * is considered a noninteractive wait:
- */
- prepare_to_wait(PIPE_WAIT(*inode), &wait, TASK_INTERRUPTIBLE|TASK_NONINTERACTIVE);
+ prepare_to_wait(PIPE_WAIT(*inode), &wait, TASK_INTERRUPTIBLE);
mutex_unlock(PIPE_MUTEX(*inode));
schedule();
finish_wait(PIPE_WAIT(*inode), &wait);
On Fri, 2006-02-17 at 14:45 +0100, MIke Galbraith wrote:
> Greetings,
>
> Below, please find the latest version of my task throttling attempt.
>
> What the patch addresses:
>
> The current interactivity heuristics are so heavily slanted in favor of
> tasks which sleep at all, that any task which sleeps for 5% of the time
> can use 95% cpu forever. Even if this slant were removed, became linear
> again, any task which sleeps even a tiny bit longer than it runs will
> eventually attain maximum dynamic priority.
>
> That said, the current heuristics work very well in the general case.
> When they fail, however, it can get pretty darn ugly. That is the
> problem space this patch attempts to address, and does pretty well at.
> It tries to solve the nasty corner cases.
>
...
> Comments? Suggestions? Fla^H^H^H (nah... beware what you ask)
Not many comments came back, zero actually. Probably because everyone
was really busy, not because anyone felt a sudden urge to hug the ole
porcelain Buddha after taking a look ;-)
Anyway, below is [imo] a very much nicer version. Functionally,
throttling is fairly airtight now, and interactivity is not affected in
the least afaict. The thing works well. So well that the nasty little
fundamental problem, while not whipped, is off in a corner licking its
wounds ;-)
One kinda nifty feature about the way I do the throttling now is that I
don't need to give a sometimes-hog like X the keys to the city by
defining a huge window of opportunity for its cpu use, and then trying
to figure out if I should throttle it or not when it gets busy. I can
leave a _zero_ length window for it, and it is quite happy. It usually
has such low cpu usage that there are plenty of spare cycles that I let
it save for a rainy day. There is no free lunch though, they _will_ run
out. It's automatic. When the task's savings account is empty, it hits
the wall with a big splat. Interactive tasks naturally build a pool of
spare cycles, whereas tasks which want excessive cpu can't.
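A back-of-the-envelope sketch of that savings account, purely for
illustration (the earn weight and the cap are invented; the real credit
arithmetic lives in refresh_timeslice() in the patch below): time spent
off the cpu earns credit faster than wall-clock time drains it, so a
light task stays pinned at the cap while a hog burns through its reserve
and then stops.

#include <stdio.h>

#define CAP 1000	/* illustrative cap, stands in for the g2 limit */

static long spend_slice(long balance, int run_ticks, int sleep_ticks)
{
	/* the less cpu a task uses, the faster it saves (weight made up) */
	balance += 2 * sleep_ticks;
	/* wall-clock time always drains the account */
	balance -= run_ticks + sleep_ticks;
	if (balance > CAP)
		balance = CAP;
	if (balance < 0)
		balance = 0;
	return balance;
}

int main(void)
{
	long x = CAP, hog = CAP;	/* both start with a full account */
	int slice;

	for (slice = 1; slice <= 50; slice++) {
		x   = spend_slice(x,   5, 95);	/* X-like: 5% cpu */
		hog = spend_slice(hog, 95,  5);	/* hog: 95% cpu */
		if (hog == 0) {
			printf("hog hits the wall after %d slices; "
			       "the X-like task still holds %ld\n", slice, x);
			break;
		}
	}
	return 0;
}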
Even if you think that last patch was too ugly to risk looking again,
brace yourself, and take a peek at it anyway. There may be a nugget or
two to be had even if my implementation erm... isn't the most beautiful
bit of code you've seen lately. The throttling idea works just fine, and
among the low-hanging fruit, there's the fact that I don't miss the
sleep multiplier I ripped out even a tiny little bit. If anyone
actually tries this patch out [recommended], they'll see what I mean.
Try a make -j5 kernel build on a fairly slow nfs mount. That has
absolutely nothing to do with the quite functional throttling, it's the
result of interactivity logic tweaks I was kinda-sorta-forced to do
along the way.
Cheers,
-Mike
fs/pipe.c | 6
include/linux/sched.h | 4
include/linux/sysctl.h | 2
kernel/sched.c | 332 ++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sysctl.c | 36 ++++-
5 files changed, 336 insertions(+), 44 deletions(-)
diffstat looks really obese, but it's not as fat as it looks.
--- linux-2.6.16-rc4-mm1x/include/linux/sched.h.org 2006-02-20 14:07:26.000000000 +0100
+++ linux-2.6.16-rc4-mm1x/include/linux/sched.h 2006-02-23 05:05:20.000000000 +0100
@@ -712,14 +712,14 @@
unsigned short ioprio;
unsigned int btrace_seq;
- unsigned long sleep_avg;
+ unsigned long sleep_avg, last_slice, throttle;
unsigned long long timestamp, last_ran;
unsigned long long sched_time; /* sched_clock time spent running */
enum sleep_type sleep_type;
unsigned long policy;
cpumask_t cpus_allowed;
- unsigned int time_slice, first_time_slice;
+ unsigned int time_slice, slice_info;
#ifdef CONFIG_SCHEDSTATS
struct sched_info sched_info;
--- linux-2.6.16-rc4-mm1x/include/linux/sysctl.h.org 2006-02-20 14:07:36.000000000 +0100
+++ linux-2.6.16-rc4-mm1x/include/linux/sysctl.h 2006-02-20 14:08:28.000000000 +0100
@@ -147,6 +147,8 @@
KERN_SETUID_DUMPABLE=69, /* int: behaviour of dumps for setuid core */
KERN_SPIN_RETRY=70, /* int: number of spinlock retries */
KERN_ACPI_VIDEO_FLAGS=71, /* int: flags for setting up video after ACPI sleep */
+ KERN_SCHED_THROTTLE1=72, /* int: throttling grace period 1 in secs */
+ KERN_SCHED_THROTTLE2=73, /* int: throttling grace period 2 in secs */
};
--- linux-2.6.16-rc4-mm1x/kernel/sched.c.org 2006-02-20 14:07:48.000000000 +0100
+++ linux-2.6.16-rc4-mm1x/kernel/sched.c 2006-02-24 14:38:26.000000000 +0100
@@ -79,6 +79,21 @@
#define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / HZ))
#define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ))
+#if (BITS_PER_LONG < 64)
+#define JIFFIES_TO_NS64(TIME) \
+ ((unsigned long long)(TIME) * ((unsigned long) (1000000000 / HZ)))
+
+#define NS64_TO_JIFFIES(TIME) \
+ ((((unsigned long long)((TIME)) >> BITS_PER_LONG) * \
+ (1 + NS_TO_JIFFIES(~0UL))) + NS_TO_JIFFIES((unsigned long)(TIME)))
+#else /* BITS_PER_LONG < 64 */
+
+#define NS64_TO_JIFFIES(TIME) NS_TO_JIFFIES(TIME)
+#define JIFFIES_TO_NS64(TIME) JIFFIES_TO_NS(TIME)
+
+#endif /* BITS_PER_LONG < 64 */
+
+
/*
* These are the 'tuning knobs' of the scheduler:
*
@@ -151,9 +166,159 @@
#define TASK_INTERACTIVE(p) \
((p)->prio <= (p)->static_prio - DELTA(p))
-#define INTERACTIVE_SLEEP(p) \
- (JIFFIES_TO_NS(MAX_SLEEP_AVG * \
- (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
+/*
+ * Interactive boost can lead to starvation if the decision to
+ * boost a task turns out to be a bad one. To combat this, we
+ * compute the sane upper limit for cpu usage 'slice_avg' based
+ * upon a task's sleep_avg, and use this information combined
+ * with a timer to determine when intervention is required.
+ *
+ * When a task is behaving as its sleep_avg indicates it should,
+ * its throttle is moved forward; otherwise it will time out, and
+ * the task's priority will be lowered.
+ *
+ * /proc/sys/kernel tunables.
+ *
+ * sched_g1: The amount of cpu time in seconds that a new task
+ * will run completely free, i.e. the head start a task
+ * has to get far enough ahead of its timer that it
+ * can avoid being throttled. Each conforming slice
+ * thereafter increases its lead, and vice versa.
+ *
+ * sched_g2: The maximum amount of 'good karma' a task can save
+ * for later use.
+ */
+
+int sched_g1 = 30;
+int sched_g2 = 14400;
+int sched_g2_max = 42949;
+
+#define G1 (sched_g1 * MAX_BONUS * HZ)
+#define G2 (sched_g2 * MAX_BONUS * HZ + G1)
+
+/*
+ * Depth of task hell.
+ */
+#define G3 (MAX_BONUS * G2)
+
+#define grace_expired(p, grace) \
+ (time_after(jiffies, (p)->throttle + (grace)))
+
+/*
+ * Masks for p->slice_info, formerly p->first_time_slice.
+ * SLICE_FTS: 0x80000000 Task is in it's first ever timeslice.
+ * SLICE_NEW: 0x40000000 Slice refreshed.
+ * SLICE_SPA: 0x3FFF8000 Spare bits.
+ * SLICE_LTS: 0x00007F80 Last time slice
+ * SLICE_AVG: 0x0000007F Task slice_avg stored as percentage.
+ */
+#define SLICE_AVG_BITS 7
+#define SLICE_LTS_BITS 10
+#define SLICE_SPA_BITS 13
+#define SLICE_NEW_BITS 1
+#define SLICE_FTS_BITS 1
+
+#define SLICE_AVG_SHIFT 0
+#define SLICE_LTS_SHIFT (SLICE_AVG_SHIFT + SLICE_AVG_BITS)
+#define SLICE_SPA_SHIFT (SLICE_LTS_SHIFT + SLICE_LTS_BITS)
+#define SLICE_NEW_SHIFT (SLICE_SPA_SHIFT + SLICE_SPA_BITS)
+#define SLICE_FTS_SHIFT (SLICE_NEW_SHIFT + SLICE_NEW_BITS)
+
+#define INFO_MASK(x) ((1U << (x))-1)
+#define SLICE_AVG_MASK (INFO_MASK(SLICE_AVG_BITS) << SLICE_AVG_SHIFT)
+#define SLICE_LTS_MASK (INFO_MASK(SLICE_LTS_BITS) << SLICE_LTS_SHIFT)
+#define SLICE_SPA_MASK (INFO_MASK(SLICE_SPA_BITS) << SLICE_SPA_SHIFT)
+#define SLICE_NEW_MASK (INFO_MASK(SLICE_NEW_BITS) << SLICE_NEW_SHIFT)
+#define SLICE_FTS_MASK (INFO_MASK(SLICE_FTS_BITS) << SLICE_FTS_SHIFT)
+
+/*
+ * p->slice_info access macros.
+ */
+#define first_time_slice(p) ((p)->slice_info & SLICE_FTS_MASK)
+#define set_first_time_slice(p) ((p)->slice_info |= SLICE_FTS_MASK)
+#define clr_first_time_slice(p) ((p)->slice_info &= ~SLICE_FTS_MASK)
+
+#define slice_is_new(p) ((p)->slice_info & SLICE_NEW_MASK)
+#define set_slice_is_new(p) ((p)->slice_info |= SLICE_NEW_MASK)
+#define clr_slice_is_new(p) ((p)->slice_info &= ~SLICE_NEW_MASK)
+
+#define last_slice(p) (((p)->slice_info & SLICE_LTS_MASK) >> SLICE_LTS_SHIFT)
+#define set_last_slice(p, n) ((p)->slice_info = (((p)->slice_info & \
+ ~SLICE_LTS_MASK) | (((n) << SLICE_LTS_SHIFT) & SLICE_LTS_MASK)))
+
+#define NS_SLEEP_AVG_PCNT (NS_MAX_SLEEP_AVG / 100)
+
+#define slice_avg(p) ((typeof((p)->sleep_avg)) \
+ ((((p)->slice_info & SLICE_AVG_MASK) >> SLICE_AVG_SHIFT) * \
+ NS_SLEEP_AVG_PCNT))
+#define set_slice_avg(p, n) ((p)->slice_info = (((p)->slice_info & \
+ ~SLICE_AVG_MASK) | ((((n) / NS_SLEEP_AVG_PCNT) \
+ << SLICE_AVG_SHIFT) & SLICE_AVG_MASK)))
+#define slice_avg_raw(p) \
+ (((p)->slice_info & SLICE_AVG_MASK) >> SLICE_AVG_SHIFT)
+#define set_slice_avg_raw(p, n) ((p)->slice_info = (((p)->slice_info & \
+ ~SLICE_AVG_MASK) | (((n) << SLICE_AVG_SHIFT) & SLICE_AVG_MASK)))
+
+/*
+ * cpu usage macros.
+ */
+#define cpu_avg(p) \
+ (100 - slice_avg_raw(p))
+
+#define cpu_max(p) \
+ (100 - ((p)->sleep_avg / NS_SLEEP_AVG_PCNT))
+
+#define time_this_slice(p) \
+ (jiffies - (p)->last_slice)
+
+#define cpu_this_slice(p) \
+ (100 * last_slice(p) / max((unsigned) time_this_slice(p), \
+ (unsigned) last_slice(p)))
+
+/*
+ * CURRENT_BONUS(p) with timeout controlled output.
+ */
+static int current_bonus(task_t *p, unsigned long grace)
+{
+ unsigned long sleep_avg = p->sleep_avg;
+ unsigned long slice_avg = slice_avg(p);
+
+ if (grace_expired(p, grace))
+ sleep_avg = min(sleep_avg, slice_avg);
+ return NS_TO_JIFFIES(sleep_avg) * MAX_BONUS / MAX_SLEEP_AVG;
+}
+
+#define PCNT_PER_DYNPRIO (100 / MAX_BONUS)
+#define NS_PER_DYNPRIO (PCNT_PER_DYNPRIO * NS_SLEEP_AVG_PCNT)
+
+#define SLEEP_AVG_DIVISOR(p) \
+ (grace_expired(p, G2) ? : (1 + CURRENT_BONUS(p)))
+
+#define INTERACTIVE_SLEEP_AVG(p) \
+ (min(JIFFIES_TO_NS(MAX_SLEEP_AVG * (MAX_BONUS / 2 + DELTA(p)) / \
+ MAX_BONUS), NS_MAX_SLEEP_AVG))
+
+/*
+ * Returns whether a quantity of sleep is guaranteed to promote a task to
+ * interactive status, or if already there, to the next dynamic priority
+ * or beyond.
+ */
+static int sleep_time_interactive(task_t *p, unsigned long length)
+{
+ unsigned long ceiling = INTERACTIVE_SLEEP_AVG(p);
+ unsigned long sleep_avg = p->sleep_avg;
+
+ if (length >= ceiling)
+ return 1;
+ if (sleep_avg >= ceiling) {
+ int bonus;
+ if (length >= NS_PER_DYNPRIO)
+ return 1;
+ bonus = CURRENT_BONUS(p);
+ return (sleep_avg + length) / NS_PER_DYNPRIO != bonus;
+ }
+ return sleep_avg + length >= ceiling;
+}
#define TASK_PREEMPTS_CURR(p, rq) \
((p)->prio < (rq)->curr->prio)
@@ -661,7 +826,7 @@
if (rt_task(p))
return p->prio;
- bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
+ bonus = current_bonus(p, G2) - MAX_BONUS / 2;
prio = p->static_prio - bonus;
if (prio < MAX_RT_PRIO)
@@ -767,6 +932,11 @@
unsigned long long __sleep_time = now - p->timestamp;
unsigned long sleep_time;
+ /*
+ * TSC synchronization.
+ */
+ if (unlikely(now < p->timestamp))
+ __sleep_time = 0ULL;
if (unlikely(p->policy == SCHED_BATCH))
sleep_time = 0;
else {
@@ -777,32 +947,48 @@
}
if (likely(sleep_time > 0)) {
+
/*
- * User tasks that sleep a long time are categorised as
- * idle. They will only have their sleep_avg increased to a
+ * Update throttle position.
+ */
+ p->throttle += NS64_TO_JIFFIES(__sleep_time);
+ if (time_before(jiffies, p->throttle))
+ p->throttle = jiffies;
+
+ /*
+ * Tasks that sleep a long time are categorised as idle.
+ * They will only have their sleep_avg increased to a
* level that makes them just interactive priority to stay
* active yet prevent them suddenly becoming cpu hogs and
* starving other processes.
*/
- if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) {
- unsigned long ceiling;
+ if (sleep_time_interactive(p, sleep_time)) {
+ unsigned long ceiling = INTERACTIVE_SLEEP_AVG(p);
+ unsigned long ticks = p->time_slice;
- ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
- DEF_TIMESLICE);
- if (p->sleep_avg < ceiling)
- p->sleep_avg = ceiling;
- } else {
+ /*
+ * Promote previously interactive tasks.
+ */
+ if (p->sleep_avg > ceiling) {
+ ceiling = p->sleep_avg / NS_PER_DYNPRIO;
+ if (ceiling < MAX_BONUS)
+ ceiling++;
+ ceiling *= NS_PER_DYNPRIO;
+ }
/*
- * The lower the sleep avg a task has the more
- * rapidly it will rise with sleep time. This enables
- * tasks to rapidly recover to a low latency priority.
- * If a task was sleeping with the noninteractive
- * label do not apply this non-linear boost
+ * Provide for sustainment.
*/
- if (p->sleep_type != SLEEP_NONINTERACTIVE || !p->mm)
- sleep_time *=
- (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
+ ticks = JIFFIES_TO_NS(ticks) / SLEEP_AVG_DIVISOR(p);
+ if (sleep_time >= ticks)
+ ceiling += ticks;
+ else ceiling += sleep_time;
+ if (ceiling > NS_MAX_SLEEP_AVG)
+ ceiling = NS_MAX_SLEEP_AVG;
+ if (p->sleep_avg < ceiling)
+ p->sleep_avg = ceiling;
+
+ } else {
/*
* This code gives a bonus to interactive tasks.
@@ -1357,7 +1543,8 @@
out_activate:
#endif /* CONFIG_SMP */
- if (old_state == TASK_UNINTERRUPTIBLE) {
+
+ if (old_state & TASK_UNINTERRUPTIBLE) {
rq->nr_uninterruptible--;
/*
* Tasks waking from uninterruptible sleep are likely
@@ -1455,9 +1642,19 @@
* The remainder of the first timeslice might be recovered by
* the parent if the child exits early enough.
*/
- p->first_time_slice = 1;
+ set_first_time_slice(p);
current->time_slice >>= 1;
p->timestamp = sched_clock();
+
+ /*
+ * Set up slice_info for the child.
+ */
+ set_slice_avg(p, p->sleep_avg);
+ set_last_slice(p, p->time_slice);
+ set_slice_is_new(p);
+ p->last_slice = jiffies;
+ p->throttle = jiffies - G2 + G1;
+
if (unlikely(!current->time_slice)) {
/*
* This case is rare, it happens when the parent has only
@@ -1571,7 +1768,7 @@
* the sleep_avg of the parent as well.
*/
rq = task_rq_lock(p->parent, &flags);
- if (p->first_time_slice && task_cpu(p) == task_cpu(p->parent)) {
+ if (first_time_slice(p) && task_cpu(p) == task_cpu(p->parent)) {
p->parent->time_slice += p->time_slice;
if (unlikely(p->parent->time_slice > task_timeslice(p)))
p->parent->time_slice = task_timeslice(p);
@@ -2680,6 +2877,65 @@
cpustat->steal = cputime64_add(cpustat->steal, tmp);
}
+
+#define CURRENT_CPU_BONUS(p) \
+ (100 - (cpu_this_slice(p) * PCNT_PER_DYNPRIO / 100))
+
+/*
+ * Refresh timeslice and associated slice information.
+ * @p: the process to refresh.
+ */
+static void refresh_timeslice(task_t *p)
+{
+ unsigned long slice_time = jiffies - p->last_slice;
+ unsigned int slice = last_slice(p);
+ unsigned int slice_avg, cpu, idle;
+ int w = MAX_BONUS, delta, bonus;
+
+ /*
+ * Update time_slice.
+ */
+ p->time_slice = task_timeslice(p);
+ if (slice != p->time_slice)
+ set_last_slice(p, p->time_slice);
+
+ /*
+ * Update slice_avg.
+ */
+ slice_avg = slice_avg_raw(p);
+ cpu = cpu_this_slice(p);
+ idle = 100 - cpu;
+ delta = max(slice_avg, idle) - min(slice_avg, idle);
+ w -= delta / w;
+ if (!w)
+ w = 1;
+ slice_avg = (w * slice_avg + idle) / (w + 1);
+ set_slice_avg_raw(p, slice_avg);
+
+ /*
+ * Update throttle position.
+ */
+ bonus = CURRENT_CPU_BONUS(p);
+ if (!grace_expired(p, G1) || cpu < cpu_max(p) + PCNT_PER_DYNPRIO)
+ p->throttle += (slice_time - slice) * bonus;
+ else if (cpu > cpu_max(p)) {
+ bonus = (cpu - cpu_max(p)) / PCNT_PER_DYNPRIO;
+ p->throttle -= slice_time * bonus;
+ }
+
+ if (time_before(jiffies, p->throttle))
+ p->throttle = jiffies;
+ else if (grace_expired(p, G3))
+ p->throttle = jiffies - G3;
+
+ /*
+ * And finally, stamp and flag the new slice.
+ */
+ clr_first_time_slice(p);
+ set_slice_is_new(p);
+ p->last_slice = jiffies;
+}
+
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
@@ -2724,8 +2980,7 @@
* FIFO tasks have no timeslices.
*/
if ((p->policy == SCHED_RR) && !--p->time_slice) {
- p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;
+ refresh_timeslice(p);
set_tsk_need_resched(p);
/* put it at the end of the queue: */
@@ -2735,10 +2990,9 @@
}
if (!--p->time_slice) {
dequeue_task(p, rq->active);
- set_tsk_need_resched(p);
+ refresh_timeslice(p);
p->prio = effective_prio(p);
- p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;
+ set_tsk_need_resched(p);
if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
@@ -3049,7 +3303,7 @@
* Tasks charged proportionately less run_time at high sleep_avg to
* delay them losing their interactive status
*/
- run_time /= (CURRENT_BONUS(prev) ? : 1);
+ run_time /= SLEEP_AVG_DIVISOR(prev);
spin_lock_irq(&rq->lock);
@@ -3063,7 +3317,7 @@
unlikely(signal_pending(prev))))
prev->state = TASK_RUNNING;
else {
- if (prev->state == TASK_UNINTERRUPTIBLE)
+ if (prev->state & TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
deactivate_task(prev, rq);
}
@@ -3112,6 +3366,7 @@
rq->best_expired_prio = MAX_PRIO;
}
+repeat_selection:
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);
@@ -3131,6 +3386,13 @@
dequeue_task(next, array);
next->prio = new_prio;
enqueue_task(next, array);
+
+ /*
+ * We may have just been demoted below other
+ * runnable tasks in our previous queue.
+ */
+ next->sleep_type = SLEEP_NORMAL;
+ goto repeat_selection;
}
}
next->sleep_type = SLEEP_NORMAL;
@@ -3149,6 +3411,14 @@
prev->sleep_avg = 0;
prev->timestamp = prev->last_ran = now;
+ /*
+ * Tag start of execution of a new timeslice.
+ */
+ if (unlikely(slice_is_new(next))) {
+ next->last_slice = jiffies;
+ clr_slice_is_new(next);
+ }
+
sched_info_switch(prev, next);
if (likely(prev != next)) {
next->timestamp = now;
--- linux-2.6.16-rc4-mm1x/kernel/sysctl.c.org 2006-02-20 14:07:57.000000000 +0100
+++ linux-2.6.16-rc4-mm1x/kernel/sysctl.c 2006-02-24 14:41:33.000000000 +0100
@@ -72,6 +72,9 @@
extern int pid_max_min, pid_max_max;
extern int sysctl_drop_caches;
extern int percpu_pagelist_fraction;
+extern int sched_g1;
+extern int sched_g2;
+extern int sched_g2_max;
#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
int unknown_nmi_panic;
@@ -225,6 +228,11 @@
{ .ctl_name = 0 }
};
+/* Constants for minimum and maximum testing in vm_table and
+ * kern_table. We use these as one-element integer vectors. */
+static int zero;
+static int one_hundred = 100;
+
static ctl_table kern_table[] = {
{
.ctl_name = KERN_OSTYPE,
@@ -669,15 +677,31 @@
.proc_handler = &proc_dointvec,
},
#endif
+ {
+ .ctl_name = KERN_SCHED_THROTTLE1,
+ .procname = "sched_g1",
+ .data = &sched_g1,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &sched_g2_max,
+ },
+ {
+ .ctl_name = KERN_SCHED_THROTTLE2,
+ .procname = "sched_g2",
+ .data = &sched_g2,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &sched_g2_max,
+ },
{ .ctl_name = 0 }
};
-/* Constants for minimum and maximum testing in vm_table.
- We use these as one-element integer vectors. */
-static int zero;
-static int one_hundred = 100;
-
-
static ctl_table vm_table[] = {
{
.ctl_name = VM_OVERCOMMIT_MEMORY,
--- linux-2.6.16-rc4-mm1x/fs/pipe.c.org 2006-02-20 14:08:09.000000000 +0100
+++ linux-2.6.16-rc4-mm1x/fs/pipe.c 2006-02-20 14:08:28.000000000 +0100
@@ -39,11 +39,7 @@
{
DEFINE_WAIT(wait);
- /*
- * Pipes are system-local resources, so sleeping on them
- * is considered a noninteractive wait:
- */
- prepare_to_wait(PIPE_WAIT(*inode), &wait, TASK_INTERRUPTIBLE|TASK_NONINTERACTIVE);
+ prepare_to_wait(PIPE_WAIT(*inode), &wait, TASK_INTERRUPTIBLE);
mutex_unlock(PIPE_MUTEX(*inode));
schedule();
finish_wait(PIPE_WAIT(*inode), &wait);
MIke Galbraith <[email protected]> wrote:
>
> Not many comments came back, zero actually.
>
That's because everyone's terribly busy chasing down those final bugs so we
get a really great 2.6.16 release (heh, I kill me).
I'm a bit reluctant to add changes like this until we get the smpnice stuff
settled down and validated. I guess that means once Ken's run all his
performance tests across it.
Of course, if Ken does his testing with just mainline+smpnice then any
coupling becomes less of a problem. But I would like to see some feedback
from the other sched developers first.
Andrew Morton wrote:
> MIke Galbraith <[email protected]> wrote:
>
>>Not many comments came back, zero actually.
>>
>
>
> That's because everyone's terribly busy chasing down those final bugs so we
> get a really great 2.6.16 release (heh, I kill me).
>
> I'm a bit reluctant to add changes like this until we get the smpnice stuff
> settled down and validated. I guess that means once Ken's run all his
> performance tests across it.
>
> Of course, if Ken does his testing with just mainline+smpnice then any
> coupling becomes less of a problem. But I would like to see some feedback
> from the other sched developers first.
Personally, I'd rather see PlugSched merged in and this patch be used to
create a new scheduler inside PlugSched. But I'm biased :-)
As I see it, the problem that this patch is addressing is caused by the
fact that the current scheduler is overly complicated. This patch just
makes it more complicated. Some of the schedulers in PlugSched already
handle this problem adequately and some of them are simpler than the
current scheduler -- the intersection of these two sets is not empty.
So now that it's been acknowledged that the current scheduler has
problems, I think that we should be looking at other solutions in
addition to just making the current one more complicated.
Peter
PS I agree this should wait until the current scheduler changes in -mm
have had a chance to settle down and/or migrate to the vanilla kernel.
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
On Sat, 2006-02-25 at 12:16 +1100, Peter Williams wrote:
> Andrew Morton wrote:
> > MIke Galbraith <[email protected]> wrote:
> >
> >>Not many comments came back, zero actually.
> >>
> >
> >
> > That's because everyone's terribly busy chasing down those final bugs so we
> > get a really great 2.6.16 release (heh, I kill me).
> >
> > I'm a bit reluctant to add changes like this until we get the smpnice stuff
> > settled down and validated. I guess that means once Ken's run all his
> > performance tests across it.
> >
> > Of course, if Ken does his testing with just mainline+smpnice then any
> > coupling becomes less of a problem. But I would like to see some feedback
> > from the other sched developers first.
>
> Personally, I'd rather see PlugSched merged in and this patch be used to
> create a new scheduler inside PlugSched. But I'm biased :-)
>
> As I see it, the problem that this patch is addressing is caused by the
> fact that the current scheduler is overly complicated. This patch just
> makes it more complicated.
What's complicated about the scheduler? I see simple/elegant when I
look in there. Interaction with the user is complex, but interactive
feel is a nebulous thing not restricted to this scheduler.
I really don't think this patch adds complexity, quite the opposite
actually. It just does a small bit of tweaking to the scheduler's weak
spot, and adds a dirt simple barrier against starvation. IMO, this
scheduler is not only quite simple, it's one weakness is generally
wonderful for throughput. It's just that it's sometimes a bit _too_
wonderful ;-)
-Mike
On Fri, 2006-02-24 at 14:15 -0800, Andrew Morton wrote:
> MIke Galbraith <[email protected]> wrote:
> >
> > Not many comments came back, zero actually.
> >
>
> That's because everyone's terribly busy chasing down those final bugs so we
> get a really great 2.6.16 release (heh, I kill me).
>
> I'm a bit reluctant to add changes like this until we get the smpnice stuff
> settled down and validated.
I agree 100% on all counts. If this, or even something like it, is
considered, it absolutely does need testing without other flavors in the
soup.
-Mike
Peter Williams wrote:
> Andrew Morton wrote:
>
>> MIke Galbraith <[email protected]> wrote:
>>
>>> Not many comments came back, zero actually.
>>>
>>
>>
>> That's because everyone's terribly busy chasing down those final bugs
>> so we
>> get a really great 2.6.16 release (heh, I kill me).
>>
>> I'm a bit reluctant to add changes like this until we get the smpnice
>> stuff
>> settled down and validated. I guess that means once Ken's run all his
>> performance tests across it.
>>
>> Of course, if Ken does his testing with just mainline+smpnice then any
>> coupling becomes less of a problem. But I would like to see some
>> feedback
>> from the other sched developers first.
>
>
> Personally, I'd rather see PlugSched merged in and this patch be used to
> create a new scheduler inside PlugSched. But I'm biased :-)
>
> As I see it, the problem that this patch is addressing is caused by the
> fact that the current scheduler is overly complicated. This patch just
> makes it more complicated. Some of the schedulers in PlugSched already
> handle this problem adequately and some of them are simpler than the
> current scheduler -- the intersection of these two sets is not empty.
>
> So now that it's been acknowledged that the current scheduler has
> problems, I think that we should be looking at other solutions in
> addition to just making the current one more complicated.
>
I tried this angle years ago and it didn't work :)
--
SUSE Labs, Novell Inc.
On Saturday 25 February 2006 13:42, Nick Piggin wrote:
> Peter Williams wrote:
> > Andrew Morton wrote:
> >> MIke Galbraith <[email protected]> wrote:
> >>> Not many comments came back, zero actually.
> >>
> >> That's because everyone's terribly busy chasing down those final bugs
> >> so we
> >> get a really great 2.6.16 release (heh, I kill me).
> >>
> >> I'm a bit reluctant to add changes like this until we get the smpnice
> >> stuff
> >> settled down and validated. I guess that means once Ken's run all his
> >> performance tests across it.
> >>
> >> Of course, if Ken does his testing with just mainline+smpnice then any
> >> coupling becomes less of a problem. But I would like to see some
> >> feedback
> >> from the other sched developers first.
> >
> > Personally, I'd rather see PlugSched merged in and this patch be used to
> > create a new scheduler inside PlugSched. But I'm biased :-)
> >
> > As I see it, the problem that this patch is addressing is caused by the
> > fact that the current scheduler is overly complicated. This patch just
> > makes it more complicated. Some of the schedulers in PlugSched already
> > handle this problem adequately and some of them are simpler than the
> > current scheduler -- the intersection of these two sets is not empty.
> >
> > So now that it's been acknowledged that the current scheduler has
> > problems, I think that we should be looking at other solutions in
> > addition to just making the current one more complicated.
>
> I tried this angle years ago and it didn't work :)
Our "2.6 forever" policy is why we're stuck with this approach. We tried
alternative implementations in -mm for a while but like all alternatives they
need truckloads more testing to see if they provide a real advantage and
don't cause any regressions. This made it impossible to seriously consider
any alternatives.
I hacked on and pushed plugsched in an attempt to make it possible to work on
an alternative implementation that would make the transition possible in a
stable series. This was vetoed by Linus and Ingo and yourself for the reason
it dilutes developer effort on the current scheduler. Which leaves us with
only continually polishing what is already in place.
None of this is news of course but it helps to set the history for outside
observers of this thread.
Cheers,
Con
Con Kolivas wrote:
> On Saturday 25 February 2006 13:42, Nick Piggin wrote:
>>I tried this angle years ago and it didn't work :)
>
>
> Our "2.6 forever" policy is why we're stuck with this approach. We tried
> alternative implementations in -mm for a while but like all alternatives they
> need truckloads more testing to see if they provide a real advantage and
> don't cause any regressions. This made it impossible to seriously consider
> any alternatives.
>
> I hacked on and pushed plugsched in an attempt to make it possible to work on
> an alternative implementation that would make the transition possible in a
> stable series. This was vetoed by Linus and Ingo and yourself for the reason
> it dilutes developer effort on the current scheduler. Which leaves us with
> only continually polishing what is already in place.
>
Yes. Hence my one-liner.
I still don't think plugsched is that good of an idea for mainline.
Not too many people seem to be unhappy with the scheduler we have,
so just because this little problem comes up I don't think that
means it's time to give up and merge plugsched and 10 other policies.
--
SUSE Labs, Novell Inc.
On Sat, 2006-02-25 at 14:08 +1100, Nick Piggin wrote:
> Con Kolivas wrote:
> > On Saturday 25 February 2006 13:42, Nick Piggin wrote:
>
> >>I tried this angle years ago and it didn't work :)
> >
> >
> > Our "2.6 forever" policy is why we're stuck with this approach. We tried
> > alternative implementations in -mm for a while but like all alternatives they
> > need truckloads more testing to see if they provide a real advantage and
> > don't cause any regressions. This made it impossible to seriously consider
> > any alternatives.
> >
> > I hacked on and pushed plugsched in an attempt to make it possible to work on
> > an alternative implementation that would make the transition possible in a
> > stable series. This was vetoed by Linus and Ingo and yourself for the reason
> > it dilutes developer effort on the current scheduler. Which leaves us with
> > only continually polishing what is already in place.
> >
>
> Yes. Hence my one-liner.
>
> I still don't think plugsched is that good of an idea for mainline.
> Not too many people seem to be unhappy with the scheduler we have,
> so just because this little problem comes up I don't think that
> means it's time to give up and merge plugsched and 10 other policies.
Agreed. The problem is small. Annoying, because it refuses to go away,
but it is a small problem.
-Mike
MIke Galbraith wrote:
> On Fri, 2006-02-17 at 14:45 +0100, MIke Galbraith wrote:
> +/*
> + * Masks for p->slice_info, formerly p->first_time_slice.
> + * SLICE_FTS: 0x80000000 Task is in it's first ever timeslice.
> + * SLICE_NEW: 0x40000000 Slice refreshed.
> + * SLICE_SPA: 0x3FFF8000 Spare bits.
> + * SLICE_LTS: 0x00007F80 Last time slice
> + * SLICE_AVG: 0x0000007F Task slice_avg stored as percentage.
> + */
> +#define SLICE_AVG_BITS 7
> +#define SLICE_LTS_BITS 10
> +#define SLICE_SPA_BITS 13
> +#define SLICE_NEW_BITS 1
> +#define SLICE_FTS_BITS 1
I count 8 and 15 bits in the documentation of LTS/SPA respectively, not
10 and 13.
> }
>
> if (likely(sleep_time > 0)) {
> +
Extra line
> + {
> + .ctl_name = KERN_SCHED_THROTTLE1,
> + .procname = "sched_g1",
> + .data = &sched_g1,
> + .maxlen = sizeof (int),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec_minmax,
> + .strategy = &sysctl_intvec,
> + .extra1 = &zero,
> + .extra2 = &sched_g2_max,
sched_g2_max is possibly badly named, as it is used in connection with
sched_g1 here.
> + },
> + {
> + .ctl_name = KERN_SCHED_THROTTLE2,
> + .procname = "sched_g2",
> + .data = &sched_g2,
> + .maxlen = sizeof (int),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec_minmax,
> + .strategy = &sysctl_intvec,
> + .extra1 = &zero,
> + .extra2 = &sched_g2_max,
> + },
Daniel K.
On Sun, 2006-02-26 at 11:26 +0000, Daniel K. wrote:
> MIke Galbraith wrote:
> > On Fri, 2006-02-17 at 14:45 +0100, MIke Galbraith wrote:
> > +/*
> > + * Masks for p->slice_info, formerly p->first_time_slice.
> > + * SLICE_FTS: 0x80000000 Task is in it's first ever timeslice.
> > + * SLICE_NEW: 0x40000000 Slice refreshed.
> > + * SLICE_SPA: 0x3FFF8000 Spare bits.
> > + * SLICE_LTS: 0x00007F80 Last time slice
> > + * SLICE_AVG: 0x0000007F Task slice_avg stored as percentage.
> > + */
> > +#define SLICE_AVG_BITS 7
> > +#define SLICE_LTS_BITS 10
> > +#define SLICE_SPA_BITS 13
> > +#define SLICE_NEW_BITS 1
> > +#define SLICE_FTS_BITS 1
>
> I count 8 and 15 bits in the documentation of LTS/SPA respectively, not
> 10 and 13.
Dang, fixed the stupid bug, but forgot to wipe the evidence ;-) Fixed.
>
> > }
> >
> > if (likely(sleep_time > 0)) {
> > +
>
> Extra line
Fixed.
>
> > + {
> > + .ctl_name = KERN_SCHED_THROTTLE1,
> > + .procname = "sched_g1",
> > + .data = &sched_g1,
> > + .maxlen = sizeof (int),
> > + .mode = 0644,
> > + .proc_handler = &proc_dointvec_minmax,
> > + .strategy = &sysctl_intvec,
> > + .extra1 = &zero,
> > + .extra2 = &sched_g2_max,
>
> sched_g2_max is possibly badly named, as it is used in connection with
> sched_g1 here.
>
> > + },
> > + {
> > + .ctl_name = KERN_SCHED_THROTTLE2,
> > + .procname = "sched_g2",
> > + .data = &sched_g2,
> > + .maxlen = sizeof (int),
> > + .mode = 0644,
> > + .proc_handler = &proc_dointvec_minmax,
> > + .strategy = &sysctl_intvec,
> > + .extra1 = &zero,
> > + .extra2 = &sched_g2_max,
> > + },
I suppose sched_grace_max would fit better.
Thanks for taking a look.
-Mike
Greetings,
Below, please find part 1 of my latest task throttling effort. I've
very nearly completely reworked it from top to bottom, and broken it
down into separate cleanup and throttling diffs.
Main things that this diff does:
1. Closes a generic hole in the scheduler design: due to the timeslice
sample rate of HZ, tasks can and do steal time from each other.
Generally this is no big deal, because statistics more or less even
things out, but tasks with a high scheduling frequency and a low
execution duration can steal considerable time. No longer.
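A rough illustration of the size of that hole (the HZ value and the
burst length are invented for the example, not taken from the patch):
with a 1 ms tick, a task that wakes just after a tick, runs for 300 us
and sleeps again is never sampled, so its run time is billed to whoever
owns the next tick.

#include <stdio.h>

int main(void)
{
	const long tick_ns  = 1000000;	/* 1 ms tick, i.e. HZ=1000 assumed */
	const long burst_ns = 300000;	/* 300 us burst, entirely between ticks */
	const long bursts_per_sec = 1000;

	long stolen_ns = bursts_per_sec * burst_ns;

	printf("timer ticks that land on the bursty task: 0\n");
	printf("ticks' worth of cpu it actually used each second: %ld\n",
	       stolen_ns / tick_ns);
	return 0;
}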
2. Removes overhead from the fast path. There's no need to do division
in the fast path; it's cheaper to do it at timeslice refresh time, where
it accomplishes the same thing at a fraction of the cost. This trades
the fast-path division for a subtraction, and removes the obsoleted bits
that led to the division.
I have verified that the testcase sent in by David Mosberg ages ago, and
which was the prime motivator for going to nanosecond timings in the
scheduler in the first place, is not broken by the above changes.
3. Removes the disparity in the handling of dynamic priority for kernel
threads versus user-land tasks.
4. Fixes a boot-time buglet where the TSC isn't synchronized yet,
resulting in recalc_task_prio() being called with now < p->timestamp.
If you place a WARN_ON there, the box won't even boot. With this fix,
you'll get one warning, and then all goes fine.
5. Fixes a couple of would-be bugs if anyone ever decided to use the
TASK_NONINTERACTIVE flag along with TASK_UNINTERRUPTIBLE.
6. Removes the sleep_avg multiplier. Back when we had 10 seconds of
dynamic range, this was needed to help get interactive tasks up to
speed. The 10x speedup meant that a 1 second sleep put us at max
priority. Worked great. As we speak, however, we have _1_ second of
dynamic range, and this gets compressed to 100ms by the multiplier.
This is very bad, and to see how bad, just try a very modest parallel
kernel compile on a relatively slow NFS-mounted filesystem. In heavy
testing, I can find no detriment to removing this anachronism.
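The arithmetic behind that compression, in a tiny sketch (DEF_TIMESLICE
of 100 ms and MAX_BONUS of 10 are assumed here, the usual values for
this kernel, not read from the patch):

#include <stdio.h>

int main(void)
{
	const int def_timeslice_ms = 100;	/* assumed */
	const int max_bonus = 10;		/* assumed */
	const int max_sleep_avg_ms = def_timeslice_ms * max_bonus;

	printf("dynamic range: %d ms\n", max_sleep_avg_ms);
	printf("real sleep needed to span it with a %dx multiplier: %d ms\n",
	       max_bonus, max_sleep_avg_ms / max_bonus);
	return 0;
}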
7. Assorted cleanups to the interactivity logic.
8. Whatever I forgot to mention ;-)
Comments?
-Mike
Signed-off-by: Mike Galbraith <[email protected]>
include/linux/sched.h | 3 -
kernel/sched.c | 136 +++++++++++++++++++++++++++++---------------------
2 files changed, 82 insertions(+), 57 deletions(-)
--- linux-2.6.16-rc5-mm2/include/linux/sched.h.org 2006-03-01 15:06:22.000000000 +0100
+++ linux-2.6.16-rc5-mm2/include/linux/sched.h 2006-03-02 08:33:12.000000000 +0100
@@ -720,7 +720,8 @@
unsigned long policy;
cpumask_t cpus_allowed;
- unsigned int time_slice, first_time_slice;
+ int time_slice;
+ unsigned int first_time_slice;
#ifdef CONFIG_SCHEDSTATS
struct sched_info sched_info;
--- linux-2.6.16-rc5-mm2/kernel/sched.c.org 2006-03-01 15:05:56.000000000 +0100
+++ linux-2.6.16-rc5-mm2/kernel/sched.c 2006-03-02 10:05:47.000000000 +0100
@@ -99,6 +99,10 @@
#define MAX_SLEEP_AVG (DEF_TIMESLICE * MAX_BONUS)
#define STARVATION_LIMIT (MAX_SLEEP_AVG)
#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG))
+#define NS_MAX_SLEEP_AVG_PCNT (NS_MAX_SLEEP_AVG / 100)
+#define PCNT_PER_DYNPRIO (100 / MAX_BONUS)
+#define NS_PER_DYNPRIO (PCNT_PER_DYNPRIO * NS_MAX_SLEEP_AVG_PCNT)
+#define NS_TICK (JIFFIES_TO_NS(1))
/*
* If a task is 'interactive' then we reinsert it in the active
@@ -153,9 +157,25 @@
#define TASK_INTERACTIVE(p) \
((p)->prio <= (p)->static_prio - DELTA(p))
-#define INTERACTIVE_SLEEP(p) \
- (JIFFIES_TO_NS(MAX_SLEEP_AVG * \
- (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
+#define SLEEP_AVG_DIVISOR(p) (1 + CURRENT_BONUS(p))
+
+#define INTERACTIVE_SLEEP_AVG(p) \
+ (min(JIFFIES_TO_NS(MAX_SLEEP_AVG * (MAX_BONUS / 2 + DELTA(p)) / \
+ MAX_BONUS), NS_MAX_SLEEP_AVG))
+
+/*
+ * Returns whether a task has been asleep long enough to be considered idle.
+ * The metric is whether this quantity of sleep would promote the task more
+ * than one priority beyond marginally interactive.
+ */
+static int task_interactive_idle(task_t *p, unsigned long sleep_time)
+{
+ unsigned long ceiling = (CURRENT_BONUS(p) + 2) * NS_PER_DYNPRIO;
+
+ if (p->sleep_avg + sleep_time < ceiling)
+ return 0;
+ return p->sleep_avg + sleep_time >= INTERACTIVE_SLEEP_AVG(p);
+}
#define TASK_PREEMPTS_CURR(p, rq) \
((p)->prio < (rq)->curr->prio)
@@ -182,7 +202,7 @@
static inline unsigned int task_timeslice(task_t *p)
{
- return static_prio_timeslice(p->static_prio);
+ return JIFFIES_TO_NS(static_prio_timeslice(p->static_prio));
}
#define task_hot(p, now, sd) ((long long) ((now) - (p)->last_ran) \
@@ -240,6 +260,7 @@
unsigned long expired_timestamp;
unsigned long long timestamp_last_tick;
+ unsigned long long timestamp_last_switch;
task_t *curr, *idle;
struct mm_struct *prev_mm;
prio_array_t *active, *expired, arrays[2];
@@ -777,6 +798,9 @@
unsigned long long __sleep_time = now - p->timestamp;
unsigned long sleep_time;
+ if (unlikely(now < p->timestamp))
+ __sleep_time = 0ULL;
+
if (unlikely(p->policy == SCHED_BATCH))
sleep_time = 0;
else {
@@ -788,32 +812,32 @@
if (likely(sleep_time > 0)) {
/*
- * User tasks that sleep a long time are categorised as
- * idle. They will only have their sleep_avg increased to a
+ * Tasks that sleep a long time are categorised as idle.
+ * They will only have their sleep_avg increased to a
* level that makes them just interactive priority to stay
* active yet prevent them suddenly becoming cpu hogs and
* starving other processes.
*/
- if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) {
- unsigned long ceiling;
-
- ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
- DEF_TIMESLICE);
- if (p->sleep_avg < ceiling)
- p->sleep_avg = ceiling;
- } else {
+ if (task_interactive_idle(p, sleep_time)) {
+ unsigned long ceiling = INTERACTIVE_SLEEP_AVG(p);
/*
- * The lower the sleep avg a task has the more
- * rapidly it will rise with sleep time. This enables
- * tasks to rapidly recover to a low latency priority.
- * If a task was sleeping with the noninteractive
- * label do not apply this non-linear boost
+ * Promote previously interactive task.
*/
- if (p->sleep_type != SLEEP_NONINTERACTIVE || !p->mm)
- sleep_time *=
- (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
+ if (p->sleep_avg > ceiling) {
+ ceiling = p->sleep_avg / NS_PER_DYNPRIO;
+ if (ceiling < MAX_BONUS)
+ ceiling++;
+ ceiling *= NS_PER_DYNPRIO;
+ } else {
+ ceiling += p->time_slice >> 2;
+ if (ceiling > NS_MAX_SLEEP_AVG)
+ ceiling = NS_MAX_SLEEP_AVG;
+ }
+ if (p->sleep_avg < ceiling)
+ p->sleep_avg = ceiling;
+ } else {
/*
* This code gives a bonus to interactive tasks.
*
@@ -1367,7 +1391,8 @@
out_activate:
#endif /* CONFIG_SMP */
- if (old_state == TASK_UNINTERRUPTIBLE) {
+
+ if (old_state & TASK_UNINTERRUPTIBLE) {
rq->nr_uninterruptible--;
/*
* Tasks waking from uninterruptible sleep are likely
@@ -1461,6 +1486,8 @@
*/
local_irq_disable();
p->time_slice = (current->time_slice + 1) >> 1;
+ if (unlikely(p->time_slice < NS_TICK))
+ p->time_slice = NS_TICK;
/*
* The remainder of the first timeslice might be recovered by
* the parent if the child exits early enough.
@@ -1468,13 +1495,12 @@
p->first_time_slice = 1;
current->time_slice >>= 1;
p->timestamp = sched_clock();
- if (unlikely(!current->time_slice)) {
+ if (unlikely(current->time_slice < NS_TICK)) {
/*
* This case is rare, it happens when the parent has only
* a single jiffy left from its timeslice. Taking the
* runqueue lock is not a problem.
*/
- current->time_slice = 1;
scheduler_tick();
}
local_irq_enable();
@@ -2586,6 +2612,7 @@
{
unsigned long long last = max(p->timestamp, rq->timestamp_last_tick);
p->sched_time += now - last;
+ p->time_slice -= now - last;
}
/*
@@ -2735,8 +2762,8 @@
* RR tasks need a special form of timeslice management.
* FIFO tasks have no timeslices.
*/
- if ((p->policy == SCHED_RR) && !--p->time_slice) {
- p->time_slice = task_timeslice(p);
+ if ((p->policy == SCHED_RR) && p->time_slice < NS_TICK) {
+ p->time_slice += task_timeslice(p);
p->first_time_slice = 0;
set_tsk_need_resched(p);
@@ -2745,11 +2772,21 @@
}
goto out_unlock;
}
- if (!--p->time_slice) {
+ if (p->time_slice < NS_TICK) {
+ int time_slice = task_timeslice(p);
+ int run_time = time_slice - p->time_slice;
dequeue_task(p, rq->active);
set_tsk_need_resched(p);
+ p->time_slice += time_slice;
+ /*
+ * Tasks are charged proportionately less run_time at high
+ * sleep_avg to delay them losing their interactive status
+ */
+ run_time /= SLEEP_AVG_DIVISOR(p);
+ p->sleep_avg -= run_time;
+ if ((long)p->sleep_avg < 0)
+ p->sleep_avg = 0;
p->prio = effective_prio(p);
- p->time_slice = task_timeslice(p);
p->first_time_slice = 0;
if (!rq->expired_timestamp)
@@ -2777,13 +2814,17 @@
* This only applies to tasks in the interactive
* delta range with at least TIMESLICE_GRANULARITY to requeue.
*/
- if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
- p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
- (p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
- (p->array == rq->active)) {
+ if (p->array == rq->active) {
+ unsigned long runtime, period;
- requeue_task(p, rq->active);
- set_tsk_need_resched(p);
+ runtime = now - rq->timestamp_last_switch;
+ period = JIFFIES_TO_NS(TIMESLICE_GRANULARITY(p));
+
+ if (runtime >= period && p->time_slice >> 1 >= period) {
+ requeue_task(p, rq->active);
+ set_tsk_need_resched(p);
+ rq->timestamp_last_switch = now;
+ }
}
}
out_unlock:
@@ -2851,7 +2892,8 @@
*/
static inline unsigned long smt_slice(task_t *p, struct sched_domain *sd)
{
- return p->time_slice * (100 - sd->per_cpu_gain) / 100;
+ int time_slice = NS_TO_JIFFIES(p->time_slice) ? : 1;
+ return time_slice * (100 - sd->per_cpu_gain) / 100;
}
static int dependent_sleeper(int this_cpu, runqueue_t *this_rq)
@@ -3014,7 +3056,6 @@
prio_array_t *array;
struct list_head *queue;
unsigned long long now;
- unsigned long run_time;
int cpu, idx, new_prio;
/*
@@ -3050,19 +3091,6 @@
schedstat_inc(rq, sched_cnt);
now = sched_clock();
- if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) {
- run_time = now - prev->timestamp;
- if (unlikely((long long)(now - prev->timestamp) < 0))
- run_time = 0;
- } else
- run_time = NS_MAX_SLEEP_AVG;
-
- /*
- * Tasks charged proportionately less run_time at high sleep_avg to
- * delay them losing their interactive status
- */
- run_time /= (CURRENT_BONUS(prev) ? : 1);
-
spin_lock_irq(&rq->lock);
if (unlikely(prev->flags & PF_DEAD))
@@ -3075,7 +3103,7 @@
unlikely(signal_pending(prev))))
prev->state = TASK_RUNNING;
else {
- if (prev->state == TASK_UNINTERRUPTIBLE)
+ if (prev->state & TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
deactivate_task(prev, rq);
}
@@ -3136,7 +3164,6 @@
if (next->sleep_type == SLEEP_INTERACTIVE)
delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
- array = next->array;
new_prio = recalc_task_prio(next, next->timestamp + delta);
if (unlikely(next->prio != new_prio)) {
@@ -3156,14 +3183,11 @@
update_cpu_clock(prev, rq, now);
- prev->sleep_avg -= run_time;
- if ((long)prev->sleep_avg <= 0)
- prev->sleep_avg = 0;
prev->timestamp = prev->last_ran = now;
sched_info_switch(prev, next);
if (likely(prev != next)) {
- next->timestamp = now;
+ next->timestamp = rq->timestamp_last_switch = now;
rq->nr_switches++;
rq->curr = next;
++*switch_count;
Diff 2 of 2, the throttling itself. The primary change since the last
version is that I no longer dink around with the task's priority
directly. Carefully trimming excess sleep_avg at timeslice refresh time
works just about as well, and makes the patch look a heck of a lot
better.
I've also done away with a division that was in the fast path, and fixed
a bug in the 'concession to interactivity' bits in refresh_timeslice()
which had it working more by accident than by design. With this diff, I
can set grace_g1 to zero seconds, and still have a very nice interactive
kernel. grace_g2 also has to be set to zero to completely disable the
interactivity logic.
Comments?
-Mike
Signed-off-by: Mike Galbraith <[email protected]>
include/linux/sched.h | 4
include/linux/sysctl.h | 2
kernel/sched.c | 249 +++++++++++++++++++++++++++++++++++++++++++++----
kernel/sysctl.c | 36 +++++--
4 files changed, 267 insertions(+), 24 deletions(-)
--- linux-2.6.16-rc5-mm2/include/linux/sched.h.org 2006-02-28 06:11:17.000000000 +0100
+++ linux-2.6.16-rc5-mm2/include/linux/sched.h 2006-02-28 06:11:41.000000000 +0100
@@ -713,7 +713,7 @@
unsigned short ioprio;
unsigned int btrace_seq;
- unsigned long sleep_avg;
+ unsigned long sleep_avg, last_slice, throttle;
unsigned long long timestamp, last_ran;
unsigned long long sched_time; /* sched_clock time spent running */
enum sleep_type sleep_type;
@@ -721,7 +721,7 @@
unsigned long policy;
cpumask_t cpus_allowed;
int time_slice;
- unsigned int first_time_slice;
+ unsigned int slice_info;
#ifdef CONFIG_SCHEDSTATS
struct sched_info sched_info;
--- linux-2.6.16-rc5-mm2/include/linux/sysctl.h.org 2006-02-28 06:11:29.000000000 +0100
+++ linux-2.6.16-rc5-mm2/include/linux/sysctl.h 2006-02-28 06:11:41.000000000 +0100
@@ -148,6 +148,8 @@
KERN_SPIN_RETRY=70, /* int: number of spinlock retries */
KERN_ACPI_VIDEO_FLAGS=71, /* int: flags for setting up video after ACPI sleep */
KERN_IA64_UNALIGNED=72, /* int: ia64 unaligned userland trap enable */
+ KERN_SCHED_THROTTLE1=73, /* int: throttling grace period 1 in secs */
+ KERN_SCHED_THROTTLE2=74, /* int: throttling grace period 2 in secs */
};
--- linux-2.6.16-rc5-mm2/kernel/sched.c.org 2006-02-28 06:10:31.000000000 +0100
+++ linux-2.6.16-rc5-mm2/kernel/sched.c 2006-02-28 06:53:54.000000000 +0100
@@ -80,6 +80,21 @@
#define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / HZ))
#define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ))
+#if (BITS_PER_LONG < 64)
+#define JIFFIES_TO_NS64(TIME) \
+ ((unsigned long long)(TIME) * ((unsigned long) (1000000000 / HZ)))
+
+#define NS64_TO_JIFFIES(TIME) \
+ ((((unsigned long long)((TIME)) >> BITS_PER_LONG) * \
+ (1 + NS_TO_JIFFIES(~0UL))) + NS_TO_JIFFIES((unsigned long)(TIME)))
+#else /* BITS_PER_LONG < 64 */
+
+#define NS64_TO_JIFFIES(TIME) NS_TO_JIFFIES(TIME)
+#define JIFFIES_TO_NS64(TIME) JIFFIES_TO_NS(TIME)
+
+#endif /* BITS_PER_LONG < 64 */
+
+
/*
* These are the 'tuning knobs' of the scheduler:
*
@@ -177,6 +192,115 @@
return p->sleep_avg + sleep_time >= INTERACTIVE_SLEEP_AVG(p);
}
+/*
+ * Interactive boost can lead to starvation if the decision to
+ * boost a task turns out to be a bad one. To combat this, we
+ * compute the sane upper limit for cpu usage 'slice_avg' based
+ * upon a task's sleep_avg, and use this information combined
+ * with a timer to determine when intervention is required.
+ *
+ * When a task is behaving as its sleep_avg indicates it should,
+ * its throttle is moved forward, otherwise it will time out, and
+ * the task's priority will be lowered.
+ *
+ * /proc/sys/kernel tunables.
+ *
+ * grace_g1: The amount of cpu time in seconds that a new task
+ * will run completely free, i.e. the head start a task
+ * has to get far enough ahead of its timer that it
+ * can avoid being throttled. Each conforming slice
+ * thereafter increases its lead, and vice versa.
+ *
+ * grace_g2: The maximum amount of 'good karma' a task can save
+ * for later use.
+ */
+
+int grace_g1 = 10;
+int grace_g2 = 14400;
+int grace_max = 42949;
+
+#define G1 (grace_g1 * MAX_BONUS * HZ)
+#define G2 (grace_g2 * MAX_BONUS * HZ + G1)
+
+/*
+ * Depth of task hell.
+ */
+#define G3 (MAX_BONUS * G2)
+
+#define grace_expired(p, grace) \
+ (time_after(jiffies, (p)->throttle + (grace)))
+
+/*
+ * Masks for p->slice_info, formerly p->first_time_slice.
+ * SLICE_FTS: 0x80000000 Task is in its first ever timeslice.
+ * SLICE_NEW: 0x40000000 Slice refreshed.
+ * SLICE_SPA: 0x3FFE0000 Spare bits.
+ * SLICE_LTS: 0x0001FF80 Last time slice
+ * SLICE_AVG: 0x0000007F Task slice_avg stored as percentage.
+ */
+#define SLICE_AVG_BITS 7
+#define SLICE_LTS_BITS 10
+#define SLICE_SPA_BITS 13
+#define SLICE_NEW_BITS 1
+#define SLICE_FTS_BITS 1
+
+#define SLICE_AVG_SHIFT 0
+#define SLICE_LTS_SHIFT (SLICE_AVG_SHIFT + SLICE_AVG_BITS)
+#define SLICE_SPA_SHIFT (SLICE_LTS_SHIFT + SLICE_LTS_BITS)
+#define SLICE_NEW_SHIFT (SLICE_SPA_SHIFT + SLICE_SPA_BITS)
+#define SLICE_FTS_SHIFT (SLICE_NEW_SHIFT + SLICE_NEW_BITS)
+
+#define INFO_MASK(x) ((1U << (x))-1)
+#define SLICE_AVG_MASK (INFO_MASK(SLICE_AVG_BITS) << SLICE_AVG_SHIFT)
+#define SLICE_LTS_MASK (INFO_MASK(SLICE_LTS_BITS) << SLICE_LTS_SHIFT)
+#define SLICE_SPA_MASK (INFO_MASK(SLICE_SPA_BITS) << SLICE_SPA_SHIFT)
+#define SLICE_NEW_MASK (INFO_MASK(SLICE_NEW_BITS) << SLICE_NEW_SHIFT)
+#define SLICE_FTS_MASK (INFO_MASK(SLICE_FTS_BITS) << SLICE_FTS_SHIFT)
+
+/*
+ * p->slice_info access macros.
+ */
+#define first_time_slice(p) ((p)->slice_info & SLICE_FTS_MASK)
+#define set_first_time_slice(p) ((p)->slice_info |= SLICE_FTS_MASK)
+#define clr_first_time_slice(p) ((p)->slice_info &= ~SLICE_FTS_MASK)
+
+#define slice_is_new(p) ((p)->slice_info & SLICE_NEW_MASK)
+#define set_slice_is_new(p) ((p)->slice_info |= SLICE_NEW_MASK)
+#define clr_slice_is_new(p) ((p)->slice_info &= ~SLICE_NEW_MASK)
+
+#define last_slice(p) (((p)->slice_info & SLICE_LTS_MASK) >> SLICE_LTS_SHIFT)
+#define set_last_slice(p, n) ((p)->slice_info = (((p)->slice_info & \
+ ~SLICE_LTS_MASK) | (((n) << SLICE_LTS_SHIFT) & SLICE_LTS_MASK)))
+
+#define NS_SLEEP_AVG_PCNT (NS_MAX_SLEEP_AVG / 100)
+
+#define slice_avg(p) ((typeof((p)->sleep_avg)) \
+ ((((p)->slice_info & SLICE_AVG_MASK) >> SLICE_AVG_SHIFT) * \
+ NS_SLEEP_AVG_PCNT))
+#define set_slice_avg(p, n) ((p)->slice_info = (((p)->slice_info & \
+ ~SLICE_AVG_MASK) | ((((n) / NS_SLEEP_AVG_PCNT) \
+ << SLICE_AVG_SHIFT) & SLICE_AVG_MASK)))
+#define slice_avg_raw(p) \
+ (((p)->slice_info & SLICE_AVG_MASK) >> SLICE_AVG_SHIFT)
+#define set_slice_avg_raw(p, n) ((p)->slice_info = (((p)->slice_info & \
+ ~SLICE_AVG_MASK) | (((n) << SLICE_AVG_SHIFT) & SLICE_AVG_MASK)))
+
+/*
+ * cpu usage macros.
+ */
+#define cpu_avg(p) \
+ (100 - slice_avg_raw(p))
+
+#define cpu_max(p) \
+ (100 - ((p)->sleep_avg / NS_SLEEP_AVG_PCNT))
+
+#define time_this_slice(p) \
+ (jiffies - (p)->last_slice)
+
+#define cpu_this_slice(p) \
+ (100 * last_slice(p) / max((unsigned) time_this_slice(p), \
+ (unsigned) last_slice(p)))
+
#define TASK_PREEMPTS_CURR(p, rq) \
((p)->prio < (rq)->curr->prio)
@@ -812,6 +936,13 @@
if (likely(sleep_time > 0)) {
/*
+ * Update throttle position.
+ */
+ p->throttle += NS64_TO_JIFFIES(__sleep_time);
+ if (time_before(jiffies, p->throttle))
+ p->throttle = jiffies;
+
+ /*
* Tasks that sleep a long time are categorised as idle.
* They will only have their sleep_avg increased to a
* level that makes them just interactive priority to stay
@@ -1492,9 +1623,19 @@
* The remainder of the first timeslice might be recovered by
* the parent if the child exits early enough.
*/
- p->first_time_slice = 1;
+ set_first_time_slice(p);
current->time_slice >>= 1;
p->timestamp = sched_clock();
+
+ /*
+ * Set up slice_info for the child.
+ */
+ set_slice_avg(p, p->sleep_avg);
+ set_last_slice(p, NS_TO_JIFFIES(p->time_slice));
+ set_slice_is_new(p);
+ p->last_slice = jiffies;
+ p->throttle = jiffies - G2 + G1;
+
if (unlikely(current->time_slice < NS_TICK)) {
/*
* This case is rare, it happens when the parent has only
@@ -1607,7 +1748,7 @@
* the sleep_avg of the parent as well.
*/
rq = task_rq_lock(p->parent, &flags);
- if (p->first_time_slice && task_cpu(p) == task_cpu(p->parent)) {
+ if (first_time_slice(p) && task_cpu(p) == task_cpu(p->parent)) {
p->parent->time_slice += p->time_slice;
if (unlikely(p->parent->time_slice > task_timeslice(p)))
p->parent->time_slice = task_timeslice(p);
@@ -2720,6 +2861,86 @@
}
/*
+ * Refresh timeslice and associated slice information.
+ * @p: the process to refresh.
+ */
+static void refresh_timeslice(task_t *p)
+{
+ unsigned long slice_time = jiffies - p->last_slice;
+ unsigned int slice = last_slice(p);
+ unsigned int slice_avg, cpu, idle;
+ int run_time = JIFFIES_TO_NS(slice) - p->time_slice;
+ int w = MAX_BONUS, delta, bonus;
+
+ /*
+ * Update time_slice.
+ */
+ p->time_slice = task_timeslice(p);
+ set_last_slice(p, NS_TO_JIFFIES(p->time_slice));
+
+ /*
+ * Update sleep_avg.
+ *
+ * Tasks charged proportionately less run_time at high
+ * sleep_avg to delay them losing their interactive status
+ */
+ run_time /= SLEEP_AVG_DIVISOR(p);
+ if (p->sleep_avg >= run_time)
+ p->sleep_avg -= run_time;
+ else p->sleep_avg = 0;
+
+ /*
+ * Update slice_avg.
+ */
+ slice_avg = slice_avg_raw(p);
+ cpu = cpu_this_slice(p);
+ idle = 100 - cpu;
+
+ delta = max(slice_avg, idle) - min(slice_avg, idle);
+ w = 1 + (delta / w);
+ slice_avg = (w * slice_avg + idle) / (w + 1);
+ set_slice_avg_raw(p, slice_avg);
+
+ /*
+ * If we've hit the timeout, we aren't draining enough sleep_avg
+ * to catch up with the task's cpu usage. Up the ante to bring
+ * the task back toward balance. This is important, because it
+ * allows interactive tasks to push their throttle back enough
+ * that they can both sustain, and rapidly recover from throttling
+ * instead of descending into G3.
+ */
+ if (grace_expired(p, G2) && p->sleep_avg > slice_avg(p)) {
+ unsigned long run_time = p->sleep_avg - slice_avg(p);
+ run_time /= w;
+ if (p->sleep_avg >= run_time)
+ p->sleep_avg -= run_time;
+ }
+
+ /*
+ * Update throttle position.
+ */
+ if (cpu < cpu_max(p) + PCNT_PER_DYNPRIO || !grace_expired(p, G1)) {
+ bonus = idle * PCNT_PER_DYNPRIO / 100;
+ p->throttle += (slice_time - slice) * bonus;
+ } else if (cpu >= cpu_max(p) + PCNT_PER_DYNPRIO) {
+ bonus = (cpu - cpu_max(p)) / PCNT_PER_DYNPRIO;
+ p->throttle -= slice_time * bonus;
+ }
+
+ if (time_before(jiffies, p->throttle))
+ p->throttle = jiffies;
+ else if (grace_expired(p, G3))
+ p->throttle = jiffies - G3;
+
+ /*
+ * And finally, stamp and flag the new slice.
+ */
+ clr_first_time_slice(p);
+ set_slice_is_new(p);
+ p->last_slice = jiffies;
+}
+
+/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*
@@ -2763,8 +2984,7 @@
* FIFO tasks have no timeslices.
*/
if ((p->policy == SCHED_RR) && p->time_slice < NS_TICK) {
- p->time_slice += task_timeslice(p);
- p->first_time_slice = 0;
+ refresh_timeslice(p);
set_tsk_need_resched(p);
/* put it at the end of the queue: */
@@ -2773,21 +2993,10 @@
goto out_unlock;
}
if (p->time_slice < NS_TICK) {
- int time_slice = task_timeslice(p);
- int run_time = time_slice - p->time_slice;
dequeue_task(p, rq->active);
set_tsk_need_resched(p);
- p->time_slice += time_slice;
- /*
- * Tasks are charged proportionately less run_time at high
- * sleep_avg to delay them losing their interactive status
- */
- run_time /= SLEEP_AVG_DIVISOR(p);
- p->sleep_avg -= run_time;
- if ((long)p->sleep_avg < 0)
- p->sleep_avg = 0;
+ refresh_timeslice(p);
p->prio = effective_prio(p);
- p->first_time_slice = 0;
if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
@@ -3185,6 +3394,14 @@
prev->timestamp = prev->last_ran = now;
+ /*
+ * Tag start of execution of a new timeslice.
+ */
+ if (unlikely(slice_is_new(next))) {
+ next->last_slice = jiffies;
+ clr_slice_is_new(next);
+ }
+
sched_info_switch(prev, next);
if (likely(prev != next)) {
next->timestamp = rq->timestamp_last_switch = now;
--- linux-2.6.16-rc5-mm2/kernel/sysctl.c.org 2006-02-28 06:10:43.000000000 +0100
+++ linux-2.6.16-rc5-mm2/kernel/sysctl.c 2006-02-28 06:11:41.000000000 +0100
@@ -73,6 +73,9 @@
extern int pid_max_min, pid_max_max;
extern int sysctl_drop_caches;
extern int percpu_pagelist_fraction;
+extern int grace_g1;
+extern int grace_g2;
+extern int grace_max;
#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
int unknown_nmi_panic;
@@ -230,6 +233,11 @@
{ .ctl_name = 0 }
};
+/* Constants for minimum and maximum testing in vm_table and
+ * kern_table. We use these as one-element integer vectors. */
+static int zero;
+static int one_hundred = 100;
+
static ctl_table kern_table[] = {
{
.ctl_name = KERN_OSTYPE,
@@ -684,15 +692,31 @@
.proc_handler = &proc_dointvec,
},
#endif
+ {
+ .ctl_name = KERN_SCHED_THROTTLE1,
+ .procname = "grace_g1",
+ .data = &grace_g1,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &grace_max,
+ },
+ {
+ .ctl_name = KERN_SCHED_THROTTLE2,
+ .procname = "grace_g2",
+ .data = &grace_g2,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &grace_max,
+ },
{ .ctl_name = 0 }
};
-/* Constants for minimum and maximum testing in vm_table.
- We use these as one-element integer vectors. */
-static int zero;
-static int one_hundred = 100;
-
-
static ctl_table vm_table[] = {
{
.ctl_name = VM_OVERCOMMIT_MEMORY,
Mike Galbraith wrote:
> Greetings,
>
> Below, please find part 1 of my latest task throttling effort. I've
> very nearly completely reworked it from top to bottom, and broken it
> down into separate cleanup and a throttling diffs.
>
> Main things that this diff does:
>
> 1. Closes a generic hole in the scheduler design: due to timeslice
> sample rate of HZ, tasks can and do steal time from each other.
> Generally this is no big deal, because statistics more or less even
> things out, but tasks with a high scheduling frequency and a low
> execution duration can steal considerable time. No longer.
>
> 2. Removes overhead from the fast path. There's no need to do division
> in the fast path, it's cheaper to do it at timeslice refresh time, where
> it accomplishes the same thing at a fraction of the cost. Trades a
> subtraction for a division, and removes the obsoleted bits that led to
> the division.
>
> I have verified that the testcase sent in by David Mosberg ages ago, and
> which was the prime motivator for going to nanosecond timings in the
> scheduler in the first place, is not broken by the above changes.
>
> 3. Removes the disparity in the handling of dynamic priority for kernel
> threads verses user-land tasks.
>
> 4. Fixes a boot-time buglet where the TSC isn't synchronized yet,
> resulting in recalc_task_prio() being called with now < p->timestamp.
> If you place a WARN_ON there, the box won't even boot. With this fix,
> you'll get one warning, and then all goes fine.
>
> 5. Fixes a couple of would-be bugs if anyone ever decided to use
> TASK_NONINTERACTIVE thing along with TASK_UNINTERRUPTIBLE.
>
> 6. Removes sleep_avg multiplier. Back when we had 10s of dynamic range,
> this was needed to help get interactive tasks up to speed. The 10 time
> speedup meant that a 1s sleep put us at max priority. Worked great. As
> we speak however, we have _1_ second of dynamic range, and this gets
> compressed to 100ms by the multiplier. This is very bad, and to see how
> bad, just try a very modest parallel kernel compile in a relatively slow
> NFS mounted filesystem. In heavy testing, I can find no detriment to
> removing this anachronism.
>
> 7. Assorted cleanups to the interactivity logic.
>
> 8. Whatever I forgot to mention ;-)
>
> Comments?
If you're going to manage the time slice in nanoseconds, why not do it
properly? I presume you've held back a bit in case you break something?
If it helps, the smpnice balancing code's use of static_prio_timeslice()
doesn't really care what units its return value is in as long as
DEF_TIMESLICE is in the same units and contains the size of a time slice
allocated to a nice==0 non RT task.
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Mike Galbraith wrote:
> include/linux/sched.h | 3 -
> kernel/sched.c | 136 +++++++++++++++++++++++++++++---------------------
> 2 files changed, 82 insertions(+), 57 deletions(-)
>
> --- linux-2.6.16-rc5-mm2/include/linux/sched.h.org 2006-03-01 15:06:22.000000000 +0100
> +++ linux-2.6.16-rc5-mm2/include/linux/sched.h 2006-03-02 08:33:12.000000000 +0100
> @@ -720,7 +720,8 @@
>
> unsigned long policy;
> cpumask_t cpus_allowed;
> - unsigned int time_slice, first_time_slice;
> + int time_slice;
Can you guarantee that int is big enough to hold a time slice in
nanoseconds on all systems? I think that you'll need more than 16 bits.
> + unsigned int first_time_slice;
>
> #ifdef CONFIG_SCHEDSTATS
> struct sched_info sched_info;
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
On Sat, 2006-03-04 at 10:58 +1100, Peter Williams wrote:
> If you're going to manage the time slice in nanoseconds why not do it
> properly? I presume you've held back a bit in case you break something?
>
Do you mean the < NS_TICK thing? The spare change doesn't go away.
> If it helps, the smpnice balancing code's use of static_prio_timeslice()
> doesn't really care what units it's return value is in as long as
> DEF_TIMESLICE is in the same units and contains the size of a time slice
> allocated to a nice==0 non RT task.
Ok, thanks. I wanted to make very certain I couldn't screw it up.
Still, it's simpler to just leave it in ticks.
-Mike
On Saturday 04 March 2006 16:20, Mike Galbraith wrote:
> On Sat, 2006-03-04 at 13:33 +1100, Peter Williams wrote:
> > > include/linux/sched.h | 3 -
> > > kernel/sched.c | 136
> > > +++++++++++++++++++++++++++++--------------------- 2 files changed, 82
> > > insertions(+), 57 deletions(-)
> > >
> > > --- linux-2.6.16-rc5-mm2/include/linux/sched.h.org 2006-03-01
> > > 15:06:22.000000000 +0100 +++
> > > linux-2.6.16-rc5-mm2/include/linux/sched.h 2006-03-02
> > > 08:33:12.000000000 +0100 @@ -720,7 +720,8 @@
> > >
> > > unsigned long policy;
> > > cpumask_t cpus_allowed;
> > > - unsigned int time_slice, first_time_slice;
> > > + int time_slice;
> >
> > Can you guarantee that int is big enough to hold a time slice in
> > nanoseconds on all systems? I think that you'll need more than 16 bits.
>
> Nope, that's a big fat bug.
Most ints are 32bit anyway, but even a 32 bit unsigned int overflows with
nanoseconds at 4.2 seconds. A signed one at about half that. Our timeslices
are never that large, but then int isn't always 32bit either.
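Spelling the arithmetic out as a quick sanity check: 2^32 ns =
4,294,967,296 ns, about 4.29 s, and 2^31 ns = 2,147,483,648 ns, about
2.15 s, so a signed 32 bit time_slice counted in nanoseconds has a bit
over two seconds of headroom before it wraps.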
Cheers,
Con
On Sat, 2006-03-04 at 13:33 +1100, Peter Williams wrote:
> > include/linux/sched.h | 3 -
> > kernel/sched.c | 136 +++++++++++++++++++++++++++++---------------------
> > 2 files changed, 82 insertions(+), 57 deletions(-)
> >
> > --- linux-2.6.16-rc5-mm2/include/linux/sched.h.org 2006-03-01 15:06:22.000000000 +0100
> > +++ linux-2.6.16-rc5-mm2/include/linux/sched.h 2006-03-02 08:33:12.000000000 +0100
> > @@ -720,7 +720,8 @@
> >
> > unsigned long policy;
> > cpumask_t cpus_allowed;
> > - unsigned int time_slice, first_time_slice;
> > + int time_slice;
>
> Can you guarantee that int is big enough to hold a time slice in
> nanoseconds on all systems? I think that you'll need more than 16 bits.
Nope, that's a big fat bug.
I need to reconsider the nanosecond tracking a bit anyway. I was too
quick on the draw with the granularity change. It doesn't do what the
original does, and won't work at all when interrupts become tasks.
To do this properly, I need to maintain a separate tick hit count (a.k.a.
time_slice) and run_time. If I had slice_info in the first patch, I
could store granularity there and not have to add anything to the task
struct, but alas...
-Mike
On Sat, 2006-03-04 at 16:24 +1100, Con Kolivas wrote:
> On Saturday 04 March 2006 16:20, Mike Galbraith wrote:
> > On Sat, 2006-03-04 at 13:33 +1100, Peter Williams wrote:
> > > > include/linux/sched.h | 3 -
> > > > kernel/sched.c | 136
> > > > +++++++++++++++++++++++++++++--------------------- 2 files changed, 82
> > > > insertions(+), 57 deletions(-)
> > > >
> > > > --- linux-2.6.16-rc5-mm2/include/linux/sched.h.org 2006-03-01
> > > > 15:06:22.000000000 +0100 +++
> > > > linux-2.6.16-rc5-mm2/include/linux/sched.h 2006-03-02
> > > > 08:33:12.000000000 +0100 @@ -720,7 +720,8 @@
> > > >
> > > > unsigned long policy;
> > > > cpumask_t cpus_allowed;
> > > > - unsigned int time_slice, first_time_slice;
> > > > + int time_slice;
> > >
> > > Can you guarantee that int is big enough to hold a time slice in
> > > nanoseconds on all systems? I think that you'll need more than 16 bits.
> >
> > Nope, that's a big fat bug.
>
> Most ints are 32bit anyway, but even a 32 bit unsigned int overflows with
> nanoseconds at 4.2 seconds. A signed one at about half that. Our timeslices
> are never that large, but then int isn't always 32bit either.
Yup. I just didn't realize that there were 16 bit integers out there.
-Mike
On Sat, 04 Mar 2006 06:29:47 +0100 Mike Galbraith wrote:
> On Sat, 2006-03-04 at 16:24 +1100, Con Kolivas wrote:
> > On Saturday 04 March 2006 16:20, Mike Galbraith wrote:
> > > On Sat, 2006-03-04 at 13:33 +1100, Peter Williams wrote:
> > > > > include/linux/sched.h | 3 -
> > > > > kernel/sched.c | 136
> > > > > +++++++++++++++++++++++++++++--------------------- 2 files changed, 82
> > > > > insertions(+), 57 deletions(-)
> > > > >
> > > > > --- linux-2.6.16-rc5-mm2/include/linux/sched.h.org 2006-03-01
> > > > > 15:06:22.000000000 +0100 +++
> > > > > linux-2.6.16-rc5-mm2/include/linux/sched.h 2006-03-02
> > > > > 08:33:12.000000000 +0100 @@ -720,7 +720,8 @@
> > > > >
> > > > > unsigned long policy;
> > > > > cpumask_t cpus_allowed;
> > > > > - unsigned int time_slice, first_time_slice;
> > > > > + int time_slice;
> > > >
> > > > Can you guarantee that int is big enough to hold a time slice in
> > > > nanoseconds on all systems? I think that you'll need more than 16 bits.
> > >
> > > Nope, that's a big fat bug.
> >
> > Most ints are 32bit anyway, but even a 32 bit unsigned int overflows with
> > nanoseconds at 4.2 seconds. A signed one at about half that. Our timeslices
> > are never that large, but then int isn't always 32bit either.
>
> Yup. I just didn't realize that there were 16 bit integers out there.
LDD 3rd ed. doesn't know about them either. Same for me.
---
~Randy
On Saturday 04 March 2006 16:40, Randy.Dunlap wrote:
> On Sat, 04 Mar 2006 06:29:47 +0100 Mike Galbraith wrote:
> > On Sat, 2006-03-04 at 16:24 +1100, Con Kolivas wrote:
> > > On Saturday 04 March 2006 16:20, Mike Galbraith wrote:
> > > > On Sat, 2006-03-04 at 13:33 +1100, Peter Williams wrote:
> > > > > > include/linux/sched.h | 3 -
> > > > > > kernel/sched.c | 136
> > > > > > +++++++++++++++++++++++++++++--------------------- 2 files
> > > > > > changed, 82 insertions(+), 57 deletions(-)
> > > > > >
> > > > > > --- linux-2.6.16-rc5-mm2/include/linux/sched.h.org 2006-03-01
> > > > > > 15:06:22.000000000 +0100 +++
> > > > > > linux-2.6.16-rc5-mm2/include/linux/sched.h 2006-03-02
> > > > > > 08:33:12.000000000 +0100 @@ -720,7 +720,8 @@
> > > > > >
> > > > > > unsigned long policy;
> > > > > > cpumask_t cpus_allowed;
> > > > > > - unsigned int time_slice, first_time_slice;
> > > > > > + int time_slice;
> > > > >
> > > > > Can you guarantee that int is big enough to hold a time slice in
> > > > > nanoseconds on all systems? I think that you'll need more than 16
> > > > > bits.
> > > >
> > > > Nope, that's a big fat bug.
> > >
> > > Most ints are 32bit anyway, but even a 32 bit unsigned int overflows
> > > with nanoseconds at 4.2 seconds. A signed one at about half that. Our
> > > timeslices are never that large, but then int isn't always 32bit
> > > either.
> >
> > Yup. I just didn't realize that there were 16 bit integers out there.
>
> LDD 3rd ed. doesn't know about them either. Same for me.
Alright, I made that up, but one day it might not be :P
Cheers,
Con
On Sat, 4 Mar 2006 16:54:58 +1100 Con Kolivas wrote:
> On Saturday 04 March 2006 16:40, Randy.Dunlap wrote:
> > On Sat, 04 Mar 2006 06:29:47 +0100 Mike Galbraith wrote:
> > > On Sat, 2006-03-04 at 16:24 +1100, Con Kolivas wrote:
> > > > On Saturday 04 March 2006 16:20, Mike Galbraith wrote:
> > > > > On Sat, 2006-03-04 at 13:33 +1100, Peter Williams wrote:
> > > > > > > include/linux/sched.h | 3 -
> > > > > > > kernel/sched.c | 136
> > > > > > > +++++++++++++++++++++++++++++--------------------- 2 files
> > > > > > > changed, 82 insertions(+), 57 deletions(-)
> > > > > > >
> > > > > > > --- linux-2.6.16-rc5-mm2/include/linux/sched.h.org 2006-03-01
> > > > > > > 15:06:22.000000000 +0100 +++
> > > > > > > linux-2.6.16-rc5-mm2/include/linux/sched.h 2006-03-02
> > > > > > > 08:33:12.000000000 +0100 @@ -720,7 +720,8 @@
> > > > > > >
> > > > > > > unsigned long policy;
> > > > > > > cpumask_t cpus_allowed;
> > > > > > > - unsigned int time_slice, first_time_slice;
> > > > > > > + int time_slice;
> > > > > >
> > > > > > Can you guarantee that int is big enough to hold a time slice in
> > > > > > nanoseconds on all systems? I think that you'll need more than 16
> > > > > > bits.
> > > > >
> > > > > Nope, that's a big fat bug.
> > > >
> > > > Most ints are 32bit anyway, but even a 32 bit unsigned int overflows
> > > > with nanoseconds at 4.2 seconds. A signed one at about half that. Our
> > > > timeslices are never that large, but then int isn't always 32bit
> > > > either.
> > >
> > > Yup. I just didn't realize that there were 16 bit integers out there.
> >
> > LDD 3rd ed. doesn't know about them either. Same for me.
>
> Alright I made that up, but it might not be one day :P
It *was* one day, long ago. Hopefully not in the future,
but who knows.
---
~Randy
On Sat, 2006-03-04 at 16:54 +1100, Con Kolivas wrote:
> On Saturday 04 March 2006 16:40, Randy.Dunlap wrote:
> > On Sat, 04 Mar 2006 06:29:47 +0100 Mike Galbraith wrote:
> > > On Sat, 2006-03-04 at 16:24 +1100, Con Kolivas wrote:
> > > > On Saturday 04 March 2006 16:20, Mike Galbraith wrote:
> > > > > On Sat, 2006-03-04 at 13:33 +1100, Peter Williams wrote:
> > > > > > > include/linux/sched.h | 3 -
> > > > > > > kernel/sched.c | 136
> > > > > > > +++++++++++++++++++++++++++++--------------------- 2 files
> > > > > > > changed, 82 insertions(+), 57 deletions(-)
> > > > > > >
> > > > > > > --- linux-2.6.16-rc5-mm2/include/linux/sched.h.org 2006-03-01
> > > > > > > 15:06:22.000000000 +0100 +++
> > > > > > > linux-2.6.16-rc5-mm2/include/linux/sched.h 2006-03-02
> > > > > > > 08:33:12.000000000 +0100 @@ -720,7 +720,8 @@
> > > > > > >
> > > > > > > unsigned long policy;
> > > > > > > cpumask_t cpus_allowed;
> > > > > > > - unsigned int time_slice, first_time_slice;
> > > > > > > + int time_slice;
> > > > > >
> > > > > > Can you guarantee that int is big enough to hold a time slice in
> > > > > > nanoseconds on all systems? I think that you'll need more than 16
> > > > > > bits.
> > > > >
> > > > > Nope, that's a big fat bug.
> > > >
> > > > Most ints are 32bit anyway, but even a 32 bit unsigned int overflows
> > > > with nanoseconds at 4.2 seconds. A signed one at about half that. Our
> > > > timeslices are never that large, but then int isn't always 32bit
> > > > either.
> > >
> > > Yup. I just didn't realize that there were 16 bit integers out there.
> >
> > LDD 3rd ed. doesn't know about them either. Same for me.
>
> Alright I made that up, but it might not be one day :P
Well Fudgecicles. Now you guys have gotten me aaaaall confused. Are
there cpus out there (in generic linux land) that have 16 bit integers
or not? 16 bit integers existing in a 32 bit cpu OS seems like an alien
concept to me, but I'm not a twisted cpu designer... I'll just go with
the flow ;-)
-Mike
On Saturday 04 March 2006 17:50, Mike Galbraith wrote:
> On Sat, 2006-03-04 at 16:54 +1100, Con Kolivas wrote:
> > On Saturday 04 March 2006 16:40, Randy.Dunlap wrote:
> > > On Sat, 04 Mar 2006 06:29:47 +0100 Mike Galbraith wrote:
> > > > On Sat, 2006-03-04 at 16:24 +1100, Con Kolivas wrote:
> > > > > On Saturday 04 March 2006 16:20, Mike Galbraith wrote:
> > > > > > On Sat, 2006-03-04 at 13:33 +1100, Peter Williams wrote:
> > > > > > > > include/linux/sched.h | 3 -
> > > > > > > > kernel/sched.c | 136
> > > > > > > > +++++++++++++++++++++++++++++--------------------- 2 files
> > > > > > > > changed, 82 insertions(+), 57 deletions(-)
> > > > > > > >
> > > > > > > > --- linux-2.6.16-rc5-mm2/include/linux/sched.h.org 2006-03-01
> > > > > > > > 15:06:22.000000000 +0100 +++
> > > > > > > > linux-2.6.16-rc5-mm2/include/linux/sched.h 2006-03-02
> > > > > > > > 08:33:12.000000000 +0100 @@ -720,7 +720,8 @@
> > > > > > > >
> > > > > > > > unsigned long policy;
> > > > > > > > cpumask_t cpus_allowed;
> > > > > > > > - unsigned int time_slice, first_time_slice;
> > > > > > > > + int time_slice;
> > > > > > >
> > > > > > > Can you guarantee that int is big enough to hold a time slice
> > > > > > > in nanoseconds on all systems? I think that you'll need more
> > > > > > > than 16 bits.
> > > > > >
> > > > > > Nope, that's a big fat bug.
> > > > >
> > > > > Most ints are 32bit anyway, but even a 32 bit unsigned int
> > > > > overflows with nanoseconds at 4.2 seconds. A signed one at about
> > > > > half that. Our timeslices are never that large, but then int isn't
> > > > > always 32bit either.
> > > >
> > > > Yup. I just didn't realize that there were 16 bit integers out
> > > > there.
> > >
> > > LDD 3rd ed. doesn't know about them either. Same for me.
> >
> > Alright I made that up, but it might not be one day :P
>
> Well Fudgecicles. Now you guys have gotten me aaaaall confused. Are
> there cpus out there (in generic linux land) that have 16 bit integers
> or not? 16 bit integers existing in a 32 bit cpu OS seems like an alien
> concept to me, but I'm not a twisted cpu designer... I'll just go with
> the flow ;-)
All supported architectures on Linux currently use 32 bits for int. That should
give you 2.1 seconds in nanoseconds. Sorry, my legacy of remembering when ints
were 8 bits coloured me.
Cheers,
Con
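For reference, the arithmetic behind those limits, as a quick standalone
userspace sketch (illustrative only, not part of the thread or any patch):

#include <stdio.h>

/* How much time a 32-bit counter can hold when it counts nanoseconds. */
int main(void)
{
	unsigned long long unsigned_max = 4294967295ULL;	/* 2^32 - 1 */
	unsigned long long signed_max = 2147483647ULL;		/* 2^31 - 1 */

	printf("unsigned 32-bit: %.3f seconds\n", unsigned_max / 1e9);	/* ~4.295 */
	printf("signed 32-bit:   %.3f seconds\n", signed_max / 1e9);	/* ~2.147 */
	return 0;
}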
On Sat, 2006-03-04 at 17:50 +1100, Con Kolivas wrote:
> On Saturday 04 March 2006 17:50, Mike Galbraith wrote:
> > Well Fudgecicles. Now you guys have gotten me aaaaall confused. Are
> > there cpus out there (in generic linux land) that have 16 bit integers
> > or not? 16 bit integers existing in a 32 bit cpu OS seems like an alien
> > concept to me, but I'm not a twisted cpu designer... I'll just go with
> > the flow ;-)
>
> All supported architectures on Linux currently use 32 bits for int. That should
> give you 2.1 seconds in nanoseconds. Sorry, my legacy of remembering when ints
> were 8 bits coloured me.
Well that's a relief. I was getting all kinds of frazzled trying to
imagine the kernel not exploding on such a cpu.
-Mike
On Sat, 2006-03-04 at 06:21 +0100, Mike Galbraith wrote:
> I need to reconsider the nanosecond tracking a bit anyway. I was too
> quick on the draw with the granularity change. It doesn't do what the
> original does, and won't work at all when interrupts become tasks.
>
> To do this properly, I need to maintain a separate tick hit count (a.k.a.
> time_slice) and run_time. If I had slice_info in the first patch, I
> could store granularity there and not have to add anything to the task
> struct, but alas...
A la the below. Leave the ticked timeslice available for those places
where nanoseconds are useless, and do precise counting only where that
is useful. I need to come up with a better name than slice_ticks, since
time_slice is already taken.
-Mike
--- linux-2.6.16-rc5-mm2/include/linux/sched.h.org 2006-03-01 15:06:22.000000000 +0100
+++ linux-2.6.16-rc5-mm2/include/linux/sched.h 2006-03-02 08:33:12.000000000 +0100
@@ -720,7 +720,8 @@
unsigned long policy;
cpumask_t cpus_allowed;
- unsigned int time_slice, first_time_slice;
+ int time_slice;
+ unsigned int slice_ticks, first_time_slice;
#ifdef CONFIG_SCHEDSTATS
struct sched_info sched_info;
--- linux-2.6.16-rc5-mm2/kernel/sched.c.org 2006-03-01 15:05:56.000000000 +0100
+++ linux-2.6.16-rc5-mm2/kernel/sched.c 2006-03-02 10:05:47.000000000 +0100
@@ -99,6 +99,10 @@
#define MAX_SLEEP_AVG (DEF_TIMESLICE * MAX_BONUS)
#define STARVATION_LIMIT (MAX_SLEEP_AVG)
#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG))
+#define NS_MAX_SLEEP_AVG_PCNT (NS_MAX_SLEEP_AVG / 100)
+#define PCNT_PER_DYNPRIO (100 / MAX_BONUS)
+#define NS_PER_DYNPRIO (PCNT_PER_DYNPRIO * NS_MAX_SLEEP_AVG_PCNT)
+#define NS_TICK (JIFFIES_TO_NS(1))
/*
* If a task is 'interactive' then we reinsert it in the active
@@ -153,9 +157,25 @@
#define TASK_INTERACTIVE(p) \
((p)->prio <= (p)->static_prio - DELTA(p))
-#define INTERACTIVE_SLEEP(p) \
- (JIFFIES_TO_NS(MAX_SLEEP_AVG * \
- (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
+#define SLEEP_AVG_DIVISOR(p) (1 + CURRENT_BONUS(p))
+
+#define INTERACTIVE_SLEEP_AVG(p) \
+ (min(JIFFIES_TO_NS(MAX_SLEEP_AVG * (MAX_BONUS / 2 + DELTA(p)) / \
+ MAX_BONUS), NS_MAX_SLEEP_AVG))
+
+/*
+ * Returns whether a task has been asleep long enough to be considered idle.
+ * The metric is whether this quantity of sleep would promote the task more
+ * than one priority beyond marginally interactive.
+ */
+static int task_interactive_idle(task_t *p, unsigned long sleep_time)
+{
+ unsigned long ceiling = (CURRENT_BONUS(p) + 2) * NS_PER_DYNPRIO;
+
+ if (p->sleep_avg + sleep_time < ceiling)
+ return 0;
+ return p->sleep_avg + sleep_time >= INTERACTIVE_SLEEP_AVG(p);
+}
#define TASK_PREEMPTS_CURR(p, rq) \
((p)->prio < (rq)->curr->prio)
@@ -182,7 +202,7 @@
static inline unsigned int task_timeslice(task_t *p)
{
- return static_prio_timeslice(p->static_prio);
+ return JIFFIES_TO_NS(static_prio_timeslice(p->static_prio));
}
#define task_hot(p, now, sd) ((long long) ((now) - (p)->last_ran) \
@@ -777,6 +797,9 @@
unsigned long long __sleep_time = now - p->timestamp;
unsigned long sleep_time;
+ if (unlikely(now < p->timestamp))
+ __sleep_time = 0ULL;
+
if (unlikely(p->policy == SCHED_BATCH))
sleep_time = 0;
else {
@@ -788,32 +811,32 @@
if (likely(sleep_time > 0)) {
/*
- * User tasks that sleep a long time are categorised as
- * idle. They will only have their sleep_avg increased to a
+ * Tasks that sleep a long time are categorised as idle.
+ * They will only have their sleep_avg increased to a
* level that makes them just interactive priority to stay
* active yet prevent them suddenly becoming cpu hogs and
* starving other processes.
*/
- if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) {
- unsigned long ceiling;
-
- ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
- DEF_TIMESLICE);
- if (p->sleep_avg < ceiling)
- p->sleep_avg = ceiling;
- } else {
+ if (task_interactive_idle(p, sleep_time)) {
+ unsigned long ceiling = INTERACTIVE_SLEEP_AVG(p);
/*
- * The lower the sleep avg a task has the more
- * rapidly it will rise with sleep time. This enables
- * tasks to rapidly recover to a low latency priority.
- * If a task was sleeping with the noninteractive
- * label do not apply this non-linear boost
+ * Promote previously interactive task.
*/
- if (p->sleep_type != SLEEP_NONINTERACTIVE || !p->mm)
- sleep_time *=
- (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
+ if (p->sleep_avg > ceiling) {
+ ceiling = p->sleep_avg / NS_PER_DYNPRIO;
+ if (ceiling < MAX_BONUS)
+ ceiling++;
+ ceiling *= NS_PER_DYNPRIO;
+ } else {
+ ceiling += p->time_slice >> 2;
+ if (ceiling > NS_MAX_SLEEP_AVG)
+ ceiling = NS_MAX_SLEEP_AVG;
+ }
+ if (p->sleep_avg < ceiling)
+ p->sleep_avg = ceiling;
+ } else {
/*
* This code gives a bonus to interactive tasks.
*
@@ -1367,7 +1390,8 @@
out_activate:
#endif /* CONFIG_SMP */
- if (old_state == TASK_UNINTERRUPTIBLE) {
+
+ if (old_state & TASK_UNINTERRUPTIBLE) {
rq->nr_uninterruptible--;
/*
* Tasks waking from uninterruptible sleep are likely
@@ -1461,6 +1485,8 @@
*/
local_irq_disable();
p->time_slice = (current->time_slice + 1) >> 1;
+ if (unlikely(p->time_slice < NS_TICK))
+ p->time_slice = NS_TICK;
/*
* The remainder of the first timeslice might be recovered by
* the parent if the child exits early enough.
@@ -1468,13 +1494,12 @@
p->first_time_slice = 1;
current->time_slice >>= 1;
p->timestamp = sched_clock();
- if (unlikely(!current->time_slice)) {
+ if (unlikely(current->time_slice < NS_TICK)) {
/*
* This case is rare, it happens when the parent has only
* a single jiffy left from its timeslice. Taking the
* runqueue lock is not a problem.
*/
- current->time_slice = 1;
scheduler_tick();
}
local_irq_enable();
@@ -2586,6 +2611,7 @@
{
unsigned long long last = max(p->timestamp, rq->timestamp_last_tick);
p->sched_time += now - last;
+ p->time_slice -= now - last;
}
/*
@@ -2730,13 +2756,16 @@
* timeslice. This makes it possible for interactive tasks
* to use up their timeslices at their highest priority levels.
*/
+ if (p->slice_ticks > 1)
+ p->slice_ticks--;
if (rt_task(p)) {
/*
* RR tasks need a special form of timeslice management.
* FIFO tasks have no timeslices.
*/
- if ((p->policy == SCHED_RR) && !--p->time_slice) {
- p->time_slice = task_timeslice(p);
+ if ((p->policy == SCHED_RR) && p->time_slice < NS_TICK) {
+ p->time_slice += task_timeslice(p);
+ p->slice_ticks = NS_TO_JIFFIES(p->time_slice);
p->first_time_slice = 0;
set_tsk_need_resched(p);
@@ -2745,11 +2774,22 @@
}
goto out_unlock;
}
- if (!--p->time_slice) {
+ if (p->time_slice < NS_TICK) {
+ unsigned int time_slice = task_timeslice(p);
+ int run_time = time_slice - p->time_slice;
dequeue_task(p, rq->active);
set_tsk_need_resched(p);
+ p->time_slice += time_slice;
+ p->slice_ticks = NS_TO_JIFFIES(p->time_slice);
+ /*
+ * Tasks are charged proportionately less run_time at high
+ * sleep_avg to delay them losing their interactive status
+ */
+ run_time /= SLEEP_AVG_DIVISOR(p);
+ p->sleep_avg -= run_time;
+ if ((long)p->sleep_avg < 0)
+ p->sleep_avg = 0;
p->prio = effective_prio(p);
- p->time_slice = task_timeslice(p);
p->first_time_slice = 0;
if (!rq->expired_timestamp)
@@ -2777,13 +2817,12 @@
* This only applies to tasks in the interactive
* delta range with at least TIMESLICE_GRANULARITY to requeue.
*/
- if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
- p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
- (p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
+ if (TASK_INTERACTIVE(p) && !((NS_TO_JIFFIES(task_timeslice(p)) -
+ p->slice_ticks) % TIMESLICE_GRANULARITY(p)) &&
+ (p->slice_ticks >= TIMESLICE_GRANULARITY(p)) &&
(p->array == rq->active)) {
-
- requeue_task(p, rq->active);
- set_tsk_need_resched(p);
+ requeue_task(p, rq->active);
+ set_tsk_need_resched(p);
}
}
out_unlock:
@@ -2851,7 +2890,7 @@
*/
static inline unsigned long smt_slice(task_t *p, struct sched_domain *sd)
{
- return p->time_slice * (100 - sd->per_cpu_gain) / 100;
+ return p->slice_ticks * (100 - sd->per_cpu_gain) / 100;
}
static int dependent_sleeper(int this_cpu, runqueue_t *this_rq)
@@ -3014,7 +3053,6 @@
prio_array_t *array;
struct list_head *queue;
unsigned long long now;
- unsigned long run_time;
int cpu, idx, new_prio;
/*
@@ -3050,19 +3088,6 @@
schedstat_inc(rq, sched_cnt);
now = sched_clock();
- if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) {
- run_time = now - prev->timestamp;
- if (unlikely((long long)(now - prev->timestamp) < 0))
- run_time = 0;
- } else
- run_time = NS_MAX_SLEEP_AVG;
-
- /*
- * Tasks charged proportionately less run_time at high sleep_avg to
- * delay them losing their interactive status
- */
- run_time /= (CURRENT_BONUS(prev) ? : 1);
-
spin_lock_irq(&rq->lock);
if (unlikely(prev->flags & PF_DEAD))
@@ -3075,7 +3100,7 @@
unlikely(signal_pending(prev))))
prev->state = TASK_RUNNING;
else {
- if (prev->state == TASK_UNINTERRUPTIBLE)
+ if (prev->state & TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
deactivate_task(prev, rq);
}
@@ -3136,7 +3161,6 @@
if (next->sleep_type == SLEEP_INTERACTIVE)
delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
- array = next->array;
new_prio = recalc_task_prio(next, next->timestamp + delta);
if (unlikely(next->prio != new_prio)) {
@@ -3156,9 +3180,6 @@
update_cpu_clock(prev, rq, now);
- prev->sleep_avg -= run_time;
- if ((long)prev->sleep_avg <= 0)
- prev->sleep_avg = 0;
prev->timestamp = prev->last_ran = now;
sched_info_switch(prev, next);
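A note on the patch above: p->time_slice now counts down in nanoseconds while
p->slice_ticks keeps the coarse per-tick view, and the refill adds the new
slice onto whatever sub-tick remainder is left rather than overwriting it (the
"spare change" mentioned in the follow-up below). A rough sketch of the moving
parts; the conversion macros are recalled from the 2.6 sched.c sources, and the
helper name is invented:

/* Conversion helpers the patch leans on (recalled from 2.6 sched.c,
 * so the exact definitions in this tree may differ slightly). */
#define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
#define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
#define NS_TICK			(JIFFIES_TO_NS(1))

/* Shape of the refill done in the scheduler_tick() hunk: the new slice
 * is added onto the leftover sub-tick remainder, then the coarse tick
 * count is rederived from the precise nanosecond value. */
static void refill_slice(task_t *p)
{
	p->time_slice += task_timeslice(p);		/* nanoseconds */
	p->slice_ticks = NS_TO_JIFFIES(p->time_slice);	/* jiffies */
}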
Mike Galbraith wrote:
> On Sat, 2006-03-04 at 10:58 +1100, Peter Williams wrote:
>
>
>>If you're going to manage the time slice in nanoseconds why not do it
>>properly? I presume you've held back a bit in case you break something?
>>
>
>
> Do you mean the < NS_TICK thing? The spare change doesn't go away.
Not exactly. I mean "Why calculate time slice in jiffies and convert to
nanoseconds? Why not just do the calculation in nanoseconds?"
>
>
>>If it helps, the smpnice balancing code's use of static_prio_timeslice()
>>doesn't really care what units its return value is in as long as
>>DEF_TIMESLICE is in the same units and contains the size of a time slice
>>allocated to a nice==0 non RT task.
>
>
> Ok, thanks. I wanted to make very certain I couldn't screw it up.
> Still, it's simpler to just leave it in ticks.
>
> -Mike
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
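To make the units point concrete, a hypothetical sketch (the function name is
invented; this is not the smpnice code itself): because only the ratio of a
task's slice to DEF_TIMESLICE matters, the result is the same whether both are
expressed in ticks or in nanoseconds.

static unsigned long long slice_load_ratio(int static_prio)
{
	/* 64-bit intermediate so the same code works whichever units
	 * static_prio_timeslice() and DEF_TIMESLICE share. */
	return (unsigned long long)static_prio_timeslice(static_prio)
			* SCHED_LOAD_SCALE / DEF_TIMESLICE;
}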
Mike Galbraith wrote:
> On Sat, 2006-03-04 at 16:54 +1100, Con Kolivas wrote:
>
>>On Saturday 04 March 2006 16:40, Randy.Dunlap wrote:
>>
>>>On Sat, 04 Mar 2006 06:29:47 +0100 Mike Galbraith wrote:
>>>
>>>>On Sat, 2006-03-04 at 16:24 +1100, Con Kolivas wrote:
>>>>
>>>>>On Saturday 04 March 2006 16:20, Mike Galbraith wrote:
>>>>>
>>>>>>On Sat, 2006-03-04 at 13:33 +1100, Peter Williams wrote:
>>>>>>
>>>>>>>> include/linux/sched.h | 3 -
>>>>>>>> kernel/sched.c | 136
>>>>>>>>+++++++++++++++++++++++++++++--------------------- 2 files
>>>>>>>>changed, 82 insertions(+), 57 deletions(-)
>>>>>>>>
>>>>>>>>--- linux-2.6.16-rc5-mm2/include/linux/sched.h.org 2006-03-01
>>>>>>>>15:06:22.000000000 +0100 +++
>>>>>>>>linux-2.6.16-rc5-mm2/include/linux/sched.h 2006-03-02
>>>>>>>>08:33:12.000000000 +0100 @@ -720,7 +720,8 @@
>>>>>>>>
>>>>>>>> unsigned long policy;
>>>>>>>> cpumask_t cpus_allowed;
>>>>>>>>- unsigned int time_slice, first_time_slice;
>>>>>>>>+ int time_slice;
>>>>>>>
>>>>>>>Can you guarantee that int is big enough to hold a time slice in
>>>>>>>nanoseconds on all systems? I think that you'll need more than 16
>>>>>>>bits.
>>>>>>
>>>>>>Nope, that's a big fat bug.
>>>>>
>>>>>Most ints are 32bit anyway, but even a 32 bit unsigned int overflows
>>>>>with nanoseconds at 4.2 seconds. A signed one at about half that. Our
>>>>>timeslices are never that large, but then int isn't always 32bit
>>>>>either.
>>>>
>>>>Yup. I just didn't realize that there were 16 bit integers out there.
>>>
>>>LDD 3rd ed. doesn't know about them either. Same for me.
>>
>>Alright I made that up, but it might not be one day :P
>
>
> Well Fudgecicles. Now you guys have gotten me aaaaall confused. Are
> there cpus out there (in generic linux land) that have 16 bit integers
> or not? 16 bit integers existing in a 32 bit cpu OS seems like an alien
> concept to me, but I'm not a twisted cpu designer... I'll just go with
> the flow ;-)
I'm not sure, which is why I asked. But it seems to be a convention (in
the kernel code) to use long or unsigned long when you want to ensure at
least 32 bits. This may be a legacy from the days when there were
systems with 16 bit integers, but it seems (to me) to be alive and well.
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
On Sun, 2006-03-05 at 08:37 +1100, Peter Williams wrote:
> Mike Galbraith wrote:
> > On Sat, 2006-03-04 at 10:58 +1100, Peter Williams wrote:
> >
> >
> >>If you're going to manage the time slice in nanoseconds why not do it
> >>properly? I presume you've held back a bit in case you break something?
> >>
> >
> >
> > Do you mean the < NS_TICK thing? The spare change doesn't go away.
>
> Not exactly. I mean "Why calculate time slice in jiffies and convert to
> nanoseconds? Why not just do the calculation in nanoseconds?"
Duh. Good question. Thanks :)
-Mike
On Sun, 2006-03-05 at 08:37 +1100, Peter Williams wrote:
> Mike Galbraith wrote:
> > On Sat, 2006-03-04 at 10:58 +1100, Peter Williams wrote:
> >
> >
> >>If you're going to manage the time slice in nanoseconds why not do it
> >>properly? I presume you've held back a bit in case you break something?
> >>
> >
> >
> > Do you mean the < NS_TICK thing? The spare change doesn't go away.
>
> Not exactly. I mean "Why calculate time slice in jiffies and convert to
> nanoseconds? Why not just do the calculation in nanoseconds?"
Turns out that my first instinct was right, and there is a good reason
not to: it doesn't improve readability or do anything functional, it
only adds clutter. I much prefer the look of plain old ticks, and
having nanoseconds intrude only where they're required. I did change
NS_TICK to the less obfuscated (1000000000 / HZ), with task_timeslice()
returning a more readable ticks * NS_TICK conversion.
-Mike
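Concretely, that change amounts to something like the following sketch, based
on the description above rather than the final patch:

#define NS_TICK		(1000000000 / HZ)	/* one timer tick, in nanoseconds */

static inline unsigned int task_timeslice(task_t *p)
{
	/* the slice is still computed in ticks; convert once, here */
	return static_prio_timeslice(p->static_prio) * NS_TICK;
}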
Con Kolivas wrote:
> On Saturday 04 March 2006 17:50, Mike Galbraith wrote:
>
>>On Sat, 2006-03-04 at 16:54 +1100, Con Kolivas wrote:
>>
>>>On Saturday 04 March 2006 16:40, Randy.Dunlap wrote:
>>>
>>>>On Sat, 04 Mar 2006 06:29:47 +0100 Mike Galbraith wrote:
>>>>
>>>>>On Sat, 2006-03-04 at 16:24 +1100, Con Kolivas wrote:
>>>>>
>>>>>>On Saturday 04 March 2006 16:20, Mike Galbraith wrote:
>>>>>>
>>>>>>>On Sat, 2006-03-04 at 13:33 +1100, Peter Williams wrote:
>>>>>>>
>>>>>>>>> include/linux/sched.h | 3 -
>>>>>>>>> kernel/sched.c | 136
>>>>>>>>>+++++++++++++++++++++++++++++--------------------- 2 files
>>>>>>>>>changed, 82 insertions(+), 57 deletions(-)
>>>>>>>>>
>>>>>>>>>--- linux-2.6.16-rc5-mm2/include/linux/sched.h.org 2006-03-01
>>>>>>>>>15:06:22.000000000 +0100 +++
>>>>>>>>>linux-2.6.16-rc5-mm2/include/linux/sched.h 2006-03-02
>>>>>>>>>08:33:12.000000000 +0100 @@ -720,7 +720,8 @@
>>>>>>>>>
>>>>>>>>> unsigned long policy;
>>>>>>>>> cpumask_t cpus_allowed;
>>>>>>>>>- unsigned int time_slice, first_time_slice;
>>>>>>>>>+ int time_slice;
>>>>>>>>
>>>>>>>>Can you guarantee that int is big enough to hold a time slice
>>>>>>>>in nanoseconds on all systems? I think that you'll need more
>>>>>>>>than 16 bits.
>>>>>>>
>>>>>>>Nope, that's a big fat bug.
>>>>>>
>>>>>>Most ints are 32bit anyway, but even a 32 bit unsigned int
>>>>>>overflows with nanoseconds at 4.2 seconds. A signed one at about
>>>>>>half that. Our timeslices are never that large, but then int isn't
>>>>>>always 32bit either.
>>>>>
>>>>>Yup. I just didn't realize that there were 16 bit integers out
>>>>>there.
>>>>
>>>>LDD 3rd ed. doesn't know about them either. Same for me.
>>>
>>>Alright I made that up, but it might not be one day :P
>>
>>Well Fudgecicles. Now you guys have gotten me aaaaall confused. Are
>>there cpus out there (in generic linux land) that have 16 bit integers
>>or not? 16 bit integers existing in a 32 bit cpu OS seems like an alien
>>concept to me, but I'm not a twisted cpu designer... I'll just go with
>>the flow ;-)
>
>
> All supported architectures on Linux currently use 32 bits for int. That should
> give you 2.1 seconds in nanoseconds. Sorry, my legacy of remembering when ints
> were 8 bits coloured me.
>
> Cheers,
> Con
The size of int isn't just a function of the architecture; it's also a
function of the C compiler used. C requires that longs be at least 32
bits but only requires that ints be at least 16 bits. If the
architecture supports 16 bit integer operations, there's nothing to stop
a VALID compiler from making ints only 16 bits. Since everyone uses gcc
(at the moment) it's probably not an (urgent) issue, but it seems to me
that the safe option is to use longs when you want to ensure that you
get at least 32 bits.
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
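A standalone userspace-style illustration of that point (not from any of the
patches in this thread):

#include <limits.h>

/*
 * C only guarantees 16 bits for int but at least 32 for long, so a
 * nanosecond-sized slice is safest declared long -- or guarded at
 * compile time if int must be used.
 */
#if INT_MAX < 2147483647
#error "a nanosecond time_slice needs a 32-bit type on this target"
#endif

static long time_slice_ns;	/* at least 32 bits on any conforming compiler */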