2006-03-24 11:02:43

by Mike Galbraith

Subject: [2.6.16-mm1 patch] throttling tree patches

Greetings,

I've broken down my throttling tree into 6 patches, which I'll send as
replies to this start-point.

Patch 1/6

Ignore timewarps caused by SMP timestamp rounding. Also, don't stamp a
task with a computed timestamp, stamp with the already called clock.

Signed-off-by: Mike Galbraith <[email protected]>

--- linux-2.6.16-mm1/kernel/sched.c.org 2006-03-23 15:01:41.000000000 +0100
+++ linux-2.6.16-mm1/kernel/sched.c 2006-03-23 15:02:25.000000000 +0100
@@ -805,6 +805,16 @@
unsigned long long __sleep_time = now - p->timestamp;
unsigned long sleep_time;

+ /*
+ * On SMP systems, a task can go to sleep on one CPU and
+ * wake up on another. When this happens, the timestamp
+ * is rounded to the nearest tick, which can lead to now
+ * being less than p->timestamp for short sleeps. Ignore
+ * these, they're insignificant.
+ */
+ if (unlikely(now < p->timestamp))
+ __sleep_time = 0ULL;
+
if (batch_task(p))
sleep_time = 0;
else {
@@ -871,20 +881,20 @@
*/
static void activate_task(task_t *p, runqueue_t *rq, int local)
{
- unsigned long long now;
+ unsigned long long now, comp;

- now = sched_clock();
+ now = comp = sched_clock();
#ifdef CONFIG_SMP
if (!local) {
/* Compensate for drifting sched_clock */
runqueue_t *this_rq = this_rq();
- now = (now - this_rq->timestamp_last_tick)
+ comp = (now - this_rq->timestamp_last_tick)
+ rq->timestamp_last_tick;
}
#endif

if (!rt_task(p))
- p->prio = recalc_task_prio(p, now);
+ p->prio = recalc_task_prio(p, comp);

/*
* This checks to make sure it's not an uninterruptible task



2006-03-24 11:07:05

by Mike Galbraith

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

patch 2/6

This patch just fixes a bug waiting for a place to happen. If anyone
ever decides to use TASK_NONINTERACTIVE along with TASK_UNINTERRUPTIBLE,
bad things will happen.
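
For illustration, a minimal user-space sketch of the failure mode; the flag
values are assumptions meant to mirror include/linux/sched.h, the point is
only the '==' versus '&' test:

#include <stdio.h>

/* assumed flag values, for illustration only */
#define TASK_UNINTERRUPTIBLE    2
#define TASK_NONINTERACTIVE     64

int main(void)
{
        long state = TASK_UNINTERRUPTIBLE | TASK_NONINTERACTIVE;

        /* the '==' test misses the combined state, the '&' test catches it */
        printf("==: %d\n", state == TASK_UNINTERRUPTIBLE);     /* prints 0 */
        printf("& : %d\n", !!(state & TASK_UNINTERRUPTIBLE));   /* prints 1 */
        return 0;
}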

Signed-off-by: Mike Galbraith <[email protected]>

--- linux-2.6.16-mm1/kernel/sched.c-1.timewarp 2006-03-23 15:04:42.000000000 +0100
+++ linux-2.6.16-mm1/kernel/sched.c 2006-03-23 15:07:08.000000000 +0100
@@ -1418,7 +1418,7 @@

out_activate:
#endif /* CONFIG_SMP */
- if (old_state == TASK_UNINTERRUPTIBLE) {
+ if (old_state & TASK_UNINTERRUPTIBLE) {
rq->nr_uninterruptible--;
/*
* Tasks on involuntary sleep don't earn
@@ -3125,7 +3125,7 @@
unlikely(signal_pending(prev))))
prev->state = TASK_RUNNING;
else {
- if (prev->state == TASK_UNINTERRUPTIBLE)
+ if (prev->state & TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
deactivate_task(prev, rq);
}


2006-03-24 11:15:22

by Mike Galbraith

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

patch 3/6

This patch does various interactivity cleanups.

1. Removes the barrier between kernel threads and user tasks wrt
dynamic priority handling.

2. Removes the priority barrier for IO.

3. Treats TASK_INTERACTIVE as a transition point that all tasks must
stop at prior to being promoted further.

4. Moves division out of the fast path and into scheduler_tick().
While doing so, tightens timeslice accounting, since it's free, and is
the flip-side of the reason for having nanosecond sched_clock().

Signed-off-by: Mike Galbraith <[email protected]>

--- linux-2.6.16-mm1/include/linux/sched.h.org 2006-03-19 05:42:00.000000000 +0100
+++ linux-2.6.16-mm1/include/linux/sched.h 2006-03-19 08:40:34.000000000 +0100
@@ -741,6 +741,7 @@
unsigned long policy;
cpumask_t cpus_allowed;
unsigned int time_slice, first_time_slice;
+ long slice_time_ns;

#ifdef CONFIG_SCHEDSTATS
struct sched_info sched_info;
--- linux-2.6.16-mm1/kernel/sched.c-2.uninterruptible 2006-03-23 15:07:26.000000000 +0100
+++ linux-2.6.16-mm1/kernel/sched.c 2006-03-23 15:17:26.000000000 +0100
@@ -99,6 +99,10 @@
#define MAX_SLEEP_AVG (DEF_TIMESLICE * MAX_BONUS)
#define STARVATION_LIMIT (MAX_SLEEP_AVG)
#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG))
+#define NS_MAX_SLEEP_AVG_PCNT (NS_MAX_SLEEP_AVG / 100)
+#define PCNT_PER_DYNPRIO (100 / MAX_BONUS)
+#define NS_PER_DYNPRIO (PCNT_PER_DYNPRIO * NS_MAX_SLEEP_AVG_PCNT)
+#define NS_TICK (1000000000 / HZ)

/*
* If a task is 'interactive' then we reinsert it in the active
@@ -153,9 +157,25 @@
#define TASK_INTERACTIVE(p) \
((p)->prio <= (p)->static_prio - DELTA(p))

-#define INTERACTIVE_SLEEP(p) \
- (JIFFIES_TO_NS(MAX_SLEEP_AVG * \
- (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
+#define SLEEP_AVG_DIVISOR(p) (1 + CURRENT_BONUS(p))
+
+#define INTERACTIVE_SLEEP_AVG(p) \
+ (min(JIFFIES_TO_NS(MAX_SLEEP_AVG * (MAX_BONUS / 2 + DELTA(p)) / \
+ MAX_BONUS), NS_MAX_SLEEP_AVG))
+
+/*
+ * Returns whether a task has been asleep long enough to be considered idle.
+ * The metric is whether this quantity of sleep would promote the task more
+ * than one priority beyond marginally interactive.
+ */
+static int task_interactive_idle(task_t *p, unsigned long sleep_time)
+{
+ unsigned long ceiling = (CURRENT_BONUS(p) + 2) * NS_PER_DYNPRIO;
+
+ if (p->sleep_avg + sleep_time < ceiling)
+ return 0;
+ return p->sleep_avg + sleep_time >= INTERACTIVE_SLEEP_AVG(p);
+}

#define TASK_PREEMPTS_CURR(p, rq) \
((p)->prio < (rq)->curr->prio)
@@ -185,6 +205,11 @@
return static_prio_timeslice(p->static_prio);
}

+static inline unsigned int task_timeslice_ns(task_t *p)
+{
+ return static_prio_timeslice(p->static_prio) * NS_TICK;
+}
+
#define task_hot(p, now, sd) ((long long) ((now) - (p)->last_ran) \
< (long long) (sd)->cache_hot_time)

@@ -826,35 +851,32 @@

if (likely(sleep_time > 0)) {
/*
- * User tasks that sleep a long time are categorised as
- * idle. They will only have their sleep_avg increased to a
+ * Tasks that sleep a long time are categorised as idle.
+ * They will only have their sleep_avg increased to a
* level that makes them just interactive priority to stay
* active yet prevent them suddenly becoming cpu hogs and
* starving other processes.
*/
- if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) {
- unsigned long ceiling;
+ if (task_interactive_idle(p, sleep_time)) {
+ unsigned long ceiling = INTERACTIVE_SLEEP_AVG(p);

- ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
- DEF_TIMESLICE);
- if (p->sleep_avg < ceiling)
- p->sleep_avg = ceiling;
- } else {
/*
- * Tasks waking from uninterruptible sleep are
- * limited in their sleep_avg rise as they
- * are likely to be waiting on I/O
+ * Promote previously interactive task.
*/
- if (p->sleep_type == SLEEP_NONINTERACTIVE && p->mm) {
- if (p->sleep_avg >= INTERACTIVE_SLEEP(p))
- sleep_time = 0;
- else if (p->sleep_avg + sleep_time >=
- INTERACTIVE_SLEEP(p)) {
- p->sleep_avg = INTERACTIVE_SLEEP(p);
- sleep_time = 0;
- }
+ if (p->sleep_avg > ceiling) {
+ ceiling = p->sleep_avg / NS_PER_DYNPRIO;
+ if (ceiling < MAX_BONUS)
+ ceiling++;
+ ceiling *= NS_PER_DYNPRIO;
+ } else {
+ ceiling += p->slice_time_ns >> 2;
+ if (ceiling > NS_MAX_SLEEP_AVG)
+ ceiling = NS_MAX_SLEEP_AVG;
}

+ if (p->sleep_avg < ceiling)
+ p->sleep_avg = ceiling;
+ } else {
/*
* This code gives a bonus to interactive tasks.
*
@@ -1510,21 +1532,23 @@
* resulting in more scheduling fairness.
*/
local_irq_disable();
- p->time_slice = (current->time_slice + 1) >> 1;
+ p->slice_time_ns = current->slice_time_ns >> 1;
+ if (unlikely(p->slice_time_ns < NS_TICK))
+ p->slice_time_ns = NS_TICK;
+ p->time_slice = NS_TO_JIFFIES(p->slice_time_ns);
/*
* The remainder of the first timeslice might be recovered by
* the parent if the child exits early enough.
*/
p->first_time_slice = 1;
- current->time_slice >>= 1;
p->timestamp = sched_clock();
- if (unlikely(!current->time_slice)) {
+ current->slice_time_ns >>= 1;
+ if (unlikely(current->slice_time_ns < NS_TICK)) {
/*
* This case is rare, it happens when the parent has only
* a single jiffy left from its timeslice. Taking the
* runqueue lock is not a problem.
*/
- current->time_slice = 1;
scheduler_tick();
}
local_irq_enable();
@@ -1632,9 +1656,9 @@
*/
rq = task_rq_lock(p->parent, &flags);
if (p->first_time_slice && task_cpu(p) == task_cpu(p->parent)) {
- p->parent->time_slice += p->time_slice;
- if (unlikely(p->parent->time_slice > task_timeslice(p)))
- p->parent->time_slice = task_timeslice(p);
+ p->parent->slice_time_ns += p->slice_time_ns;
+ if (unlikely(p->parent->slice_time_ns > task_timeslice_ns(p)))
+ p->parent->slice_time_ns = task_timeslice_ns(p);
}
if (p->sleep_avg < p->parent->sleep_avg)
p->parent->sleep_avg = p->parent->sleep_avg /
@@ -2653,7 +2677,9 @@
unsigned long long now)
{
unsigned long long last = max(p->timestamp, rq->timestamp_last_tick);
- p->sched_time += now - last;
+ long run_time = now - last;
+ p->sched_time += run_time;
+ p->slice_time_ns -= run_time;
}

/*
@@ -2781,13 +2807,21 @@
* priority until it either goes to sleep or uses up its
* timeslice. This makes it possible for interactive tasks
* to use up their timeslices at their highest priority levels.
- */
+ *
+ * We don't want to update every task at each tick, so we
+ * update a task's time_slice according to how much cpu
+ * time it has received since it was last ticked.
+ */
+ if (p->slice_time_ns < NS_TICK)
+ p->time_slice = 0;
+ else p->time_slice = NS_TO_JIFFIES(p->slice_time_ns);
if (rt_task(p)) {
/*
* RR tasks need a special form of timeslice management.
* FIFO tasks have no timeslices.
*/
- if ((p->policy == SCHED_RR) && !--p->time_slice) {
+ if ((p->policy == SCHED_RR) && !p->time_slice) {
+ p->slice_time_ns = task_timeslice_ns(p);
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;
set_tsk_need_resched(p);
@@ -2797,11 +2831,22 @@
}
goto out_unlock;
}
- if (!--p->time_slice) {
+ if (!p->time_slice) {
+ long slice_time_ns = task_timeslice_ns(p);
+ long run_time = (-1 * p->slice_time_ns) + slice_time_ns;
dequeue_task(p, rq->active);
set_tsk_need_resched(p);
- p->prio = effective_prio(p);
+ p->slice_time_ns = slice_time_ns;
p->time_slice = task_timeslice(p);
+ /*
+ * Tasks are charged proportionately less run_time at high
+ * sleep_avg to delay them losing their interactive status
+ */
+ run_time /= SLEEP_AVG_DIVISOR(p);
+ if (p->sleep_avg >= run_time)
+ p->sleep_avg -= run_time;
+ else p->sleep_avg = 0;
+ p->prio = effective_prio(p);
p->first_time_slice = 0;

if (!rq->expired_timestamp)
@@ -3066,7 +3111,6 @@
prio_array_t *array;
struct list_head *queue;
unsigned long long now;
- unsigned long run_time;
int cpu, idx, new_prio;

/*
@@ -3100,19 +3144,6 @@

schedstat_inc(rq, sched_cnt);
now = sched_clock();
- if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) {
- run_time = now - prev->timestamp;
- if (unlikely((long long)(now - prev->timestamp) < 0))
- run_time = 0;
- } else
- run_time = NS_MAX_SLEEP_AVG;
-
- /*
- * Tasks charged proportionately less run_time at high sleep_avg to
- * delay them losing their interactive status
- */
- run_time /= (CURRENT_BONUS(prev) ? : 1);
-
spin_lock_irq(&rq->lock);

if (unlikely(prev->flags & PF_DEAD))
@@ -3186,7 +3217,6 @@
if (next->sleep_type == SLEEP_INTERACTIVE)
delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;

- array = next->array;
new_prio = recalc_task_prio(next, next->timestamp + delta);

if (unlikely(next->prio != new_prio)) {
@@ -3206,9 +3236,6 @@

update_cpu_clock(prev, rq, now);

- prev->sleep_avg -= run_time;
- if ((long)prev->sleep_avg <= 0)
- prev->sleep_avg = 0;
prev->timestamp = prev->last_ran = now;

sched_info_switch(prev, next);


2006-03-24 11:20:55

by Mike Galbraith

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

patch 4/6

This patch implements the throttling.

Throttling is done by computing a slice_avg, which is the upper limit
of what a task's sleep_avg may sanely be. When a task begins to
consume more CPU than its sleep_avg indicates it should, the task will
be throttled. A task which conforms to expectations can save credit for
later use, which allows interactive tasks to do a burst of activity
without being throttled. When their reserve is exhausted, however,
that's the end of high usage at high priority.
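
To make the credit arithmetic less cryptic, a quick back-of-the-envelope
sketch; HZ and MAX_BONUS are assumptions here (HZ=250, MAX_BONUS=10), the
rest comes straight from the patch. A new task starts with
throttle = jiffies - C2 + C1, i.e. C1 jiffies of headroom before
credit_exhausted(p, C2) fires:

#include <stdio.h>

/* assumed values, for illustration only */
#define HZ              250
#define MAX_BONUS       10

/* tunables and derived constants as defined in the patch */
#define CREDIT_C1       10
#define CREDIT_C2       14400
#define C1      (CREDIT_C1 * MAX_BONUS * HZ)
#define C2      (CREDIT_C2 * MAX_BONUS * HZ + C1)
#define C3      (MAX_BONUS * C2)

int main(void)
{
        printf("C1 = %d jiffies (%d s)\n", C1, C1 / HZ);  /* 25000, 100 s        */
        printf("C2 = %d jiffies (%d s)\n", C2, C2 / HZ);  /* 36025000, 144100 s  */
        printf("C3 = %d jiffies (%d s)\n", C3, C3 / HZ);  /* 360250000, 1441000 s */
        return 0;
}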

Signed-off-by: Mike Galbraith <[email protected]>

--- linux-2.6.16-mm1/include/linux/sched.h.org 2006-02-28 06:11:17.000000000 +0100
+++ linux-2.6.16-mm1/include/linux/sched.h 2006-02-28 06:11:41.000000000 +0100
@@ -733,14 +733,14 @@
unsigned short ioprio;
unsigned int btrace_seq;

- unsigned long sleep_avg;
+ unsigned long sleep_avg, last_slice, throttle;
unsigned long long timestamp, last_ran;
unsigned long long sched_time; /* sched_clock time spent running */
enum sleep_type sleep_type;

unsigned long policy;
cpumask_t cpus_allowed;
- unsigned int time_slice, first_time_slice;
+ unsigned int time_slice, slice_info;
long slice_time_ns;

#ifdef CONFIG_SCHEDSTATS
--- linux-2.6.16-mm1/kernel/sched.c-3.interactivity 2006-03-23 15:18:48.000000000 +0100
+++ linux-2.6.16-mm1/kernel/sched.c 2006-03-23 16:28:04.000000000 +0100
@@ -80,6 +80,21 @@
#define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / HZ))
#define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ))

+#if (BITS_PER_LONG < 64)
+#define JIFFIES_TO_NS64(TIME) \
+ ((unsigned long long)(TIME) * ((unsigned long) (1000000000 / HZ)))
+
+#define NS64_TO_JIFFIES(TIME) \
+ ((((unsigned long long)((TIME)) >> BITS_PER_LONG) * \
+ (1 + NS_TO_JIFFIES(~0UL))) + NS_TO_JIFFIES((unsigned long)(TIME)))
+#else /* BITS_PER_LONG < 64 */
+
+#define NS64_TO_JIFFIES(TIME) NS_TO_JIFFIES(TIME)
+#define JIFFIES_TO_NS64(TIME) JIFFIES_TO_NS(TIME)
+
+#endif /* BITS_PER_LONG < 64 */
+
+
/*
* These are the 'tuning knobs' of the scheduler:
*
@@ -177,6 +192,115 @@
return p->sleep_avg + sleep_time >= INTERACTIVE_SLEEP_AVG(p);
}

+/*
+ * Interactive boost can lead to starvation if the decision to
+ * boost a task turns out to be a bad one. To combat this, we
+ * compute the sane upper limit for cpu usage 'slice_avg' based
+ * upon a task's sleep_avg, and use this information combined
+ * with a timer to determine when intervention is required.
+ *
+ * When a task is behaving as its sleep_avg indicates it should,
+ * its throttle is moved forward; otherwise it will time out, and
+ * the task's priority will be lowered.
+ *
+ * Throttling tunables.
+ *
+ * CREDIT_C1: The amount of cpu time in seconds that a new task
+ * will run completely free, ie the head start a task
+ * has before it has to push it's timer forward to avoid
+ * being throttled. Each conforming slice thereafter
+ * increases it's stored credit, and vice versa.
+ *
+ * CREDIT_C2: The maximum amount of CPU time in seconds a task
+ * can store for later use. When a task has no stored
+ * credit left, now is time C2. Tasks begin life with
+ * C1 seconds credit, ie C2 is C1 seconds in front of
+ * them, and the 'buffer' will grow in front of them
+ * if they perform in a conformant manner. The maximum
+ * credit that fits in 32 bits jiffies is 42949 seconds.
+ */
+
+#define CREDIT_C1 10
+#define CREDIT_C2 14400
+
+#define C1 (CREDIT_C1 * MAX_BONUS * HZ)
+#define C2 (CREDIT_C2 * MAX_BONUS * HZ + C1)
+#define C3 (MAX_BONUS * C2)
+
+#define credit_exhausted(p, credit) \
+ (time_after_eq(jiffies, (p)->throttle + (credit)))
+
+/*
+ * Masks for p->slice_info, formerly p->first_time_slice.
+ * SLICE_FTS: 0x80000000 Task is in its first ever timeslice.
+ * SLICE_NEW: 0x40000000 Slice refreshed.
+ * SLICE_SPA: 0x3FFE0000 Spare bits.
+ * SLICE_LTS: 0x0001FF80 Last time slice
+ * SLICE_AVG: 0x0000007F Task slice_avg stored as percentage.
+ */
+#define SLICE_AVG_BITS 7
+#define SLICE_LTS_BITS 10
+#define SLICE_SPA_BITS 13
+#define SLICE_NEW_BITS 1
+#define SLICE_FTS_BITS 1
+
+#define SLICE_AVG_SHIFT 0
+#define SLICE_LTS_SHIFT (SLICE_AVG_SHIFT + SLICE_AVG_BITS)
+#define SLICE_SPA_SHIFT (SLICE_LTS_SHIFT + SLICE_LTS_BITS)
+#define SLICE_NEW_SHIFT (SLICE_SPA_SHIFT + SLICE_SPA_BITS)
+#define SLICE_FTS_SHIFT (SLICE_NEW_SHIFT + SLICE_NEW_BITS)
+
+#define INFO_MASK(x) ((1U << (x))-1)
+#define SLICE_AVG_MASK (INFO_MASK(SLICE_AVG_BITS) << SLICE_AVG_SHIFT)
+#define SLICE_LTS_MASK (INFO_MASK(SLICE_LTS_BITS) << SLICE_LTS_SHIFT)
+#define SLICE_SPA_MASK (INFO_MASK(SLICE_SPA_BITS) << SLICE_SPA_SHIFT)
+#define SLICE_NEW_MASK (INFO_MASK(SLICE_NEW_BITS) << SLICE_NEW_SHIFT)
+#define SLICE_FTS_MASK (INFO_MASK(SLICE_FTS_BITS) << SLICE_FTS_SHIFT)
+
+/*
+ * p->slice_info access macros.
+ */
+#define first_time_slice(p) ((p)->slice_info & SLICE_FTS_MASK)
+#define set_first_time_slice(p) ((p)->slice_info |= SLICE_FTS_MASK)
+#define clr_first_time_slice(p) ((p)->slice_info &= ~SLICE_FTS_MASK)
+
+#define slice_is_new(p) ((p)->slice_info & SLICE_NEW_MASK)
+#define set_slice_is_new(p) ((p)->slice_info |= SLICE_NEW_MASK)
+#define clr_slice_is_new(p) ((p)->slice_info &= ~SLICE_NEW_MASK)
+
+#define last_slice(p) (((p)->slice_info & SLICE_LTS_MASK) >> SLICE_LTS_SHIFT)
+#define set_last_slice(p, n) ((p)->slice_info = (((p)->slice_info & \
+ ~SLICE_LTS_MASK) | (((n) << SLICE_LTS_SHIFT) & SLICE_LTS_MASK)))
+
+#define NS_SLEEP_AVG_PCNT (NS_MAX_SLEEP_AVG / 100)
+
+#define slice_avg(p) ((typeof((p)->sleep_avg)) \
+ ((((p)->slice_info & SLICE_AVG_MASK) >> SLICE_AVG_SHIFT) * \
+ NS_SLEEP_AVG_PCNT))
+#define set_slice_avg(p, n) ((p)->slice_info = (((p)->slice_info & \
+ ~SLICE_AVG_MASK) | ((((n) / NS_SLEEP_AVG_PCNT) \
+ << SLICE_AVG_SHIFT) & SLICE_AVG_MASK)))
+#define slice_avg_raw(p) \
+ (((p)->slice_info & SLICE_AVG_MASK) >> SLICE_AVG_SHIFT)
+#define set_slice_avg_raw(p, n) ((p)->slice_info = (((p)->slice_info & \
+ ~SLICE_AVG_MASK) | (((n) << SLICE_AVG_SHIFT) & SLICE_AVG_MASK)))
+
+/*
+ * cpu usage macros.
+ */
+#define cpu_avg(p) \
+ (100 - slice_avg_raw(p))
+
+#define cpu_max(p) \
+ (100 - ((p)->sleep_avg / NS_SLEEP_AVG_PCNT))
+
+#define time_this_slice(p) \
+ (jiffies - (p)->last_slice)
+
+#define cpu_this_slice(p) \
+ (100 * last_slice(p) / max((unsigned) time_this_slice(p), \
+ (unsigned) last_slice(p)))
+
#define TASK_PREEMPTS_CURR(p, rq) \
((p)->prio < (rq)->curr->prio)

@@ -851,13 +975,20 @@

if (likely(sleep_time > 0)) {
/*
+ * Update throttle position.
+ */
+ p->throttle += NS64_TO_JIFFIES(__sleep_time);
+ if (time_before(jiffies, p->throttle))
+ p->throttle = jiffies;
+
+ /*
* Tasks that sleep a long time are categorised as idle.
* They will only have their sleep_avg increased to a
* level that makes them just interactive priority to stay
* active yet prevent them suddenly becoming cpu hogs and
* starving other processes.
*/
- if (task_interactive_idle(p, sleep_time)) {
+ if (C2 && task_interactive_idle(p, sleep_time)) {
unsigned long ceiling = INTERACTIVE_SLEEP_AVG(p);

/*
@@ -912,6 +1043,7 @@
runqueue_t *this_rq = this_rq();
comp = (now - this_rq->timestamp_last_tick)
+ rq->timestamp_last_tick;
+
}
#endif

@@ -1532,16 +1664,28 @@
* resulting in more scheduling fairness.
*/
local_irq_disable();
- p->slice_time_ns = current->slice_time_ns >> 1;
- if (unlikely(p->slice_time_ns < NS_TICK))
- p->slice_time_ns = NS_TICK;
+ p->slice_time_ns = (current->slice_time_ns + NS_TICK) >> 1;
p->time_slice = NS_TO_JIFFIES(p->slice_time_ns);
+ if (unlikely(p->slice_time_ns < NS_TICK)) {
+ p->slice_time_ns = NS_TICK;
+ p->time_slice = 1;
+ }
/*
* The remainder of the first timeslice might be recovered by
* the parent if the child exits early enough.
*/
- p->first_time_slice = 1;
+ set_first_time_slice(p);
p->timestamp = sched_clock();
+
+ /*
+ * Set up slice_info for the child.
+ */
+ set_slice_avg(p, p->sleep_avg);
+ set_last_slice(p, p->time_slice);
+ set_slice_is_new(p);
+ p->last_slice = jiffies;
+ p->throttle = jiffies - C2 + C1;
+
current->slice_time_ns >>= 1;
if (unlikely(current->slice_time_ns < NS_TICK)) {
/*
@@ -1655,7 +1799,7 @@
* the sleep_avg of the parent as well.
*/
rq = task_rq_lock(p->parent, &flags);
- if (p->first_time_slice && task_cpu(p) == task_cpu(p->parent)) {
+ if (first_time_slice(p) && task_cpu(p) == task_cpu(p->parent)) {
p->parent->slice_time_ns += p->slice_time_ns;
if (unlikely(p->parent->slice_time_ns > task_timeslice_ns(p)))
p->parent->slice_time_ns = task_timeslice_ns(p);
@@ -2771,6 +2915,89 @@
}

/*
+ * Refresh timeslice and associated slice information.
+ * @p: the process to refresh.
+ */
+static void refresh_timeslice(task_t *p)
+{
+ unsigned long slice_time = jiffies - p->last_slice;
+ unsigned int slice = last_slice(p);
+ unsigned int slice_avg, cpu, idle;
+ long run_time = -1 * p->slice_time_ns;
+ int w = MAX_BONUS, delta, bonus;
+
+ /*
+ * Update time_slice.
+ */
+ p->slice_time_ns = task_timeslice_ns(p);
+ p->time_slice = task_timeslice(p);
+ set_last_slice(p, p->time_slice);
+
+ /*
+ * Update sleep_avg.
+ *
+ * Tasks charged proportionately less run_time at high
+ * sleep_avg to delay them losing their interactive status
+ */
+ run_time += JIFFIES_TO_NS(slice);
+ if (C2)
+ run_time /= SLEEP_AVG_DIVISOR(p);
+ if (p->sleep_avg >= run_time)
+ p->sleep_avg -= run_time;
+ else p->sleep_avg = 0;
+
+ /*
+ * Update slice_avg.
+ */
+ slice_avg = slice_avg_raw(p);
+ cpu = cpu_this_slice(p);
+ idle = 100 - cpu;
+
+ delta = max(slice_avg, idle) - min(slice_avg, idle);
+ w = 1 + (delta / w);
+ slice_avg = (w * slice_avg + idle) / (w + 1);
+ set_slice_avg_raw(p, slice_avg);
+
+ /*
+ * If we've hit the timeout, we aren't draining enough sleep_avg
+ * to catch up with the task's cpu usage. Up the ante to bring
+ * the task back toward balance. This is important, because it
+ * allows interactive tasks to push their throttle back enough
+ * that they can both sustain, and rapidly recover from throttling
+ * instead of descending into C3.
+ */
+ if (credit_exhausted(p, C2) && p->sleep_avg > slice_avg(p)) {
+ unsigned long run_time = p->sleep_avg - slice_avg(p);
+ run_time /= w;
+ if (p->sleep_avg >= run_time)
+ p->sleep_avg -= run_time;
+ }
+
+ /*
+ * Update throttle position.
+ */
+ if (cpu < cpu_max(p) + PCNT_PER_DYNPRIO || !credit_exhausted(p, C1)) {
+ bonus = idle * PCNT_PER_DYNPRIO / 100;
+ p->throttle += (slice_time - slice) * bonus;
+ } else if (cpu >= cpu_max(p) + PCNT_PER_DYNPRIO) {
+ bonus = (cpu - cpu_max(p)) / PCNT_PER_DYNPRIO;
+ p->throttle -= slice_time * bonus;
+ }
+
+ if (time_before(jiffies, p->throttle))
+ p->throttle = jiffies;
+ else if (credit_exhausted(p, C3))
+ p->throttle = jiffies - C3;
+
+ /*
+ * And finally, stamp and flag the new slice.
+ */
+ clr_first_time_slice(p);
+ set_slice_is_new(p);
+ p->last_slice = jiffies;
+}
+
+/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*
@@ -2821,9 +3048,7 @@
* FIFO tasks have no timeslices.
*/
if ((p->policy == SCHED_RR) && !p->time_slice) {
- p->slice_time_ns = task_timeslice_ns(p);
- p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;
+ refresh_timeslice(p);
set_tsk_need_resched(p);

/* put it at the end of the queue: */
@@ -2832,22 +3057,10 @@
goto out_unlock;
}
if (!p->time_slice) {
- long slice_time_ns = task_timeslice_ns(p);
- long run_time = (-1 * p->slice_time_ns) + slice_time_ns;
dequeue_task(p, rq->active);
set_tsk_need_resched(p);
- p->slice_time_ns = slice_time_ns;
- p->time_slice = task_timeslice(p);
- /*
- * Tasks are charged proportionately less run_time at high
- * sleep_avg to delay them losing their interactive status
- */
- run_time /= SLEEP_AVG_DIVISOR(p);
- if (p->sleep_avg >= run_time)
- p->sleep_avg -= run_time;
- else p->sleep_avg = 0;
+ refresh_timeslice(p);
p->prio = effective_prio(p);
- p->first_time_slice = 0;

if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
@@ -3238,6 +3451,14 @@

prev->timestamp = prev->last_ran = now;

+ /*
+ * Tag start of execution of a new timeslice.
+ */
+ if (unlikely(slice_is_new(next))) {
+ next->last_slice = jiffies;
+ clr_slice_is_new(next);
+ }
+
sched_info_switch(prev, next);
if (likely(prev != next)) {
next->timestamp = now;


2006-03-24 11:24:11

by Mike Galbraith

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

patch 5/6

This patch tightens timeslice accounting the rest of the way, such that
a task which has received more than its slice due to missing the timer
interrupt will have the excess deducted from its next slice.
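
A rough sketch of the carry-over arithmetic (the numbers are made up, only
the += comes from the patch): slice_time_ns holds the signed leftover when
the slice ends, so an overrun simply shortens the next slice:

#include <stdio.h>

int main(void)
{
        long nominal_ns  = 100000000L;  /* assumed 100ms task_timeslice_ns()      */
        long leftover_ns =  -6000000L;  /* task ran 6ms over due to missed ticks  */

        /* refresh_timeslice() now does p->slice_time_ns += task_timeslice_ns(p) */
        long next_ns = leftover_ns + nominal_ns;

        printf("next slice: %ld ms\n", next_ns / 1000000L);     /* 94 ms */
        return 0;
}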

signed-off-by: Mike Galbraith <[email protected]>

--- linux-2.6.16-mm1/kernel/sched.c-4.throttle 2006-03-24 09:36:08.000000000 +0100
+++ linux-2.6.16-mm1/kernel/sched.c 2006-03-24 09:40:33.000000000 +0100
@@ -2924,13 +2924,28 @@
unsigned int slice = last_slice(p);
unsigned int slice_avg, cpu, idle;
long run_time = -1 * p->slice_time_ns;
+ long slice_time_ns = task_timeslice_ns(p);
int w = MAX_BONUS, delta, bonus;

/*
- * Update time_slice.
+ * Update time_slice. Account for unused fragment,
+ * or excess time received due to missed tick.
*/
- p->slice_time_ns = task_timeslice_ns(p);
- p->time_slice = task_timeslice(p);
+ p->slice_time_ns += slice_time_ns;
+ /*
+ * Not common, but this does happen on SMP systems.
+ * Timeslice theft of this magnitude has never been
+ * observed in the wild, so assume that this is BS,
+ * and give the poor task its full slice. Theory:
+ * mostly idle task migrates between CPUs numerous
+ * times during its slice, timestamp rounding leads
+ * to wildly inaccurate calculation. Rounding has
+ * maximum effect on those who stretch their slice,
+ * but is also fairly meaningless, so ignore it.
+ */
+ if (unlikely(p->slice_time_ns < NS_TICK))
+ p->slice_time_ns = slice_time_ns;
+ p->time_slice = NS_TO_JIFFIES(p->slice_time_ns);
set_last_slice(p, p->time_slice);

/*


2006-03-24 11:28:06

by Mike Galbraith

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

patch 6/6

This patch exports throttling tuning knobs to userland. Should be
useful for testing, and is easily discarded should my throttling stuff
make it into mm and survive the experience.
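
Assuming the kern_table entries below build as-is, the knobs should show up
as /proc/sys/kernel/credit_c1 and /proc/sys/kernel/credit_c2; a minimal
sketch of adjusting one from userland (needs root and a kernel carrying
this patch):

#include <stdio.h>

int main(void)
{
        /* path inferred from the procname in kern_table */
        FILE *f = fopen("/proc/sys/kernel/credit_c1", "w");

        if (!f)
                return 1;       /* knob not present, or insufficient permissions */
        fprintf(f, "20\n");     /* raise the free-credit period to 20 seconds */
        fclose(f);
        return 0;
}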

Signed-off-by: Mike Galbraith <[email protected]>

--- linux-2.6.16-mm1/include/linux/sysctl.h.org 2006-03-24 09:43:37.000000000 +0100
+++ linux-2.6.16-mm1/include/linux/sysctl.h 2006-03-24 09:47:49.000000000 +0100
@@ -148,6 +148,8 @@
KERN_SPIN_RETRY=70, /* int: number of spinlock retries */
KERN_ACPI_VIDEO_FLAGS=71, /* int: flags for setting up video after ACPI sleep */
KERN_IA64_UNALIGNED=72, /* int: ia64 unaligned userland trap enable */
+ KERN_SCHED_THROTTLE1=73, /* int: throttling credit period 1 in secs */
+ KERN_SCHED_THROTTLE2=74, /* int: throttling credit period 2 in secs */
};


--- linux-2.6.16-mm1/kernel/sysctl.c.org 2006-03-24 09:43:14.000000000 +0100
+++ linux-2.6.16-mm1/kernel/sysctl.c 2006-03-24 09:47:49.000000000 +0100
@@ -73,6 +73,9 @@
extern int pid_max_min, pid_max_max;
extern int sysctl_drop_caches;
extern int percpu_pagelist_fraction;
+extern int credit_c1;
+extern int credit_c2;
+extern int credit_max;

#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
int unknown_nmi_panic;
@@ -230,6 +233,11 @@
{ .ctl_name = 0 }
};

+/* Constants for minimum and maximum testing in vm_table and
+ * kern_table. We use these as one-element integer vectors. */
+static int zero;
+static int one_hundred = 100;
+
static ctl_table kern_table[] = {
{
.ctl_name = KERN_OSTYPE,
@@ -684,15 +692,31 @@
.proc_handler = &proc_dointvec,
},
#endif
+ {
+ .ctl_name = KERN_SCHED_THROTTLE1,
+ .procname = "credit_c1",
+ .data = &credit_c1,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &credit_max,
+ },
+ {
+ .ctl_name = KERN_SCHED_THROTTLE2,
+ .procname = "credit_c2",
+ .data = &credit_c2,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &credit_max,
+ },
{ .ctl_name = 0 }
};

-/* Constants for minimum and maximum testing in vm_table.
- We use these as one-element integer vectors. */
-static int zero;
-static int one_hundred = 100;
-
-
static ctl_table vm_table[] = {
{
.ctl_name = VM_OVERCOMMIT_MEMORY,
--- linux-2.6.16-mm1/kernel/sched.c-5.slice_accounting 2006-03-24 09:40:33.000000000 +0100
+++ linux-2.6.16-mm1/kernel/sched.c 2006-03-24 09:51:06.000000000 +0100
@@ -220,11 +220,12 @@
* credit that fits in 32 bits jiffies is 42949 seconds.
*/

-#define CREDIT_C1 10
-#define CREDIT_C2 14400
+int credit_c1 = 10;
+int credit_c2 = 14400;
+int credit_max = 42949;

-#define C1 (CREDIT_C1 * MAX_BONUS * HZ)
-#define C2 (CREDIT_C2 * MAX_BONUS * HZ + C1)
+#define C1 (credit_c1 * MAX_BONUS * HZ)
+#define C2 (credit_c2 * MAX_BONUS * HZ + C1)
#define C3 (MAX_BONUS * C2)

#define credit_exhausted(p, credit) \


2006-03-24 11:38:13

by Con Kolivas

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

On Friday 24 March 2006 22:03, Mike Galbraith wrote:
> Greetings,

/me waves

> Ignore timewarps caused by SMP timestamp rounding. Also, don't stamp a
> task with a computed timestamp, stamp with the already called clock.

Looks good. Actually, now < p->timestamp is not only going to happen on SMP.
Every once in a while sched_clock seems to return a value that appears to
have been in the past (I believe Peter has instrumented this).

> Signed-off-by: Mike Galbraith <[email protected]>

> + __sleep_time = 0ULL;

I don't think the ULL is necessary.

> - unsigned long long now;
> + unsigned long long now, comp;
>
> - now = sched_clock();
> + now = comp = sched_clock();
> #ifdef CONFIG_SMP
> if (!local) {
> /* Compensate for drifting sched_clock */
> runqueue_t *this_rq = this_rq();
> - now = (now - this_rq->timestamp_last_tick)
> + comp = (now - this_rq->timestamp_last_tick)
> + rq->timestamp_last_tick;
> }
> #endif
>
> if (!rt_task(p))
> - p->prio = recalc_task_prio(p, now);
> + p->prio = recalc_task_prio(p, comp);

Seems wasteful of a very expensive (on 32bit) unsigned long long on
uniprocessor builds.

Cheers,
Con

2006-03-24 11:39:27

by Con Kolivas

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

On Friday 24 March 2006 22:07, Mike Galbraith wrote:
> patch 2/6
>
> This patch just fixes a bug waiting for a place to happen. If anyone
> ever decides to use TASK_NONINTERACTIVE along with TASK_UNINTERRUPTIBLE,
> bad things will happen.

> Signed-off-by: Mike Galbraith <[email protected]>

Acked-by: Con Kolivas <[email protected]>

2006-03-24 11:54:35

by Con Kolivas

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

On Friday 24 March 2006 22:21, Mike Galbraith wrote:
> patch 4/6
>
> This patch implements the throttling.
>
> Throttling is done by computing a slice_avg, which is the upper limit
> of what a task's sleep_avg may sanely be. When a task begins to
> consume more CPU than its sleep_avg indicates it should, the task will
> be throttled. A task which conforms to expectations can save credit for
> later use, which allows interactive tasks to do a burst of activity
> without being throttled. When their reserve is exhausted, however,
> that's the end of high usage at high priority.

Looks ok. The description of credit still sounds cryptic.

> +#define C1 (CREDIT_C1 * MAX_BONUS * HZ)
> +#define C2 (CREDIT_C2 * MAX_BONUS * HZ + C1)
> +#define C3 (MAX_BONUS * C2)

Macro names that short are asking for trouble...

...
else looks good. After we've cleaned out all the sched patches from -mm it
would be nice to get this work in. The values of C1 and particularly C2
_sound_ large but may well be appropriate since you've been hard at work on
this. I'll have to have a play for myself (if I ever find spare cycles on my
minuscule selection of hardware) with them when they hit -mm.

Cheers,
Con

2006-03-24 11:56:30

by Con Kolivas

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

On Friday 24 March 2006 22:28, Mike Galbraith wrote:
> patch 6/6
>
> This patch exports throttling tuning knobs to userland. Should be
> useful for testing, and is easily discarded should my throttling stuff
> make it into mm and survive the experience.
>
> Signed-off-by: Mike Galbraith <[email protected]>

Ok.

Cheers,
Con

2006-03-24 11:56:00

by Con Kolivas

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

On Friday 24 March 2006 22:24, Mike Galbraith wrote:
> patch 5/6
>
> This patch tightens timeslice accounting the rest of the way, such that
> a task which has received more than its slice due to missing the timer
> interrupt will have the excess deducted from its next slice.
>
> signed-off-by: Mike Galbraith <[email protected]>

Looks ok.

Cheers,
Con

2006-03-24 11:57:16

by Con Kolivas

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

On Friday 24 March 2006 22:16, Mike Galbraith wrote:
> This patch does various interactivity cleanups.

I have trouble with this patch. By your own admission this patch does 4
different things which one patch shouldn't.

> 1. Removes the barrier between kernel threads and user tasks wrt
> dynamic priority handling.

This is a bad idea. Subjecting kernel threads to a priority ceiling because
they spend a long time idle is not the same as doing it to a user task that
may be an idle bomb. Most kernel threads do sleep for extended periods and will always
end up hitting this ceiling. That could lead to some difficult to understand
latencies in scheduling of kernel threads, even if they are nice -5 because
they'll expire very easily.

> 2. Removes the priority barrier for IO.

Bad again. This caused the biggest detriment on interbench numbers and is by
far the most palpable interactivity killer in linux. I/O hurts us lots and
this change will be very noticeable.

> 3. Treats TASK_INTERACTIVE as a transition point that all tasks must
> stop at prior to being promoted further.

Why? Makes no sense. You end up getting hiccups in the rise of priority of
tasks instead of it happening smoothly with sleep.

> 4. Moves division out of the fast path and into scheduler_tick().
> While doing so, tightens timeslice accounting, since it's free, and is
> the flip-side of the reason for having nanosecond sched_clock().

Seems fine.

Cheers,
Con

2006-03-24 12:20:25

by Mike Galbraith

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

On Fri, 2006-03-24 at 22:56 +1100, Con Kolivas wrote:
> On Friday 24 March 2006 22:16, Mike Galbraith wrote:
> > This patch does various interactivity cleanups.
>
> I have trouble with this patch. By your own admission this patch does 4
> different things which one patch shouldn't.

They're all part of the same thing, they're just cleanup.

> > 1. Removes the barrier between kernel threads and user tasks wrt
> > dynamic priority handling.
>
> This is a bad idea. Subjecting a priority ceiling to kernel threads because
> they spend a long time idle is not the same as a user task that may be an
> idle bomb.

Kernel threads don't make the transition, they're sitting there with a
fully loaded sleep_avg. You stop on the way up, sure, but once there, a
long sleep doesn't truncate you. Try it.

> Most kernel threads do sleep for extended periods and will always
> end up hitting this ceiling. That could lead to some difficult to understand
> latencies in scheduling of kernel threads, even if they are nice -5 because
> they'll expire very easily.

No, they won't. Furthermore, any kernel thread which cannot tolerate
dynamic priority semantics should not use them, they should be RT.

> > 2. Removes the priority barrier for IO.
>
> Bad again. This caused the biggest detriment on interbench numbers and is by
> far the most palpable interactivity killer in linux. I/O hurts us lots and
> this change will be very noticeable.

This barrier is artificial, and has been demonstrated to be harmful to
some loads. Being practical, if this is demonstrated to still cause
trouble, I'll happily re-introduce it with a follow-up patch.

> > 3. Treats TASK_INTERACTIVE as a transition point that all tasks must
> > stop at prior to being promoted further.
>
> Why? Makes no sense. You end up getting hiccups in the rise of priority of
> tasks instead of it happening smoothly with sleep.

Quite the opposite, it makes perfect sense. Taking the long sleeper to
the artificial barrier of prio 16 as stock does is the very reason that
thud totally destroys the interactive experience. I'd love to remove
this barrier too, and have 'purity', but OTOH, the interactive border
_is_ a transition point where the scheduler changes behavior. This is a
transition point in fact, so treating it as such is indeed correct.

-Mike

2006-03-24 12:34:47

by Con Kolivas

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

On Friday 24 March 2006 23:21, Mike Galbraith wrote:
> On Fri, 2006-03-24 at 22:56 +1100, Con Kolivas wrote:
> > On Friday 24 March 2006 22:16, Mike Galbraith wrote:
> > > This patch does various interactivity cleanups.
> >
> > I have trouble with this patch. By your own admission this patch does 4
> > different things which one patch shouldn't.
>
> They're all part of the same thing, they're just cleanup.

Cleanup? Something that changes behaviour is not cleanup.
>
> > > 1. Removes the barrier between kernel threads and user tasks wrt
> > > dynamic priority handling.
> >
> > This is a bad idea. Subjecting kernel threads to a priority ceiling
> > because they spend a long time idle is not the same as doing it to a user
> > task that may be an idle bomb.
>
> Kernel threads don't make the transition, they're sitting there with a
> fully loaded sleep_avg. You stop on the way up, sure, but once there, a
> long sleep doesn't truncate you. Try it.

Threads like kswapd do not have a fully loaded sleep_avg.

> > Most kernel threads do sleep for extended periods and will always
> > end up hitting this ceiling. That could lead to some difficult to
> > understand latencies in scheduling of kernel threads, even if they are
> > nice -5 because they'll expire very easily.
>
> No, they won't. Furthermore, any kernel thread which cannot tolerate
> dynamic priority semantics should not use them, they should be RT.

Very few kernel threads should require RT just to work as you just said, yet
you'll make it harder for them with your changed dynamic priority semantics.
We already know they are not going to be misbehaving userspace tasks and they
deserve simpler semantics than the full interactivity estimator gives.

> > > 2. Removes the priority barrier for IO.
> >
> > Bad again. This caused the biggest detriment on interbench numbers and is
> > by far the most palpable interactivity killer in linux. I/O hurts us lots
> > and this change will be very noticeable.
>
> This barrier is artificial, and has been demonstrated to be harmful to
> some loads.

Which?

> Being practical, if this is demonstrated to still cause
> trouble, I'll happily re-introduce it with a follow-up patch.

Trouble occurs months after this hits mainline since you'll get almost
no one testing it in -mm. It was put there because it hurt interactivity and
you're removing it because you don't like that we have to put a special case
there?

> > > 3. Treats TASK_INTERACTIVE as a transition point that all tasks must
> > > stop at prior to being promoted further.
> >
> > Why? Makes no sense. You end up getting hiccups in the rise of priority
> > of tasks instead of it happening smoothly with sleep.
>
> Quite the opposite, it makes perfect sense. Taking the long sleeper to
> the artificial barrier of prio 16 as stock does is the very reason that
> thud totally destroys the interactive experience. I'd love to remove
> this barrier too, and have 'purity', but OTOH, the interactive border
> _is_ a transition point where the scheduler changes behavior. This is a
> transition point in fact, so treating it as such is indeed correct.

And what does it actually achieve making this change is my question?

I feel this discussion may degenerate beyond this point. Should we say to
agree to disagree at this point? I don't like these changes.

Cheers,
Con

2006-03-24 13:01:39

by Mike Galbraith

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

On Fri, 2006-03-24 at 23:34 +1100, Con Kolivas wrote:
> On Friday 24 March 2006 23:21, Mike Galbraith wrote:
> > On Fri, 2006-03-24 at 22:56 +1100, Con Kolivas wrote:
> > > On Friday 24 March 2006 22:16, Mike Galbraith wrote:
> > > > This patch does various interactivity cleanups.
> > >
> > > I have trouble with this patch. By your own admission this patch does 4
> > > different things which one patch shouldn't.
> >
> > They're all part of the same thing, they're just cleanup.
>
> Cleanup? Something that changes behaviour is not cleanup.

A semantic cleanup remains a cleanup.

> > > > 1. Removes the barrier between kernel threads and user tasks wrt
> > > > dynamic priority handling.
> > >
> > > This is a bad idea. Subjecting kernel threads to a priority ceiling
> > > because they spend a long time idle is not the same as doing it to a user
> > > task that may be an idle bomb.
> >
> > Kernel threads don't make the transition, they're sitting there with a
> > fully loaded sleep_avg. You stop on the way up, sure, but once there, a
> > long sleep doesn't truncate you. Try it.
>
> Threads like kswapd do not have a fully loaded sleep_avg.

Huh? I've yet to see kswapd non-interactive. If I do, I'll have
grounds to party, because the box would otherwise have been a doorstop.

> > > Most kernel threads do sleep for extended periods and will always
> > > end up hitting this ceiling. That could lead to some difficult to
> > > understand latencies in scheduling of kernel threads, even if they are
> > > nice -5 because they'll expire very easily.
> >
> > No, they won't. Furthermore, any kernel thread which cannot tolerate
> > dynamic priority semantics should not use them, they should be RT.
>
> Very few kernel threads should require RT just to work as you just said, yet
> you'll make it harder for them with your changed dynamic priority semantics.
> We already know they are not going to be misbehaving userspace tasks and they
> deserve simpler semantics than the full interactivity estimator gives.

They don't need to be treated specially, and it is flat wrong to do so.
Try the patch set and you'll see that they continue to work just fine.

Consider this: why should 10 copies of knfsd have any more right than
10 copies of apache to utterly monopolize my cpu? Fact is, there is no
difference. Both are acting on behalf of other user tasks. The fact that
one is a kernel thread means nothing.

> > > > 2. Removes the priority barrier for IO.
> > >
> > > Bad again. This caused the biggest detriment on interbench numbers and is
> > > by far the most palpable interactivity killer in linux. I/O hurts us lots
> > > and this change will be very noticeable.
> >
> > This barrier is artificial, and has been demonstrated to be harmful to
> > some loads.
>
> Which?

Remember the demonstration of dd being starved?

> > Being practical, if this is demonstrated to still cause
> > trouble, I'll happily re-introduce it with a follow-up patch.
>
> Trouble occurs months after this hits mainline since you'll get almost
> no one testing it in -mm. It was put there because it hurt interactivity and
> you're removing it because you don't like that we have to put a special case
> there?

Throttling changes this. Believe it or not, I didn't just run around
with a machete looking for things to chop to ribbons.

> > > > 3. Treats TASK_INTERACTIVE as a transition point that all tasks must
> > > > stop at prior to being promoted further.
> > >
> > > Why? Makes no sense. You end up getting hiccups in the rise of priority
> > > of tasks instead of it happening smoothly with sleep.
> >
> > Quite the opposite, it makes perfect sense. Taking the long sleeper to
> > the artificial barrier of prio 16 as stock does is the very reason that
> > thud totally destroys the interactive experience. I'd love to remove
> > this barrier too, and have 'purity', but OTOH, the interactive border
> > _is_ a transition point where the scheduler changes behavior. This is a
> > transition point in fact, so treating it as such is indeed correct.
>
> And what does it actually achieve making this change is my question?

It achieves precisely what the comment you placed there some odd years
ago says it should achieve. Thud is a pure cpu hog that sleeps for a
while to charge its battery for an intentional attack. It was written
specifically to demonstrate the problem with allowing a long sleep to
rocket you to the top of the world as stock code does. Try running the
stock kernel with irman2 and thud running, then you'll certainly
understand what this change achieves.

> I feel this discussion may degenerate beyond this point. Should we say to
> agree to disagree at this point? I don't like these changes.

Why should it degenerate? You express your concerns, I answer them.
You don't have to agree with me, and I don't have to agree with you.
That's no reason for the discussion to degenerate.

-Mike



2006-03-24 13:53:07

by Con Kolivas

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

On Saturday 25 March 2006 00:02, Mike Galbraith wrote:
> On Fri, 2006-03-24 at 23:34 +1100, Con Kolivas wrote:
> > I feel this discussion may degenerate beyond this point. Should we say to
> > agree to disagree at this point? I don't like these changes.
>
> Why should it degenerate? You express your concerns, I answer them.
> You don't have to agree with me, and I don't have to agree with you.
> That's no reason for the discussion to degenerate.

By degenerate I mean get stuck in an endless loop. I don't have the energy for
cyclical discussions where no conclusive endpoint can ever be reached. That
is the nature of what we're discussing. Everything is some sort of compromise
and you're moving one set of compromises for another which means this can be
debated forever without a right or wrong.

The merits of throttling seem obvious with a sliding scale between
interactivity and fairness. These other changes, to me, do not. On the
interactivity side I only have interbench for hard data, and it is showing me
regressions consistent with my concerns from these other "cleanups". You'll
trade some other advantage for these and we'll repeat the discussion all over
again. Rinse and repeat.

Cheers,
Con

2006-03-24 14:09:28

by Mike Galbraith

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

On Sat, 2006-03-25 at 00:52 +1100, Con Kolivas wrote:
> On Saturday 25 March 2006 00:02, Mike Galbraith wrote:
> > On Fri, 2006-03-24 at 23:34 +1100, Con Kolivas wrote:
> > > I feel this discussion may degenerate beyond this point. Should we say to
> > > agree to disagree at this point? I don't like these changes.
> >
> > Why should it degenerate? You express your concerns, I answer them.
> > You don't have to agree with me, and I don't have to agree with you.
> > That's no reason for the discussion to degenerate.
>
> By degenerate I mean get stuck in an endless loop. I don't have the energy for
> cyclical discussions where no conclusive endpoint can ever be reached. That
> is the nature of what we're discussing. Everything is some sort of compromise
> and you're moving one set of compromises for another which means this can be
> debated forever without a right or wrong.
>
> The merits of throttling seem obvious with a sliding scale between
> interactivity and fairness. These other changes, to me, do not. On the
> interactivity side I only have interbench for hard data, and it is showing me
> regressions consistent with my concerns from these other "cleanups". You'll
> trade some other advantage for these and we'll repeat the discussion all over
> again. Rinse and repeat.

So be it. We shall agree to disagree.

I'll close with this. I'm not basing any of my changes on a benchmark,
I'm basing them on real life applications and exploits written by people
who were experiencing problems with their real life applications. It is
my carefully considered opinion that those changes which you so strongly
oppose are important.

-Mike

2006-03-25 00:25:18

by Peter Williams

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

Con Kolivas wrote:
> On Friday 24 March 2006 22:03, Mike Galbraith wrote:
>> Greetings,
>
> /me waves
>
>> Ignore timewarps caused by SMP timestamp rounding. Also, don't stamp a
>> task with a computed timestamp, stamp with the already called clock.
>
> Looks good. Actually, now < p->timestamp is not only going to happen on SMP.
> Every once in a while sched_clock seems to return a value that appears to
> have been in the past (I believe Peter has instrumented this).

I haven't bothered to check whether it still occurs for quite a long while.
I just check for time deltas being negative, and if they are negative I
make them zero and move on. As far as I can remember I only ever saw
this when measuring "delay" (i.e. time on the run queue waiting to get
on to a CPU, which can be quite short :-) when the system's not heavily
loaded), as the other time intervals that I measure (i.e. time on CPU and
sleep time) are generally long enough that the error in the delta isn't
big enough to make the value negative.

>
>> Signed-off-by: Mike Galbraith <[email protected]>
>
>> + __sleep_time = 0ULL;
>
> I don't think the ULL is necessary.
>
>> - unsigned long long now;
>> + unsigned long long now, comp;
>>
>> - now = sched_clock();
>> + now = comp = sched_clock();
>> #ifdef CONFIG_SMP
>> if (!local) {
>> /* Compensate for drifting sched_clock */
>> runqueue_t *this_rq = this_rq();
>> - now = (now - this_rq->timestamp_last_tick)
>> + comp = (now - this_rq->timestamp_last_tick)
>> + rq->timestamp_last_tick;
>> }
>> #endif
>>
>> if (!rt_task(p))
>> - p->prio = recalc_task_prio(p, now);
>> + p->prio = recalc_task_prio(p, comp);
>
> Seems wasteful of a very expensive (on 32bit) unsigned long long on
> uniprocessor builds.

Unsigned long long is necessary in order to avoid overflow when dealing
with nanoseconds, but (if you reorganized the expressions and made the
desired precedence explicit) you could probably use something smaller
for the difference between the two timestamp_last_tick values. More
importantly, I think that the original code which used the computed
"now" was correct, as otherwise the task's timestamp will not have the
correct time for its CPU.
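
For anyone wondering about the magnitude: a 32-bit unsigned long holds only
about 4.3 seconds worth of nanoseconds, which is why the 64-bit type is
needed for sleep intervals (plain arithmetic, nothing kernel-specific):

#include <stdio.h>

int main(void)
{
        unsigned long max32 = 0xffffffffUL;     /* largest 32-bit value */

        /* ~4.295 s: intervals longer than this overflow a 32-bit ns counter */
        printf("32-bit ns counter wraps after %.3f s\n", max32 / 1e9);
        return 0;
}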

Of course, this all hinges on the differences between the run queues'
timestamp_last_tick fields being a true measure of the time drift
between them. I've never been wholly convinced of that but as long as
any error is much smaller than the drift it's probably worth doing.

Peter
PS I think that some inline functions to handle timestamp adjustment
wouldn't hurt.
PPS I'm not sure that the timestamp adjustment in __migrate_task() is
completely valid as the timestamp will be modified in activate_task()
using the wrong clock. I need to study this more to see if I can convince
myself one way or the other.
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-03-25 00:37:37

by Peter Williams

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

Mike Galbraith wrote:
> Greetings,
>
> I've broken down my throttling tree into 6 patches, which I'll send as
> replies to this start-point.
>
> Patch 1/6
>
> Ignore timewarps caused by SMP timestamp rounding. Also, don't stamp a
> task with a computed timestamp, stamp with the already called clock.
>
> Signed-off-by: Mike Galbraith <[email protected]>
>
> --- linux-2.6.16-mm1/kernel/sched.c.org 2006-03-23 15:01:41.000000000 +0100
> +++ linux-2.6.16-mm1/kernel/sched.c 2006-03-23 15:02:25.000000000 +0100
> @@ -805,6 +805,16 @@
> unsigned long long __sleep_time = now - p->timestamp;
> unsigned long sleep_time;
>
> + /*
> + * On SMP systems, a task can go to sleep on one CPU and
> + * wake up on another. When this happens, the timestamp
> + * is rounded to the nearest tick,

Is this true? There's no rounding that I can see. An attempt is made
to adjust the timestamp for the drift between time as seen from the two
CPUs but there's no deliberate rounding involved. Of course, that's not
to say that the adjustment is accurate as I'm not convinced that the
difference between the run queues' timestamp_last_tick is always a
totally accurate measure of the drift in their clocks (due to possible
races -- I may be wrong).

Of course, that doesn't mean that this chunk of code isn't required, just
that the comment is misleading.

> which can lead to now
> + * being less than p->timestamp for short sleeps. Ignore
> + * these, they're insignificant.
> + */
> + if (unlikely(now < p->timestamp))
> + __sleep_time = 0ULL;
> +
> if (batch_task(p))
> sleep_time = 0;
> else {
> @@ -871,20 +881,20 @@
> */
> static void activate_task(task_t *p, runqueue_t *rq, int local)
> {
> - unsigned long long now;
> + unsigned long long now, comp;
>
> - now = sched_clock();
> + now = comp = sched_clock();
> #ifdef CONFIG_SMP
> if (!local) {
> /* Compensate for drifting sched_clock */
> runqueue_t *this_rq = this_rq();
> - now = (now - this_rq->timestamp_last_tick)
> + comp = (now - this_rq->timestamp_last_tick)
> + rq->timestamp_last_tick;
> }
> #endif
>
> if (!rt_task(p))
> - p->prio = recalc_task_prio(p, now);
> + p->prio = recalc_task_prio(p, comp);
>
> /*
> * This checks to make sure it's not an uninterruptible task

I think that this will end up with p->timestamp being set with a time
appropriate to the current task's CPU rather than its own.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-03-25 05:05:30

by Mike Galbraith

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

On Sat, 2006-03-25 at 11:25 +1100, Peter Williams wrote:
> Con Kolivas wrote:

> >> if (!rt_task(p))
> >> - p->prio = recalc_task_prio(p, now);
> >> + p->prio = recalc_task_prio(p, comp);
> >
> > Seems wasteful of a very expensive (on 32bit) unsigned long long on
> > uniprocessor builds.
>
> Unsigned long long is necessary in order to avoid overflow when dealing
> with nanoseconds, but (if you reorganized the expressions and made the
> desired precedence explicit) you could probably use something smaller
> for the difference between the two timestamp_last_tick values. More
> importantly, I think that the original code which used the computed
> "now" was correct, as otherwise the task's timestamp will not have the
> correct time for its CPU.

I can live with it either way. On my SMT box, the rounding is much
worse than the actual drift: the drift is microseconds, but the rounding
turns it into milliseconds.

> Of course, this all hinges on the differences between the run queues'
> timestamp_last_tick fields being a true measure of the time drift
> between them. I've never been wholly convinced of that but as long as
> any error is much smaller than the drift it's probably worth doing.

On a real SMP, this adjusting of timestamps probably helps (dunno, don't
have one); on my box it's doomed to do more harm than good. To me, the only
thing that really matters is ignoring the bogus transition.

-Mike

2006-03-25 05:11:14

by Mike Galbraith

Subject: Re: [2.6.16-mm1 patch] throttling tree patches

On Sat, 2006-03-25 at 11:37 +1100, Peter Williams wrote:
> Mike Galbraith wrote:
> > Greetings,
> >
> > I've broken down my throttling tree into 6 patches, which I'll send as
> > replies to this start-point.
> >
> > Patch 1/6
> >
> > Ignore timewarps caused by SMP timestamp rounding. Also, don't stamp a
> > task with a computed timestamp, stamp with the already called clock.
> >
> > Signed-off-by: Mike Galbraith <[email protected]>
> >
> > --- linux-2.6.16-mm1/kernel/sched.c.org 2006-03-23 15:01:41.000000000 +0100
> > +++ linux-2.6.16-mm1/kernel/sched.c 2006-03-23 15:02:25.000000000 +0100
> > @@ -805,6 +805,16 @@
> > unsigned long long __sleep_time = now - p->timestamp;
> > unsigned long sleep_time;
> >
> > + /*
> > + * On SMP systems, a task can go to sleep on one CPU and
> > + * wake up on another. When this happens, the timestamp
> > + * is rounded to the nearest tick,
>
> Is this true? There's no rounding that I can see.

Instrumenting it looked the same as rounding down, putting now in the
past was the result.

> Of course, that doesn't mean that this chunk of code isn't required just
> that the comment is misleading.

I'm not attached to the comment.

-Mike

2006-03-25 06:18:04

by Mike Galbraith

Subject: [2.6.16-mm1 patch] ignore timewarps

Greetings,

The patch below is a correction to patch 1 of my throttling tree series.

The only thing that really matters is that timewarps are ignored,
whatever the cause. This patch does that, and only that.

-Mike

Signed-off-by: Mike Galbraith <[email protected]>

--- linux-2.6.16-mm1/kernel/sched.c.org 2006-03-23 15:01:41.000000000 +0100
+++ linux-2.6.16-mm1/kernel/sched.c 2006-03-23 15:02:25.000000000 +0100
@@ -805,6 +805,15 @@
unsigned long long __sleep_time = now - p->timestamp;
unsigned long sleep_time;

+ /*
+ * On SMP systems, a task can go to sleep on one CPU
+ * and wake up on another. When this happens, now can
+ * end up being less than p->timestamp for short sleeps.
+ * Ignore these, they're insignificant.
+ */
+ if (unlikely(now < p->timestamp))
+ __sleep_time = 0;
+
if (batch_task(p))
sleep_time = 0;
else {