Linus, please pull the latest scheduler git tree from:
git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
three bugfixes:
- a nice fix from eagle-eye Oleg for a subtle typo in the balancing
code; the effect of this bug was more aggressive idle balancing. This
bug was introduced by one of the original CFS commits.
- a round of global->static fixes from Adrian Bunk - this change,
besides the cleanup effect, chops 100 bytes off sched.o.
- Peter Zijlstra noticed a sleeper-bonus bug. I kept this patch under
observation and testing this past week and saw no ill effects so far.
It could fix two suspected regressions. (It could improve Kasper
Sandberg's workload and it could improve the sleeper/runner
problem/bug Roman Zippel was seeing.)
test-built and test-booted on x86-32 and x86-64, and did a dozen
randconfig builds for good measure (which uncovered two new build errors
in latest -git).
Thanks,
Ingo
--------------->
Adrian Bunk (1):
sched: make global code static
Ingo Molnar (1):
sched: fix sleeper bonus
Oleg Nesterov (1):
sched: run_rebalance_domains: s/SCHED_IDLE/CPU_IDLE/
include/linux/cpu.h | 2 --
kernel/sched.c | 48 ++++++++++++++++++++++++------------------------
kernel/sched_fair.c | 12 ++++++------
3 files changed, 30 insertions(+), 32 deletions(-)
This is a second try with the correct email address; sorry for the duplicates.
On Sunday, 12 August 2007, Ingo Molnar wrote:
> Linus, please pull the latest scheduler git tree from:
Hello Ingo,
this is a followup to the discussion in
http://lkml.org/lkml/2007/7/19/538
Since 2.6.12, s390 has already done precise accounting for system and user time.
If CONFIG_VIRT_CPU_ACCOUNTING is set, we use two 64-bit hardware timers on
s390: the first returns the wall-clock time and is stepped even if the
virtual cpu is not backed by a physical cpu. The second timer is only
stepped when the virtual cpu is backed by a physical cpu. The timers have a
very high accuracy, and the architecture guarantees that bit 51 is
incremented once per microsecond. We store both timers on each context
switch, irq, syscall, and machine check in entry.S. The calculations are
done in arch/s390/kernel/vtime.c in account_system_vtime and friends with
microsecond accuracy. This is also used for irq accounting (see the
definition of irq_enter). It basically boils down to precise numbers in the
cpu stat and the utime/stime for processes, as well as knowledge about time
stolen by the hypervisor.
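Roughly, the scheme looks like this (all helper names below are hypothetical;
the real code lives in entry.S and arch/s390/kernel/vtime.c):

/*
 * Sketch of the two-timer scheme described above. Bit 51 of both
 * timers advances once per microsecond, so a right shift by 12
 * converts a timer delta into microseconds.
 */
static u64 last_wall, last_cpu;	/* snapshots taken on each kernel entry */

static void account_entry(u64 wall_now, u64 cpu_now, int user)
{
	u64 wall_us = (wall_now - last_wall) >> 12;
	u64 cpu_us = (cpu_now - last_cpu) >> 12;

	/* the CPU timer only stepped while a physical cpu backed us */
	if (user)
		account_user_time_us(current, cpu_us);		/* hypothetical */
	else
		account_system_time_us(current, cpu_us);	/* hypothetical */

	/* wall-clock time the CPU timer did not see was stolen */
	account_steal_time_us(wall_us - cpu_us);		/* hypothetical */

	last_wall = wall_now;
	last_cpu = cpu_now;
}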
With CFS the accounting was changed, and everything is now based on
sum_exec_runtime. There is now an accounting regression on s390 (and maybe
ppc64), as the default jiffy implementation does not know anything about
virtual cpus.
While looking for a solution, I started with a very quick hack and reverted
the procfs-related changes of commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa. If I
revert that commit, it seems that I get the old behaviour - but of course
this is just a hack.
I see some options now:
1. Jan could finish his sched_clock implementation for s390 and we would get
close to the precise numbers. This would also let CFS make better decisions.
Downside: it's not as precise as before, since we do some math on the
numbers, and it will burn cycles to compute numbers we already have
(utime = sum * utime / (utime + stime)).
2. set sum_exec_runtime based on the precise utime and stime. I don't know enough
about CFS to say whether this would show different scheduling behaviour than option 1.
3. ifdef fs/proc/array.c depending on CONFIG_VIRT_CPU_ACCOUNTING. This will
save some cycles, and the numbers are precise to a microsecond. Downside: the
scheduler gets no information about virtual cpus and steal time, so it's
probably not completely fair.
4. implement sched_clock AND reuse the existing utime and stime numbers.
5. other clever solutions I cannot see
Any suggestions?
Christian
Ingo,
this patch fixes the accounting regression for CONFIG_VIRT_CPU_ACCOUNTING. It
reverts parts of commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa by converting
fs/proc/array.c back to cputime_t. The new functions task_utime and
task_stime now return cputime_t instead of clock_t. If
CONFIG_VIRT_CPU_ACCOUNTING is set, task->utime and task->stime are returned
directly instead of using sum_exec_runtime.
Patch is tested on s390x with and without VIRT_CPU_ACCOUNTING as well as on
i386. Not tested on ppc64 - Paul?
Feedback is welcome.
Signed-off-by: Christian Borntraeger <[email protected]>
---
fs/proc/array.c | 41 ++++++++++++++++++++++++++---------------
1 file changed, 26 insertions(+), 15 deletions(-)
Index: linux-2.6/fs/proc/array.c
===================================================================
--- linux-2.6.orig/fs/proc/array.c
+++ linux-2.6/fs/proc/array.c
@@ -320,7 +320,18 @@ int proc_pid_status(struct task_struct *
return buffer - orig;
}
-static clock_t task_utime(struct task_struct *p)
+#if defined(CONFIG_VIRT_CPU_ACCOUNTING)
+static cputime_t task_utime(struct task_struct *p)
+{
+ return p->utime;
+}
+
+static cputime_t task_stime(struct task_struct *p)
+{
+ return p->stime;
+}
+#else
+static cputime_t task_utime(struct task_struct *p)
{
clock_t utime = cputime_to_clock_t(p->utime),
total = utime + cputime_to_clock_t(p->stime);
@@ -337,10 +348,10 @@ static clock_t task_utime(struct task_st
}
utime = (clock_t)temp;
- return utime;
+ return clock_t_to_cputime(utime);
}
-static clock_t task_stime(struct task_struct *p)
+static cputime_t task_stime(struct task_struct *p)
{
clock_t stime;
@@ -349,10 +360,12 @@ static clock_t task_stime(struct task_st
* the total, to make sure the total observed by userspace
* grows monotonically - apps rely on that):
*/
- stime = nsec_to_clock_t(p->se.sum_exec_runtime) - task_utime(p);
+ stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
+ cputime_to_clock_t(task_utime(p));
- return stime;
+ return clock_t_to_cputime(stime);
}
+#endif
static int do_task_stat(struct task_struct *task, char *buffer, int whole)
{
@@ -368,8 +381,7 @@ static int do_task_stat(struct task_stru
unsigned long long start_time;
unsigned long cmin_flt = 0, cmaj_flt = 0;
unsigned long min_flt = 0, maj_flt = 0;
- cputime_t cutime, cstime;
- clock_t utime, stime;
+ cputime_t cutime, cstime, utime, stime;
unsigned long rsslim = 0;
char tcomm[sizeof(task->comm)];
unsigned long flags;
@@ -387,8 +399,7 @@ static int do_task_stat(struct task_stru
sigemptyset(&sigign);
sigemptyset(&sigcatch);
- cutime = cstime = cputime_zero;
- utime = stime = 0;
+ cutime = cstime = utime = stime = cputime_zero;
rcu_read_lock();
if (lock_task_sighand(task, &flags)) {
@@ -414,15 +425,15 @@ static int do_task_stat(struct task_stru
do {
min_flt += t->min_flt;
maj_flt += t->maj_flt;
- utime += task_utime(t);
- stime += task_stime(t);
+ utime = cputime_add(utime, task_utime(t));
+ stime = cputime_add(stime, task_stime(t));
t = next_thread(t);
} while (t != task);
min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
- utime += cputime_to_clock_t(sig->utime);
- stime += cputime_to_clock_t(sig->stime);
+ utime = cputime_add(utime, sig->utime);
+ stime = cputime_add(stime, sig->stime);
}
sid = signal_session(sig);
@@ -471,8 +482,8 @@ static int do_task_stat(struct task_stru
cmin_flt,
maj_flt,
cmaj_flt,
- utime,
- stime,
+ cputime_to_clock_t(utime),
+ cputime_to_clock_t(stime),
cputime_to_clock_t(cutime),
cputime_to_clock_t(cstime),
priority,
* Christian Borntraeger <[email protected]> wrote:
> 1. Jan could finish his sched_clock implementation for s390 and we
> would get close to the precise numbers. This would also let CFS make
> better decisions. [...]
I think this is the best option and it should give us the same /proc
accuracy on s390 as before, plus improved scheduler precision (and
improved tracing accuracy, etc.). Note that for architectures that
already have sched_clock() at least as precise as the stime/utime stats
there's no problem - and that seems to include all architectures except
s390.
Could you send that precise sched_clock() patch? It should be an order
of magnitude simpler than the high-precision stime/utime tracking you
already do, and it's needed for quality scheduling anyway.
> [...] Downside: it's not as precise as before, since we do some math on the
> numbers, and it will burn cycles to compute numbers we already have
> (utime = sum * utime / (utime + stime)).
I can see no real downside to it: if all of stime, utime and
sum_exec_clock are precise, then the numbers we present via /proc are
precise too:
sum_exec * utime / (utime + stime);
There should be no loss of precision on s390 because the
multiplication/division rounding does not accumulate - we keep the
precise sum_exec, utime and stime values untouched.
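For illustration, a small userspace model of that split (this mirrors the
task_utime() logic in Christian's patch above; names and numbers are made up):

#include <stdint.h>
#include <stdio.h>

/*
 * The rounding happens once per read and is recomputed from the
 * untouched sums every time, so the error cannot accumulate.
 */
static uint64_t split_utime(uint64_t sum_exec, uint64_t utime, uint64_t stime)
{
	uint64_t total = utime + stime;

	/* note: the multiplication can overflow 64 bits for huge sums */
	return total ? sum_exec * utime / total : 0;
}

int main(void)
{
	/* e.g. 10s of runtime with a tick-sampled utime:stime of 600:400 */
	printf("utime = %llu ns\n",
	       (unsigned long long)split_utime(10000000000ULL, 600, 400));
	return 0;
}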
On x86 we don't really want to slow down every irq and syscall event with
precise stime/utime stats just for 'top' to display. On s390 the
multiplication and division is indeed superfluous, but it keeps the code
generic for arches where utime/stime is less precise and irq-sampled -
while the sum is always precise. It also encourages architectures that
have an imprecise sched_clock() implementation to improve its accuracy.
Accessing the /proc files alone is many orders of magnitude more
expensive than this simple multiplication and division.
Ingo
On Mon, 2007-08-20 at 17:45 +0200, Ingo Molnar wrote:
> * Christian Borntraeger <[email protected]> wrote:
>
> > 1. Jan could finish his sched_clock implementation for s390 and we
> > would get close to the precise numbers. This would also let CFS make
> > better decisions. [...]
>
> I think this is the best option and it should give us the same /proc
> accuracy on s390 as before, plus improved scheduler precision (and
> improved tracing accuracy, etc.). Note that for architectures that
> already have sched_clock() at least as precise as the stime/utime stats
> there's no problem - and that seems to include all architectures except
> s390.
So far we have used the TOD clock for sched_clock. This clock measures
real time with an accuracy of 1usec or better. The [us]time accounting
with CONFIG_VIRT_CPU_ACCOUNTING=y is done using the CPU timer. This
timer measures virtual time with an accuracy of 1usec or better. Without
CONFIG_VIRT_CPU_ACCOUNTING the [us]time accounting is done with HZ
ticks. This means that sched_clock() is at least as precise as [us]time
on s390 as well, only that we distinguish between real time and virtual
time if the improved accounting is used.
> Could you send that precise sched_clock() patch? It should be an order
> of magnitude simpler than the high-precision stime/utime tracking you
> already do, and it's needed for quality scheduling anyway.
Sure, if you can explain what it should do. This is still unclear to me:
for a non-idle CPU the virtual cpu time should be used, but for an idle
CPU the real time should be used? That seems rather ill-defined to me.
On s390 we have three times to consider: real time, virtual cpu time and
steal time. For a given period we have real = virtual + steal. And if a
cpu is idle we have real = steal, virtual = 0. My best interpretation of
what you want is that sched_clock should progress with virtual cpu time
if the current process is not idle and with the real time if it is. No?
> > [...] Downside: it's not as precise as before, since we do some math on the
> > numbers, and it will burn cycles to compute numbers we already have
> > (utime = sum * utime / (utime + stime)).
>
> I can see no real downside to it: if all of stime, utime and
> sum_exec_clock are precise, then the numbers we present via /proc are
> precise too:
>
> sum_exec * utime / (utime + stime);
>
> There should be no loss of precision on s390 because the
> multiplication/division rounding does not accumulate - we keep the
> precise sum_exec, utime and stime values untouched.
But then sched_clock() has to return the virtual cpu time only,
otherwise it will be hard to make sum_exec exact, wouldn't it?
And why should we jump through all these loops to come up with values
that are only as good as the values we already have?
> On x86 we don't really want to slow down every irq and syscall event with
> precise stime/utime stats just for 'top' to display. On s390 the
> multiplication and division is indeed superfluous, but it keeps the code
> generic for arches where utime/stime is less precise and irq-sampled -
> while the sum is always precise. It also encourages architectures that
> have an imprecise sched_clock() implementation to improve its accuracy.
> Accessing the /proc files alone is many orders of magnitude more
> expensive than this simple multiplication and division.
Yes, I can understand why you don't want to have the exact cpu
accounting scheme on x86 since it will slow down every context switch
quite a bit (that includes user <-> kernel, softirq <-> hardirq <->
process context, ...). On s390 the cost is acceptable: for an empty
system call it is about 40 additional cycles for the precise accounting.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* Martin Schwidefsky <[email protected]> wrote:
> > Could you send that precise sched_clock() patch? It should be an
> > order of magnitude simpler than the high-precision stime/utime
> > tracking you already do, and it's needed for quality scheduling
> > anyway.
>
> Sure, if you can explain what it should do. This is still unclear to
> me: for a non-idle CPU the virtual cpu time should be used, but for an
> idle CPU the real time should be used? That seems rather ill-defined
> to me. On s390 we have three times to consider: real time, virtual cpu
> time and steal time. For a given period we have real = virtual +
> steal. And if a cpu is idle we have real = steal, virtual = 0. My best
> interpretation of what you want is that sched_clock should progress
> with virtual cpu time if the current process is not idle and with the
> real time if it is. No?
The core scheduler is invariant to sched_clock()'s behavior during idle
periods [i.e. sched_clock() can do _anything_ during idle periods, and
scheduling would not/should not change], and that's the source of the
uncertainty you noted.
So the best way to proceed is still a bit unclear to me - but in any
case, a few boundary conditions are already cast into stone: we
definitely don't want to make life harder for s390 (without giving any
tangible benefits) and we don't want to regress any existing precision of
s390 either.
We seem to agree wrt. sched_clock()'s behavior while the virtual CPU is
busy: sched_clock() very much wants to track virtual time. (real time is
pretty much meaningless and coupling sched_clock() to real time would
make the virtual machine's behavior dependent on the host's load, which
breaks the "seamless virtualization to inside observers" common-sense
requirement of virtual-CPU scheduling.)
For sched_clock()'s behavior while the virtual CPU is idle: my current
idea for that is the patch below (a loosely analogous problem exists
with nohz/dynticks): it makes sched_clock() valid across idle periods
too and uses wall-clock time for that.
If a virtual CPU is idle then I think the "real = steal, virtual = 0"
way of thinking about idle looks a bit unnatural to me - wouldn't it be
better to think in terms of "steal = 0, virtual = real" ? Basically a
virtual CPU can idle at "perfect speed", without the host "stealing" any
cycles from it. And with that way of thinking, if s390 passed in the
real-idle-time value to the new callbacks below it would all fall into
place. Hm?
That way we'd have a meaningful sched_clock() across idle periods too,
useful for tracers, better scheduler debug-statistics, etc.
Ingo
---------->
Subject: sched: sched_clock_idle_[sleep|wakeup]_event()
From: Ingo Molnar <[email protected]>
construct a more or less wall-clock time out of sched_clock(), by
using ACPI-idle's existing knowledge about how much time we spent
idling. This allows the rq clock to work around TSC-stops-in-C2,
TSC-gets-corrupted-in-C3 type of problems.
( Besides the scheduler's statistics this also benefits blktrace and
printk-timestamps as well. )
Furthermore, the precise before-C2/C3-sleep and after-C2/C3-wakeup
callbacks allow the scheduler to get the most out of the period where
the CPU has a reliable TSC. This results in slightly more precise
task statistics.
the ACPI bits were acked by Len.
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Len Brown <[email protected]>
---
arch/i386/kernel/tsc.c | 1 -
drivers/acpi/processor_idle.c | 32 +++++++++++++++++++++++++-------
include/linux/sched.h | 3 ++-
kernel/sched.c | 41 ++++++++++++++++++++++++++++++++---------
kernel/sched_debug.c | 3 ++-
5 files changed, 61 insertions(+), 19 deletions(-)
Index: linux/arch/i386/kernel/tsc.c
===================================================================
--- linux.orig/arch/i386/kernel/tsc.c
+++ linux/arch/i386/kernel/tsc.c
@@ -292,7 +292,6 @@ static struct clocksource clocksource_ts
void mark_tsc_unstable(char *reason)
{
- sched_clock_unstable_event();
if (!tsc_unstable) {
tsc_unstable = 1;
tsc_enabled = 0;
Index: linux/drivers/acpi/processor_idle.c
===================================================================
--- linux.orig/drivers/acpi/processor_idle.c
+++ linux/drivers/acpi/processor_idle.c
@@ -63,6 +63,7 @@
ACPI_MODULE_NAME("processor_idle");
#define ACPI_PROCESSOR_FILE_POWER "power"
#define US_TO_PM_TIMER_TICKS(t) ((t * (PM_TIMER_FREQUENCY/1000)) / 1000)
+#define PM_TIMER_TICK_NS (1000000000ULL/PM_TIMER_FREQUENCY)
#define C2_OVERHEAD 4 /* 1us (3.579 ticks per us) */
#define C3_OVERHEAD 4 /* 1us (3.579 ticks per us) */
static void (*pm_idle_save) (void) __read_mostly;
@@ -462,6 +463,9 @@ static void acpi_processor_idle(void)
* TBD: Can't get time duration while in C1, as resumes
* go to an ISR rather than here. Need to instrument
* base interrupt handler.
+ *
+ * Note: the TSC better not stop in C1, sched_clock() will
+ * skew otherwise.
*/
sleep_ticks = 0xFFFFFFFF;
break;
@@ -469,6 +473,8 @@ static void acpi_processor_idle(void)
case ACPI_STATE_C2:
/* Get start time (ticks) */
t1 = inl(acpi_gbl_FADT.xpm_timer_block.address);
+ /* Tell the scheduler that we are going deep-idle: */
+ sched_clock_idle_sleep_event();
/* Invoke C2 */
acpi_state_timer_broadcast(pr, cx, 1);
acpi_cstate_enter(cx);
@@ -479,17 +485,22 @@ static void acpi_processor_idle(void)
/* TSC halts in C2, so notify users */
mark_tsc_unstable("possible TSC halt in C2");
#endif
+ /* Compute time (ticks) that we were actually asleep */
+ sleep_ticks = ticks_elapsed(t1, t2);
+
+ /* Tell the scheduler how much we idled: */
+ sched_clock_idle_wakeup_event(sleep_ticks*PM_TIMER_TICK_NS);
+
/* Re-enable interrupts */
local_irq_enable();
+ /* Do not account our idle-switching overhead: */
+ sleep_ticks -= cx->latency_ticks + C2_OVERHEAD;
+
current_thread_info()->status |= TS_POLLING;
- /* Compute time (ticks) that we were actually asleep */
- sleep_ticks =
- ticks_elapsed(t1, t2) - cx->latency_ticks - C2_OVERHEAD;
acpi_state_timer_broadcast(pr, cx, 0);
break;
case ACPI_STATE_C3:
-
/*
* disable bus master
* bm_check implies we need ARB_DIS
@@ -518,6 +529,8 @@ static void acpi_processor_idle(void)
t1 = inl(acpi_gbl_FADT.xpm_timer_block.address);
/* Invoke C3 */
acpi_state_timer_broadcast(pr, cx, 1);
+ /* Tell the scheduler that we are going deep-idle: */
+ sched_clock_idle_sleep_event();
acpi_cstate_enter(cx);
/* Get end time (ticks) */
t2 = inl(acpi_gbl_FADT.xpm_timer_block.address);
@@ -531,12 +544,17 @@ static void acpi_processor_idle(void)
/* TSC halts in C3, so notify users */
mark_tsc_unstable("TSC halts in C3");
#endif
+ /* Compute time (ticks) that we were actually asleep */
+ sleep_ticks = ticks_elapsed(t1, t2);
+ /* Tell the scheduler how much we idled: */
+ sched_clock_idle_wakeup_event(sleep_ticks*PM_TIMER_TICK_NS);
+
/* Re-enable interrupts */
local_irq_enable();
+ /* Do not account our idle-switching overhead: */
+ sleep_ticks -= cx->latency_ticks + C3_OVERHEAD;
+
current_thread_info()->status |= TS_POLLING;
- /* Compute time (ticks) that we were actually asleep */
- sleep_ticks =
- ticks_elapsed(t1, t2) - cx->latency_ticks - C3_OVERHEAD;
acpi_state_timer_broadcast(pr, cx, 0);
break;
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1388,7 +1388,8 @@ extern void sched_exec(void);
#define sched_exec() {}
#endif
-extern void sched_clock_unstable_event(void);
+extern void sched_clock_idle_sleep_event(void);
+extern void sched_clock_idle_wakeup_event(u64 delta_ns);
#ifdef CONFIG_HOTPLUG_CPU
extern void idle_task_exit(void);
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -262,7 +262,8 @@ struct rq {
s64 clock_max_delta;
unsigned int clock_warps, clock_overflows;
- unsigned int clock_unstable_events;
+ u64 idle_clock;
+ unsigned int clock_deep_idle_events;
u64 tick_timestamp;
atomic_t nr_iowait;
@@ -556,18 +557,40 @@ static inline struct rq *this_rq_lock(vo
}
/*
- * CPU frequency is/was unstable - start new by setting prev_clock_raw:
+ * We are going deep-idle (irqs are disabled):
*/
-void sched_clock_unstable_event(void)
+void sched_clock_idle_sleep_event(void)
{
- unsigned long flags;
- struct rq *rq;
+ struct rq *rq = cpu_rq(smp_processor_id());
- rq = task_rq_lock(current, &flags);
- rq->prev_clock_raw = sched_clock();
- rq->clock_unstable_events++;
- task_rq_unlock(rq, &flags);
+ spin_lock(&rq->lock);
+ __update_rq_clock(rq);
+ spin_unlock(&rq->lock);
+ rq->clock_deep_idle_events++;
+}
+EXPORT_SYMBOL_GPL(sched_clock_idle_sleep_event);
+
+/*
+ * We just idled delta nanoseconds (called with irqs disabled):
+ */
+void sched_clock_idle_wakeup_event(u64 delta_ns)
+{
+ struct rq *rq = cpu_rq(smp_processor_id());
+ u64 now = sched_clock();
+
+ rq->idle_clock += delta_ns;
+ /*
+ * Override the previous timestamp and ignore all
+ * sched_clock() deltas that occurred while we idled,
+ * and use the PM-provided delta_ns to advance the
+ * rq clock:
+ */
+ spin_lock(&rq->lock);
+ rq->prev_clock_raw = now;
+ rq->clock += delta_ns;
+ spin_unlock(&rq->lock);
}
+EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
/*
* resched_task - mark a task 'to be rescheduled now'.
Index: linux/kernel/sched_debug.c
===================================================================
--- linux.orig/kernel/sched_debug.c
+++ linux/kernel/sched_debug.c
@@ -154,10 +154,11 @@ static void print_cpu(struct seq_file *m
P(next_balance);
P(curr->pid);
P(clock);
+ P(idle_clock);
P(prev_clock_raw);
P(clock_warps);
P(clock_overflows);
- P(clock_unstable_events);
+ P(clock_deep_idle_events);
P(clock_max_delta);
P(cpu_load[0]);
P(cpu_load[1]);
On Mon, 2007-08-20 at 20:08 +0200, Ingo Molnar wrote:
> For sched_clock()'s behavior while the virtual CPU is idle: my current
> idea for that is the patch below (a loosely analoguous problem exists
> with nohz/dynticks): it makes sched_clock() valid across idle periods
> too and uses wall-clock time for that.
Ok, that would mean that sched_clock can just return the virtual cpu
time and the two hooks start and stop the idle periods as far as the
scheduler is concerned. In this case we can use the patch from Jan with
the new implementation for sched_clock and add the two hooks to the
places where the cpu-idle notifiers are done (do_monitor_call and
default_idle). In fact this could be an idle-notifier. Hmm, I'll take a
closer look tomorrow when I'm back at the office.
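Roughly like this (helper names are hypothetical, just to illustrate the wiring):

/*
 * Hypothetical sketch of hooking the s390 idle path into the two new
 * events; the real integration would go where the cpu-idle notifiers
 * are done today.
 */
static void s390_idle(void)
{
	u64 idle_enter, idle_exit;

	/* last rq-clock update before the idle period starts */
	sched_clock_idle_sleep_event();
	idle_enter = tod_to_ns(get_tod_clock());	/* hypothetical */

	enabled_wait();		/* hypothetical: wait for an interrupt */

	idle_exit = tod_to_ns(get_tod_clock());
	/* report the real idle time, i.e. "virtual = real" while idle */
	sched_clock_idle_wakeup_event(idle_exit - idle_enter);
}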
> If a virtual CPU is idle then I think the "real = steal, virtual = 0"
> way of thinking about idle looks a bit unnatural to me - wouldn't it be
> better to think in terms of "steal = 0, virtual = real" ? Basically a
> virtual CPU can idle at "perfect speed", without the host "stealing" any
> cycles from it. And with that way of thinking, if s390 passed in the
> real-idle-time value to the new callbacks below it would all fall into
> place. Hm?
How you think about an idle cpu depends on your viewpoint. The source
for the virtual cpu time on s390 is the cpu timer. This timer is stopped
when a virtual cpu loses the physical cpu, so it seems natural to me to
think that real=steal, virtual=0 because the cpu timer is stopped while
the cpu is idle. The other way of thinking about it is as valid though.
> That way we'd have a meaningful sched_clock() across idle periods too,
> useful for tracers, better scheduler debug-statistics, etc.
That would be good.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
Martin Schwidefsky wrote:
> On Mon, 2007-08-20 at 20:08 +0200, Ingo Molnar wrote:
>> For sched_clock()'s behavior while the virtual CPU is idle: my current
>> idea for that is the patch below (a loosely analogous problem exists
>> with nohz/dynticks): it makes sched_clock() valid across idle periods
>> too and uses wall-clock time for that.
>
> Ok, that would mean that sched_clock can just return the virtual cpu
> time and the two hooks start and stop the idle periods as far as the
> scheduler is concerned. In this case we can use the patch from Jan with
> the new implementation for sched_clock and add the two hooks to the
> places where the cpu-idle notifiers are done (do_monitor_call and
> default_idle). In fact this could be an idle-notifier. Hmm, I'll take a
> closer look tomorrow when I'm back at the office.
>
<snip>
I am partially responsible for the regression. While working on the
CPU accounting change, I for some unknown reason always assumed
that sched_clock() was virtualized. I should have taken a closer look.
Ingo, with this new approach, sched_clock(), although not virtualized,
advances as if it were (due to the idle state change accounting).
I have one question, though: what if the underlying CPU is forcefully
scheduled out from under the virtual CPU?
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Martin Schwidefsky <[email protected]> wrote:
> On Mon, 2007-08-20 at 20:08 +0200, Ingo Molnar wrote:
> > For sched_clock()'s behavior while the virtual CPU is idle: my current
> > idea for that is the patch below (a loosely analogous problem exists
> > with nohz/dynticks): it makes sched_clock() valid across idle periods
> > too and uses wall-clock time for that.
>
> Ok, that would mean that sched_clock can just return the virtual cpu
> time and the two hooks start and stop the idle periods as far as the
> scheduler is concerned. In this case we can use the patch from Jan
> with the new implementation for sched_clock and add the two hooks to
> the places where the cpu-idle notifiers are done (do_monitor_call and
> default_idle). In fact this could be an idle-notifier. Hmm, I'll take a
> closer look tomorrow when I'm back at the office.
OK, just to make sure wrt. release-management: you said s390
sched_clock() is currently at least as precise as stime/utime - so this
would suggest that there is no regression over v2.6.22? Regardless of
whether it's a live regression or not, I think we want the nohz
improvement (and the s390 patch if the callbacks are OK to you) in .23,
and we want to migrate all users of "raw" sched_clock() [blktrace,
softlockup-detector, print-timestamps, etc.] over to the better
cpu_clock() interface.
Ingo
* Martin Schwidefsky <[email protected]> wrote:
> > If a virtual CPU is idle then I think the "real = steal, virtual =
> > 0" way of thinking about idle looks a bit unnatural to me - wouldn't
> > it be better to think in terms of "steal = 0, virtual = real"?
> > Basically a virtual CPU can idle at "perfect speed", without the
> > host "stealing" any cycles from it. And with that way of thinking,
> > if s390 passed in the real-idle-time value to the new callbacks
> > below it would all fall into place. Hm?
>
> How you think about an idle cpu depends on your viewpoint. The source
> for the virtual cpu time on s390 is the cpu timer. This timer is
> stopped when a virtual cpu loses the physical cpu, so it seems
> natural to me to think that real=steal, virtual=0 because the cpu
> timer is stopped while the cpu is idle. The other way of thinking
> about it is as valid though.
My thinking is this: the structure of "idle time" only matters if it can
be observed from "within" a virtual machine - via timers. Are any of
the typical app-visible timers on s390 (timer_list, etc.) driven by the
virtual tick? [which slows down if a virtual CPU is scheduled away by
the host/monitor/hypervisor?] Or is the virtual tick only affecting
scheduling/cpu-accounting statistics, in essence?
Ingo
Ingo Molnar writes:
> We seem to agree wrt. sched_clock()'s behavior while the virtual CPU is
> busy: sched_clock() very much wants to track virtual time. (real time is
> pretty much meaningless and coupling sched_clock() to real time would
> make the virtual machine's behavior dependent on the host's load, which
> breaks the "seamless virtualization to inside observers" common-sense
> requirement of virtual-CPU scheduling.)
OK, that would imply that virtualized powerpc machines want to use the
PURR register for sched_clock(). The PURR basically counts time that
this virtual CPU (SMT thread) spends dispatching instructions, so it
excludes both the time taken by the hypervisor and the time (dispatch
cycles) taken by the other thread.
> For sched_clock()'s behavior while the virtual CPU is idle: my current
> idea for that is the patch below (a loosely analogous problem exists
> with nohz/dynticks): it makes sched_clock() valid across idle periods
> too and uses wall-clock time for that.
The straightforward thing is just to use the PURR all the time, even
during idle. What that means is this:
* If the other thread is active then sched_clock() will continue to
advance at a slow rate during idle (reflecting the fact that the
active thread is getting almost all of the dispatch cycles).
* If the other thread is idle and the partition is a "shared
processor" partition then sched_clock() will not advance, since in that
case we idle in the hypervisor.
* If the other thread is idle and the partition is a "dedicated
processor" partition then sched_clock() will advance at half speed
on both threads, since the two threads each get half of the dispatch
cycles.
It sounds like this behaviour should be OK - do you agree?
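Concretely, such a sched_clock() could look roughly like this (the
conversion helper is hypothetical, not the real arch/powerpc code):

/*
 * Sketch of a PURR-based sched_clock(): the PURR counts the dispatch
 * cycles this SMT thread received, so the result advances in virtual
 * time as described above.
 */
unsigned long long sched_clock(void)
{
	u64 purr = mfspr(SPRN_PURR);

	return purr_ticks_to_ns(purr);	/* hypothetical conversion */
}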
> If a virtual CPU is idle then I think the "real = steal, virtual = 0"
> way of thinking about idle looks a bit unnatural to me - wouldn't it be
> better to think in terms of "steal = 0, virtual = real" ? Basically a
Stolen time while the virtual CPU is idle gets (or at least, used to
get :) accounted as idle time.
Paul.
Ingo Molnar <[email protected]> writes:
You should just be using idle notifiers instead of adding more
and more custom hooks (like NOHZ already has). x86-64 still has them,
and there is an old patch around to add them to i386.
-Andi
On Monday, 20 August 2007, Martin Schwidefsky wrote:
> On Mon, 2007-08-20 at 20:08 +0200, Ingo Molnar wrote:
> Ok, that would mean that sched_clock can just return the virtual cpu
> time and the two hooks starts and stops the idle periods as far as the
> scheduler is concerned. In this case we can use the patch from Jan with
> the new implementation for sched_clock and add the two hooks to the
> places where the cpu-idle notifiers are done (do_monitor_call and
> default_idle). In fact this could be an idle-notifier. Hmm, I take a
> closer look tomorrow when I'm back at the office.
>
> > If a virtual CPU is idle then I think the "real = steal, virtual = 0"
> > way of thinking about idle looks a bit unnatural to me - wouldn't it be
> > better to think in terms of "steal = 0, virtual = real" ? Basically a
> > virtual CPU can idle at "perfect speed", without the host "stealing" any
> > cycles from it. And with that way of thinking, if s390 passed in the
> > real-idle-time value to the new callbacks below it would all fall into
> > place. Hm?
Martin,
I think we already do something like this. If you look at cpustat in 2.6.22
and earlier we already have steal increase = 0, idle increase = 100 % on an
idle cpu, even on s390. So while from the hardware perspective steal is
growing, we do the right thing in Linux, no?
Christian
* Andi Kleen <[email protected]> wrote:
> You should just be using idle notifiers instead of adding more
> and more custom hooks (like NOHZ already has). x86-64 still has them,
> and there is an old patch around to add them to i386.
These are specially placed callbacks that we want to call from certain
ACPI codepaths but not from all of them.
Ingo
On Monday, 20 August 2007, Ingo Molnar wrote:
> OK, just to make sure wrt. release-management: you said s390
> sched_clock() is currently at least as precise as stime/utime - so this
> would suggest that there is no regression over v2.6.22? Regardless of
On current git s390 has a sched_clock implementation based on real time. That
means we have no knowledge about steal time. E.g. if you only get 50% of your
physical cpu from the hypervisor, on 2.6.22 a single cpu-bound process has 50%
in top, while on 2.6.23-rc top shows 100% - so yes, there is a regression.
Christian
On Monday, 20 August 2007, Ingo Molnar wrote:
> Could you send that precise sched_clock() patch? It should be an order
> of magnitude simpler than the high-precision stime/utime tracking you
> already do, and it's needed for quality scheduling anyway.
I have a question about that. I just played with sched_clock, and even when I
intentionally slow down sched_clock by a factor of 2, my cpu-bound process
gets 100% in top. If this is intentional, I don't understand how a
virtualized sched_clock would fix the accounting change.
Thanks
Christian
* Christian Borntraeger <[email protected]> wrote:
> On Monday, 20 August 2007, Ingo Molnar wrote:
> > Could you send that precise sched_clock() patch? It should be an order
> > of magnitude simpler than the high-precision stime/utime tracking you
> > already do, and it's needed for quality scheduling anyway.
>
> I have a question about that. I just played with sched_clock, and even
> when I intentionally slow down sched_clock by a factor of 2, my cpu-bound
> process gets 100% in top. If this is intentional, I don't
> understand how a virtualized sched_clock would fix the accounting
> change.
Hm, on s390 does scheduler_tick() get driven in virtual time or in real
time? The very latest scheduler code will enforce a minimum rate of
sched_clock() across two scheduler_tick() calls (in rc3 and later
kernels). If sched_clock() "slows down" but scheduler_tick() still has a
real-time frequency then that impacts the quality of scheduling. So
scheduler_tick() and sched_clock() must really have the same behavior
(either both are virtual or both are real), so that scheduling becomes
invariant to steal-time.
Ingo
On Tue, 2007-08-21 at 10:42 +0200, Ingo Molnar wrote:
> * Christian Borntraeger <[email protected]> wrote:
>
> > On Monday, 20 August 2007, Ingo Molnar wrote:
> > > Could you send that precise sched_clock() patch? It should be an order
> > > of magnitude simpler than the high-precision stime/utime tracking you
> > > already do, and it's needed for quality scheduling anyway.
> >
> > I have a question about that. I just played with sched_clock, and even
> > when I intentionally slow down sched_clock by a factor of 2, my cpu-bound
> > process gets 100% in top. If this is intentional, I don't
> > understand how a virtualized sched_clock would fix the accounting
> > change.
>
> Hm, on s390 does scheduler_tick() get driven in virtual time or in real
> time? The very latest scheduler code will enforce a minimum rate of
> sched_clock() across two scheduler_tick() calls (in rc3 and later
> kernels). If sched_clock() "slows down" but scheduler_tick() still has a
> real-time frequency then that impacts the quality of scheduling. So
> scheduler_tick() and sched_clock() must really have the same behavior
> (either both are virtual or both are real), so that scheduling becomes
> invariant to steal-time.
scheduler_tick() is based on the HZ timer which uses the TOD clock =
real time. sched_clock() currently uses the TOD clock as well so in
regard to the new scheduler we currently do not have a problem. We have
a problem with cpu time accounting: the change to the /proc code breaks
the precise accounting on s390. To solve the cpu time accounting we need
to change sched_clock() to the cpu timer = virtual time. To change the
scheduler_tick() as well requires another patch and I fear it would
complicate things in the s390 backend.
And if you say that the scheduling becomes invariant to steal-time, how
is the cpu time accounting via sum_exec supposed to work if it does not
take steal-time into account?
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Tue, 2007-08-21 at 09:00 +0200, Christian Borntraeger wrote:
> On Monday, 20 August 2007, Martin Schwidefsky wrote:
> > On Mon, 2007-08-20 at 20:08 +0200, Ingo Molnar wrote:
> > Ok, that would mean that sched_clock can just return the virtual cpu
> > time and the two hooks start and stop the idle periods as far as the
> > scheduler is concerned. In this case we can use the patch from Jan with
> > the new implementation for sched_clock and add the two hooks to the
> > places where the cpu-idle notifiers are done (do_monitor_call and
> > default_idle). In fact this could be an idle-notifier. Hmm, I'll take a
> > closer look tomorrow when I'm back at the office.
> >
> > > If a virtual CPU is idle then I think the "real = steal, virtual = 0"
> > > way of thinking about idle looks a bit unnatural to me - wouldn't it be
> > > better to think in terms of "steal = 0, virtual = real" ? Basically a
> > > virtual CPU can idle at "perfect speed", without the host "stealing" any
> > > cycles from it. And with that way of thinking, if s390 passed in the
> > > real-idle-time value to the new callbacks below it would all fall into
> > > place. Hm?
>
> Martin,
>
> I think we already do something like this. If you look at cpustat in 2.6.22
> and earlier we already have steal increase = 0, idle increase = 100 % on an
> idle cpu, even on s390. So while from the hardware perspective steal is
> growing, we do the right thing in Linux, no?
This is done in kernel/sched.c:account_steal_time(). If the architecture
backend reports steal time for idle it is accounted as idle time.
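Condensed, the logic is roughly this (simplified sketch; the details in
kernel/sched.c may differ):

void account_steal_time(struct task_struct *p, cputime_t steal)
{
	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
	cputime64_t tmp = cputime_to_cputime64(steal);

	if (p == this_rq()->idle)
		/* steal reported against the idle task counts as idle */
		cpustat->idle = cputime64_add(cpustat->idle, tmp);
	else
		cpustat->steal = cputime64_add(cpustat->steal, tmp);
}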
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* Martin Schwidefsky <[email protected]> wrote:
> > > Hm, on s390 does scheduler_tick() get driven in virtual time or in
> > real time? The very latest scheduler code will enforce a minimum
> > rate of sched_clock() across two scheduler_tick() calls (in rc3 and
> > later kernels). If sched_clock() "slows down" but scheduler_tick()
> > still has a real-time frequency then that impacts the quality of
> > scheduling. So scheduler_tick() and sched_clock() must really have
> > the same behavior (either both are virtual or both are real), so
> > that scheduling becomes invariant to steal-time.
>
> scheduler_tick() is based on the HZ timer which uses the TOD clock =
> real time. sched_clock() currently uses the TOD clock as well so in
> regard to the new scheduler we currently do not have a problem. We
> have a problem with cpu time accounting: the change to the /proc code
> breaks the precise accounting on s390. To solve the cpu time
> accounting we need to change sched_clock() to the cpu timer = virtual
> time. To change the scheduler_tick() as well requires another patch
> and I fear it would complicate things in the s390 backend.
My feeling is that it gives us generally higher-quality scheduling if we
drive all things scheduler via virtual time. Do you agree with that?
> And if you say that the scheduling becomes invariant to steal-time,
> how is the cpu time accounting via sum_exec supposed to work if it
> does not take steal-time into account?
Right now there are two distinct and independent things: scheduler
behavior (the scheduling decisions the scheduler makes) and accounting
behavior.
the 'invariant' I mentioned only covers scheduler behavior, not
accounting behavior. Accounting is separate in theory, but coupled in
practice now via sum_exec_runtime.
Before we do a patch to decouple them again, let's make sure we agree on
the direction to take here. There are two ways to account within a
virtual machine: either in real time or in virtual time.
It seems you'd like accounting to be sensitive to 'external load' - i.e.
you'd like an 'internal' top to show the 'real' CPU accounting, right?
Wouldn't it be more consistent if a virtual box would not show any
dependency on external load? (i.e. it would slow down all of its
internal functionality transparently, without exposing it via /proc. The
only way to observe that would be the TOD interfaces: gettimeofday and
real-time clock driven POSIX timers. Even timer_list could be driven via
virtual time - although that would probably break user expectations,
right?) Or would accounting-in-virtual-time break user expectations too?
(most of the other hypervisors let guests account in virtual time.)
Ingo
Ingo Molnar writes:
> my feeling is that it gives us generally higher-quality scheduling if we
> drive all things scheduler via virtual time. Do you agree with that?
On PowerPC at least, while we can measure virtual time, there is no
hardware facility for getting an interrupt after a certain amount of
virtual time has elapsed, but there is a facility to get an interrupt
after an elapsed real-time interval. So sched_clock() could easily
change to measure virtual time, but I don't see how to make
scheduler_tick() be driven off virtual time. It sounds like s390 is
the same.
> It seems you'd like accounting to be sensitive to 'external load' - i.e.
> you'd like an 'internal' top to show the 'real' CPU accounting, right?
>
> Wouldn't it be more consistent if a virtual box would not show any
> dependency on external load? (i.e. it would slow down all of its
> internal functionality transparently, without exposing it via /proc. The
The tools that use the data in /proc get unhappy if user time + system
time + hardirq time + softirq time + idle time + stolen time doesn't
add up to real time. The way we handle that is by making stolen time
represent the time taken away by the hypervisor (and on PowerPC with
SMT, the time taken by the other thread too).
Paul.
On Tue, Aug 21, 2007 at 09:09:22AM +0200, Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > You should just be using idle notifiers instead of adding more
> > and more custom hooks (like NOHZ already has). x86-64 still has them,
> > and there is an old patch around to add them to i386.
>
> These are specially placed callbacks that we want to call from certain
> ACPI codepaths but not from all of them.
Because you believe TSC only stops in C2 and C3? That's not correct on
all systems.
-Andi
* Andi Kleen <[email protected]> wrote:
> On Tue, Aug 21, 2007 at 09:09:22AM +0200, Ingo Molnar wrote:
> >
> > * Andi Kleen <[email protected]> wrote:
> >
> > > You should just be using idle notifiers instead of adding
> > > more and more custom hooks (like NOHZ already has). x86-64 still
> > > has them, and there is an old patch around to add them to i386.
> >
> > These are specially placed callbacks that we want to call from
> > certain ACPI codepaths but not from all of them.
>
> Because you believe TSC only stops in C2 and C3? That's not correct on
> all systems.
I know there are some incredibly broken (but rare) boxes where the BIOS
will report that it only knows C1 but actually does C2. Is that the case you are
referring to, or is there something else too?
Ingo
On Tue, 2007-08-21 at 11:34 +0200, Ingo Molnar wrote:
> * Martin Schwidefsky <[email protected]> wrote:
>
> > > Hm, on s390 does scheduler_tick() get driven in virtual time or in
> > > real time? The very latest scheduler code will enforce a minimum
> > > rate of sched_clock() across two scheduler_tick() calls (in rc3 and
> > > later kernels). If sched_clock() "slows down" but scheduler_tick()
> > > still has a real-time frequency then that impacts the quality of
> > > scheduling. So scheduler_tick() and sched_clock() must really have
> > > the same behavior (either both are virtual or both are real), so
> > > that scheduling becomes invariant to steal-time.
> >
> > scheduler_tick() is based on the HZ timer which uses the TOD clock =
> > real time. sched_clock() currently uses the TOD clock as well so in
> > regard to the new scheduler we currently do not have a problem. We
> > have a problem with cpu time accounting: the change to the /proc code
> > breaks the precise accounting on s390. To solve the cpu time
> > accounting we need to change sched_clock() to the cpu timer = virtual
> > time. To change the scheduler_tick() as well requires another patch
> > and I fear it would complicate things in the s390 backend.
>
> My feeling is that it gives us generally higher-quality scheduling if we
> drive all things scheduler via virtual time. Do you agree with that?
Yes, I'm in favour of converting sched_clock to use virtual time. It
makes sense to me.
> > And if you say that the scheduling becomes invariant to steal-time,
> > how is the cpu time accounting via sum_exec supposed to work if it
> > does not take steal-time into account?
>
> Right now there are two distinct and independent things: scheduler
> behavior (the scheduling decisions the scheduler makes) and accounting
> behavior.
>
> the 'invariant' i mentioned only covers scheduler behavior, not
> accounting behavior. Accounting is separate in theory, but coupled in
> practice now via sum_exec_runtime.
Hmm, ok. But the fact is that right now the accounting via
sum_exec_runtime is broken in regard to virtual cpus, isn't it?
> Before we do a patch to decouple them again, let's make sure we agree on
> the direction to take here. There are two ways to account within a
> virtual machine: either in real time or in virtual time.
>
> It seems you'd like accounting to be sensitive to 'external load' - i.e.
> you'd like an 'internal' top to show the 'real' CPU accounting, right?
Yes, we want utime and stime to represent the time spent on the physical
cpu. To make up for the missing time the steal time field has been
introduced.
> Wouldn't it be more consistent if a virtual box would not show any
> dependency on external load? (i.e. it would slow down all of its
> internal functionality transparently, without exposing it via /proc. The
> only way to observe that would be the TOD interfaces: gettimeofday and
> real-time clock driven POSIX timers. Even timer_list could be driven via
> virtual time - although that would probably break user expectations,
> right?) Or would accounting-in-virtual-time break user expectations too?
> (most of the other hypervisors let guests account in virtual time.)
No, imho it is less consistent if the virtual box shows virtual time. If
you look at the top output as a user and it shows that some process used
x% of cpu, what does it tell you? With virtual cpus, next to nothing; you
have to normalize the numbers with the %steal while the process was
running. But even then it still is not a good number because the %steal
changes while a process is running. The only good solution is to use
virtual time for all cputime values.
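In other words (illustrative only):

/*
 * Recovering real cpu consumption from virtual-time numbers requires
 * the steal fraction over the very same sampling window - and %steal
 * changes while the process runs, so this is only an approximation.
 */
static double real_cpu_share(double virtual_share, double steal_fraction)
{
	return virtual_share * (1.0 - steal_fraction);
}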
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Tuesday, 21 August 2007, Ingo Molnar wrote:
> Wouldn't it be more consistent if a virtual box would not show any
> dependency on external load? (i.e. it would slow down all of its
> internal functionality transparently, without exposing it via /proc. The
> only way to observe that would be the TOD interfaces: gettimeofday and
> real-time clock driven POSIX timers. Even timer_list could be driven via
> virtual time - although that would probably break user expectations,
> right?) Or would accounting-in-virtual-time break user expectations too?
> (most of the other hypervisors let guests account in virtual time.)
Most hypervisors let guests account in virtual time because this requires no
code change. We had that in the past as well. But now we have lots of
customers that rely on a different accounting model. Before
CONFIG_VIRT_CPU_ACCOUNTING we got this model of top showing 99% of cpu used,
even if the hypervisor gives us 20% of the physical one. We actually want to
show that this process gets only 19.8% of a real cpu for several reasons:
- people do workload management based on used cycles
- people/departments pay money based on used cycles
- If your physical cpu has too much load, you want to identify processes by
physical cpu usage
There are even some vendors that claimed that Linux accounting was completely
broken and useless on s390 and that people should use their vendor tool to get the
right numbers. While these tools still have important features
CONFIG_VIRT_CPU_ACCOUNTING fixed at least the "broken" parts.
Christian
On Tuesday, 21 August 2007, Ingo Molnar wrote:
> the 'invariant' I mentioned only covers scheduler behavior, not
> accounting behavior. Accounting is separate in theory, but coupled in
> practice now via sum_exec_runtime.
Forgot to answer about that: That means that the current design does not cover
our requirement of showing process real time, even if we implement
sched_clock? In that case I would suggest merging my patch as a quick but
correct solution and doing it properly for 2.6.24. Of course better solutions
are welcome :-)
Christian
> I know there are some incredibly broken (but rare) boxes where the BIOS
> will report that it only knows C1 but actually does C2. Is that the case you are
> referring to, or is there something else too?
First, there are a couple of older and not-so-old (Centaur) chips that
generally stop the TSC in C1.
And there are also some boxes that shouldn't have anything deeper than C2 but
have trouble with the TSC. For example I got an Opteron machine (which
definitely shouldn't have any C2 since it's two-socket) where the
TSC appears to stop or at least slow down a lot in C1.
And thirdly it's just unclean to add all kinds of custom hooks
there. It was already ugly in NOHZ; please don't continue that.
-Andi
* Christian Borntraeger <[email protected]> wrote:
> On Tuesday, 21 August 2007, Ingo Molnar wrote:
> > the 'invariant' I mentioned only covers scheduler behavior, not
> > accounting behavior. Accounting is separate in theory, but coupled in
> > practice now via sum_exec_runtime.
>
> Forgot to answer about that: That means that the current design does
> not cover our requirement of showing process real time, even if we
> implement sched_clock? In that case I would suggest merging my patch
> as a quick but correct solution and doing it properly for 2.6.24. Of
> course better solutions are welcome :-)
You mean to revert b27f03d4bd? I'd really like to see this fixed for
real for s390 + CONFIG_VIRT_CPU_ACCOUNTING=y. (which seems to be the
only case affected)
Ingo
* Andi Kleen <[email protected]> wrote:
> > I know there are some incredibly broken (but rare) boxes where the BIOS
> > will report that it only knows C1 but actually does C2. Is that the case you are
> > referring to, or is there something else too?
>
> First, there are a couple of older and not-so-old (Centaur) chips that
> generally stop the TSC in C1.
There's not much we can do about them: the ACPI code doesn't measure
their idle time, right? This is mostly for statistics purposes, so
unless "broken" means tons of boxes, we don't have to have 100% coverage.
> And there are also some boxes that shouldn't have anything deeper than C2 but
> have trouble with the TSC. For example I got an Opteron machine (which
> definitely shouldn't have any C2 since it's two-socket) where the TSC
> appears to stop or at least slow down a lot in C1.
How much is "a lot" in C1? There's an AMD TSC-slows-down-C1 erratum but
that should be less than 1%. (which is fine enough for idle time
measurements)
> And thirdly it's just unclean to add all kinds of custom hooks there.
> It was already ugly in NOHZ; please don't continue that.
I'm not opposed to the idle notifiers, but IIRC the idle notifiers caused
problems in themselves, so part of them were reverted. We can do this
more cleanly in .24 - it will make the benefits of the notifier cleanup
even more apparent.
Ingo
On Tuesday, 21 August 2007, Ingo Molnar wrote:
> You mean to revert b27f03d4bd? I'd really like to see this fixed for
> real for s390 + CONFIG_VIRT_CPU_ACCOUNTING=y. (which seems to be the
> only case affected)
Not a complete revert, but an ifdef-workaround. I wrote this patch last week
and you were on cc:
http://marc.info/?l=linux-mm-commits&m=118737949222362&w=2
This patch should fix s390 and let everybody else use your new code. If you
are convinced that we get a better solution before 2.6.23 hits the street,
that's fine with me.
Christian
* Christian Borntraeger <[email protected]> wrote:
> On Monday, 20 August 2007, Ingo Molnar wrote:
> > Could you send that precise sched_clock() patch? It should be an order
> > of magnitude simpler than the high-precision stime/utime tracking you
> > already do, and it's needed for quality scheduling anyway.
>
> I have a question about that. I just played with sched_clock, and even
> when I intentionally slow down sched_clock by a factor of 2, my cpu-bound
> process gets 100% in top. If this is intentional, I don't
> understand how a virtualized sched_clock would fix the accounting
> change.
Could you try the patch below - does it work any better?
Ingo
---
kernel/sched.c | 9 +++++++++
1 file changed, 9 insertions(+)
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -333,6 +333,14 @@ static void __update_rq_clock(struct rq
#ifdef CONFIG_SCHED_DEBUG
WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
#endif
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+ /*
+ * Trust sched_clock on s390:
+ */
+ if (unlikely(delta > rq->clock_max_delta))
+ rq->clock_max_delta = delta;
+ clock += delta;
+#else
/*
* Protect against sched_clock() occasionally going backwards:
*/
@@ -355,6 +363,7 @@ static void __update_rq_clock(struct rq
clock += delta;
}
}
+#endif
rq->prev_clock_raw = now;
rq->clock = clock;
* Christian Borntraeger <[email protected]> wrote:
> On Tuesday, 21 August 2007, Ingo Molnar wrote:
> > You mean to revert b27f03d4bd? I'd really like to see this fixed for
> > real for s390 + CONFIG_VIRT_CPU_ACCOUNTING=y. (which seems to be the
> > only case affected)
>
> Not a complete revert, but an ifdef-workaround. I wrote this patch last week
> and you were on cc:
> http://marc.info/?l=linux-mm-commits&m=118737949222362&w=2
>
> This patch should fix s390 and let everybody else use your new code.
> If you are convinced that we get a better solution before 2.6.23 hits
> the street, that's fine with me.
What do you think about the rq_clock() #ifdef I did in the previous mail,
plus you making sched_clock() virtual? That way you can keep
scheduler_tick() driven by real time, although that generally will cause
artifacts with SMP load-balancing too (that was true in the past too).
But I don't mind your patch either - it's really the architecture's
choice how visible it wants to make external load to the task stats of
its virtual machines. I think it is more logical to say that 100% CPU
time displayed in 'top' means that the task got all the CPU time it
asked for from the virtual machine. (And if you are curious about how
much time was stolen from the virtual box altogether, you look at the
stolen-time stats in isolation.)
Ingo
* Martin Schwidefsky <[email protected]> wrote:
> > Wouldn't it be more consistent if a virtual box would not show any
> > dependency on external load? (i.e. it would slow down all of its
> > internal functionality transparently, without exposing it via /proc.
> > The only way to observe that would be the TOD interfaces:
> > gettimeofday and real-time clock driven POSIX timers. Even
> > timer_list could be driven via virtual time - although that would
> > probably break user expectations, right?) Or would
> > accounting-in-virtual-time break user expectations too? (most of the
> > other hypervisors let guests account in virtual time.)
>
> No, imho it is less consistent if the virtual box shows virtual time.
> If you look at the top output as a user and it shows that some process
> used x% of cpu, what does it tell you? [...]
It tells me something really important: that the virtual box's scheduler
gave this task as much CPU time as it could.
> [...] With virtual cpus, next to nothing; you have to normalize the
> numbers with the %steal while the process was running. But even then
> it still is not a good number because the %steal changes while a
> process is running. The only good solution is to use virtual time for
> all cputime values.
The steal percentage is really a concept one step higher in the
scheduling hierarchy. You should be running top (or an equivalent tool)
in the _hypervisor_ context if you want to know about how much time each
virtual box gets. 'mixing' that information with the 'internal' task
statistics of the virtual box is less consistent IMO and leads to the
loss of information. (with the 'mixing' method there's no way to tell
whether a task got all the CPU time it asked for - and _that_ is what an
admin is interested in, just as much as the time allocation between
virtual boxes.)
so in, say, KVM you determine the steal percentage by looking at 'top'
output in the hypervisor context. (or looking at steal% in the internal
top output) Looking at 'top' in the guest tells you everything internal
to that guest, without mixing external scheduling information into it.
Ingo
On Tue, 2007-08-21 at 13:36 +0200, Ingo Molnar wrote:
> * Martin Schwidefsky <[email protected]> wrote:
>
> > > Wouldn't it be more consistent if a virtual box would not show any
> > > dependency on external load? (i.e. it would slow down all of its
> > > internal functionality transparently, without exposing it via /proc.
> > > The only way to observe that would be the TOD interfaces:
> > > gettimeofday and real-time clock driven POSIX timers. Even
> > > timer_list could be driven via virtual time - although that would
> > > probably break user expectations, right?) Or would
> > > accounting-in-virtual-time break user expectations too? (most of the
> > > other hypervisors let guests account in virtual time.)
> >
> > No, imho it is less consistent if the virtual box shows virtual time.
> > If you look at the top output as a user and it shows that some process
> > used x% of cpu, what does it tell you? [...]
>
> it tells me something really important: that the virtual box's scheduler
> gave this task as much CPU time as it could.
So? As far as accounting is concerned the user doesn't care one bit what
the scheduler decided. The user cares how much cpu was used. If you want
to know how much of the cputime available to the virtual box was used
for a process, you just have to add/subtract the steal time. You then have
the complete picture of what happened. You cannot get the complete picture
if you do process accounting based on the TOD clock; you never know
how much cpu was actually spent on a process.
> > [...] With virtual cpus, next to nothing. You have to normalize the
> > numbers with the %steal while the process was running. But even then
> > it still is not a good number, because the %steal changes while a
> > process is running. The only good solution is to use virtual time for
> > all cputime values.
>
> the steal percentage is really a concept one step higher in the
> scheduling hierarchy. You should be running top (or an equivalent tool)
> in the _hypervisor_ context if you want to know about how much time each
> virtual box gets. 'mixing' that information with the 'internal' task
> statistics of the virtual box is less consistent IMO and leads to the
> loss of information. (with the 'mixing' method there's no way to tell
> whether a task got all the CPU time it asked for - and _that_ is what an
> admin is interested in, just as much as the time allocation between
> virtual boxes.)
Not really. If I look at a process, I want to know how much cpu it used -
not virtual but real cpu. And we learned this the hard way: trying to do
accounting with numbers that have to be normalized by numbers from the
hypervisor is just plain broken.
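To make the normalization problem concrete, a toy userspace sketch (made-up
numbers, nothing from the kernel): scaling guest-visible cpu time by an
averaged steal percentage is only correct if steal was spread uniformly
over the measurement window, which it generally is not.

#include <stdio.h>

int main(void)
{
	/* TOD-based (guest-visible) cpu share of one task over a window */
	double guest_pct = 60.0;
	/* average steal over the same window, from the hypervisor */
	double avg_steal_pct = 50.0;

	/* naive normalization: assume steal hit the task uniformly */
	double est_real_pct = guest_pct * (100.0 - avg_steal_pct) / 100.0;

	printf("estimated real cpu: %.0f%%\n", est_real_pct);
	/*
	 * Depending on when the steal actually happened, the true figure
	 * is anywhere between 10% and 50% here - a window-wide average
	 * cannot distinguish. Per-task virtual accounting measures the
	 * number directly instead of guessing.
	 */
	return 0;
}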
> so in, say, KVM you determine the steal percentage by looking at 'top'
> output in the hypervisor context. (or looking at steal% in the internal
> top output) Looking at 'top' in the guest tells you everything internal
> to that guest, without mixing external scheduling information into it.
Accounting numbers that are internal to the guest are 99.99% useless;
you always have to take the external scheduling into account. We chose
to do the accounting in a way that does not require additional steps to
get to useful numbers.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
> what do you think about the rq_clock() #ifdef i did in the previous mail,
> plus you making sched_clock() virtual? That way you can keep
> scheduler_tick() driven by real time, although that will generally cause
> artifacts with SMP load-balancing too. (that was true in the past as well)
I just did a test run with a virtual sched_clock and your patch.
Unfortunately, it doesn't work: top shows 100% for a cpu-bound process, but
steal time shows about 5% stolen cpu.
This brings me to another problem: runtime.
Let me give an example. You get 90% cpu from your hypervisor in a shared
environment. If you now start a cpu-bound task that gets the full cpu for,
let's say, 10 minutes, I REALLY want to see 9 minutes in ps and top, because
my department might pay for used cpu cycles.
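(Spelled out with those numbers: 10 wall-clock minutes at 90% backing is 9
minutes of real cpu, which is what per-task virtual accounting reports
directly - a TOD-based clock would report 10.)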
>
> but i don't mind your patch either - it's really the architecture's
> choice how visible it wants to make external load to the task stats of
> its virtual machines. I think it is more logical to say that 100% CPU
> time displayed in 'top' means that the task got all the CPU time it
> asked for from the virtual machine. (and if you are curious about how
> much time was stolen from the virtual box altogether you look at the
> stolen-time stats in isolation.)
Well, as I said, we started with the same approach (virtual cpu), but we
learned that these numbers have no meaning at all, because the hypervisor
has different scheduling timeslices and 100% inside the guest can still
amount to almost nothing if the system is really loaded.
Christian
* Christian Borntraeger <[email protected]> wrote:
> > but i don't mind your patch either - it's really the architecture's
> > choice how visible it wants to make external load to the task stats
> > of its virtual machines. I think it is more logical to say that 100%
> > CPU time displayed in 'top' means that the task got all the CPU time
> > it asked for from the virtual machine. (and if you are curious about
> > how much time was stolen from the virtual box altogether you look at
> > the stolen-time stats in isolation.)
>
> Well, as I said, we started with the same approach (virtual cpu), but
> we learned that these numbers have no meaning at all, because the
> hypervisor has different scheduling timeslices and 100% inside the
> guest can still amount to almost nothing if the system is really
> loaded.
hm, i think i must have used the wrong terminology, so let me describe
what i mean, so that we can argue this more efficiently ;-)
What i call "real time sched_clock()" is a sched_clock() that returns
the GTOD (the real time) of the hypervisor. I.e. sched_clock() advances
by 1 billion units every wall-clock second, in each guest.
A "virtual time sched_clock()" is a sched_clock() that returns only the
amount of time the virtual CPU was executed by the hypervisor. I.e. on a
3 times overloaded hypervisor with 3 guests it will advance 333 million
nanoseconds per 1 wall-clock second, in each guest. (it is 'virtual'
because the clock slows down as load goes up. In CFS-speak the virtual
clock is the "fair-clock".)
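A side-by-side sketch of the two flavours (illustrative only: get_tod_ns()
and get_cpu_timer_ns() are hypothetical helpers standing in for the s390
TOD clock and cpu timer, not real kernel interfaces):

/* hypothetical helpers - stand-ins for the s390 TOD clock / cpu timer */
extern unsigned long long get_tod_ns(void);
extern unsigned long long get_cpu_timer_ns(void);

/* "real time" flavour: advances with hypervisor wall time */
unsigned long long sched_clock_real(void)
{
	return get_tod_ns();	/* 1e9 units per wall-clock second */
}

/*
 * "virtual time" flavour: advances only while this virtual cpu is
 * backed by a physical one - roughly 333M ns per wall-clock second
 * on a 3x overloaded hypervisor:
 */
unsigned long long sched_clock_virt(void)
{
	return get_cpu_timer_ns();	/* stops while the vcpu is stolen */
}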
to me the right scheme for sched_clock() is the virtual variant: to
return the load-scaled nanoseconds. That way CFS will be able to
schedule fairly even if time has been "stolen" from a task [by virtue of
the hypervisor scheduling away the guest context without giving any
notice about this to the guest kernel] - because sched_clock() measures
the virtual time that got allocated to that guest by the hypervisor.
[ here i'm assuming precise host and precise guest statistics (which is
naturally the case if both are Linux), and in that context the virtual
numbers very much make sense, and whether 'top' displays 100% for a
sole CPU-bound task should be mostly a matter of tooling. ]
Ingo
On Tue, 2007-08-21 at 14:21 +0200, Ingo Molnar wrote:
> hm, i think i must have used the wrong terminology, so let me describe
> what i mean, so that we can argue this more efficiently ;-)
Ok, seems we need to be a bit more precise.
> What i call "real time sched_clock()" is a sched_clock() that returns
> the GTOD (the real time) of the hypervisor. I.e. sched_clock() advances
> by 1 billion units every wall-clock second, in each guest.
Which is what we call the TOD clock.
> A "virtual time sched_clock()" is a sched_clock() that returns only the
> amount of time the virtual CPU was executed by the hypervisor. I.e. on a
> 3 times overloaded hypervisor with 3 guests it will advance 333 million
> nanoseconds per 1 wall-clock second, in each guest. (it is 'virtual'
> because the clock slows down as load goes up. In CFS-speak the virtual
> clock is the "fair-clock".)
We 100% agree that sched_clock() should be virtual.
> to me the right scheme for sched_clock() is the virtual variant: to
> return the load-scaled nanoseconds. That way CFS will be able to
> schedule fairly even if time has been "stolen" from a task [by virtue of
> the hypervisor scheduling away the guest context without giving any
> notice about this to the guest kernel] - because sched_clock() measures
> the virtual time that got allocated to that guest by the hypervisor.
Ok, this means we will need Jan's patch that makes sched_clock() use the
cpu timer. You said that this change would require scheduler_tick() to
use virtual time as well, which would be the second patch. And then we
need a third patch that fixes the process accounting.
> [ here i'm assuming precise host and precise guest statistics (which is
> naturally the case if both are Linux), and in that context the virtual
> numbers very much make sense, and whether 'top' displays 100% for a
> sole CPU-bound task should be mostly a matter of tooling. ]
It is not only a matter of tooling. The tool needs enough precise
information to actually return what is being asked for. If the user wants
to know how much real cpu a process has used (and our users do want to
know that), fields 14-17 of /proc/<pid>/stat have to contain the real
cpu usage: user, system, cumulated user and cumulated system time. If
they contained the virtual cpu usage, you could not tell how much real
cpu a process used, even with access to the steal timer numbers. The
reason is that you would have to synchronize the scheduling points in
the guest and the hypervisor to come up with reasonable numbers. This is
something we obviously do not want to do.
The output of /proc/<pid>/stat is what Christian and I are complaining
about. Since the introduction of CFS with CONFIG_VIRT_CPU_ACCOUNTING=y,
the output of /proc/<pid>/stat has changed its meaning, and in our
opinion it is wrong now.
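(For reference, fields 14-17 of /proc/<pid>/stat are utime, stime, cutime
and cstime, in clock ticks. A minimal userspace sketch of reading them,
purely to show which numbers the two accounting schemes would fill in
differently:)

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[1024], *p;
	unsigned long utime, stime;
	long cutime, cstime;
	FILE *f = fopen("/proc/self/stat", "r");

	if (!f || !fgets(buf, sizeof(buf), f))
		return 1;
	fclose(f);

	/* comm (field 2) is parenthesized and may contain spaces; skip it */
	p = strrchr(buf, ')') + 2;

	/* skip fields 3-13, then read utime/stime/cutime/cstime (14-17) */
	sscanf(p, "%*s %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu %ld %ld",
	       &utime, &stime, &cutime, &cstime);

	printf("utime=%lu stime=%lu ticks (USER_HZ=%ld)\n",
	       utime, stime, sysconf(_SC_CLK_TCK));
	return 0;
}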
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Tuesday, 21 August 2007, you wrote:
> could you try the patch below, does it work any better?
I looked again at the scheduler code, and things get better when I run
the patch below on top of your patch and with our sched_clock prototype. I
guess there is a reason why you want rq->clock advanced by at least one tick?
We discussed calling scheduler_tick with virtual time as well.
Would it have the same result?
What would be the impact on latency?
After looking at the current s390 timer code, it seems that this kind of
change is not trivial enough to be rc3+ ready.
I personally think that for 2.6.23 we should use the patch against
fs/proc/array.c and defer everything else to 2.6.24.
Christian
---
kernel/sched.c | 6 ------
1 file changed, 6 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -3321,15 +3321,9 @@ void scheduler_tick(void)
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
struct task_struct *curr = rq->curr;
- u64 next_tick = rq->tick_timestamp + TICK_NSEC;
spin_lock(&rq->lock);
__update_rq_clock(rq);
- /*
- * Let rq->clock advance by at least TICK_NSEC:
- */
- if (unlikely(rq->clock < next_tick))
- rq->clock = next_tick;
rq->tick_timestamp = rq->clock;
update_cpu_load(rq);
if (curr != rq->idle) /* FIXME: needed? */
* Christian Borntraeger <[email protected]> wrote:
> On Tuesday, 21 August 2007, you wrote:
> > could you try the patch below, does it work any better?
>
> I looked again at the scheduler code, and things get better
> when I run the patch below on top of your patch and with our
> sched_clock prototype. I guess there is a reason why you want
> rq->clock advanced by at least one tick?
yeah - on PCs, if for whatever reason the TSC misbehaves (and that's
quite frequent), then this code sets a minimum boundary for behavior. If
sched_clock() is totally random, does not advance at all, or goes
backwards all the time, then rq_clock() still functions and, in essence,
falls back to jiffies-granularity behavior.
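(For reference, the safety net in isolation - the same lines the hunk above
removes, annotated with the failure mode described here:)

	u64 next_tick = rq->tick_timestamp + TICK_NSEC;

	__update_rq_clock(rq);
	/*
	 * If sched_clock() stalled or warped backwards since the last
	 * tick, rq->clock may not have advanced at all; force at least
	 * one tick of progress so the scheduler degrades to jiffies
	 * granularity instead of standing still:
	 */
	if (unlikely(rq->clock < next_tick))
		rq->clock = next_tick;
	rq->tick_timestamp = rq->clock;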
> We discussed calling scheduler_tick with virtual time as well.
> Would it have the same result?
> What would be the impact on latency?
if you call scheduler_tick() with virtual time then the "safety"
measures in rq_clock() do not kick in and sched_clock() behaves
correctly as far as the scheduler is concerned. (if everything is in
virtual time then the scheduler has no way to observe/notice that in
reality this is a virtual machine.)
> After looking at the current s390 timer code, it seems that this kind of
> change is not trivial enough to be rc3+ ready.
> I personally think that for 2.6.23 we should use the patch against
> fs/proc/array.c and defer everything else to 2.6.24.
yes, that has the least impact for .23 - i have added your array.c patch
to my queue.
Ingo