Andrew,
the following patch series is an update in response to your review.
The following points have been addressed:
- documentation for the high res / dyntick design
- documentation for timer_stats
- removal of the patches which modify timeout behaviour (i8042, slab, timeout
granularity). Those are definitely worth investigating further, but they are
not a fundamental part of the high res / dyntick feature.
- cleanup of enum -> int abuse
- namespace cleanup
- kernel doc fixups
- improved comments all over the place
- resolution of the bugs which were pointed out
- mismerge from -mm1 to -mm2 in the clockevents-i386 patch repaired
- rcu hackery removed: This was a leftover from an attempt to force the RCU
updates to be processed on the way to idle rather than waiting for the grace
period expiry. This is interesting for further reducing the idle wakeups with
respect to power saving, but is not necessary for the basic functionality of
the dyntick patch set.
We did not address the GTOD patches, as we want to wait for John's input on
your comments. The persistent clock modifications are useful in two ways:
1. completely remove manipulation of xtime related variables from the
architecture code
2. ensure the correct resume ordering
We experienced resume problems with earlier versions of the high resolution
timer / dyntick patches and were able to identify the unordered update as the
cause. After an initial workaround similar to the current code, John
resurrected his timeofday-persistant-xxx patch set, which integrates nicely
with the already merged GTOD functionality.
The series contains two new patches:
#09: hrtimer-enum-and-namespace-cleanup.patch
(new patch, resolves review issues vs. enums and namespaces)
#13: time-and-timer-documentation.patch
(Move hrtimer.txt to a new directory and add high res / dyntick
design notes)
A broken out series and a combined patch are available at the usual place:
http://tglx.de/projects/hrtimers/2.6.18-mm2/
The following design notes are also part of patch #13
High resolution timers and dynamic ticks design notes
-----------------------------------------------------
Further information can be found in the paper of the OLS 2006 talk "hrtimers
and beyond". The paper is part of the OLS 2006 Proceedings Volume 1, which can
be found on the OLS website:
http://www.linuxsymposium.org/2006/linuxsymposium_procv1.pdf
The slides to this talk are available from:
http://tglx.de/projects/hrtimers/ols2006-hrtimers.pdf
The slides contain five figures (pages 2, 15, 18, 20, 22), which illustrate the
changes in the time(r) related Linux subsystems. Figure #1 (p. 2) shows the
design of the Linux time(r) system before hrtimers and other building blocks
got merged into mainline.
Note: the paper and the slides talk about "clock event source", while we have
switched to the name "clock event devices" in the meantime.
The design contains the following basic building blocks:
- hrtimer base infrastructure
- timeofday and clock source management
- clock event management
- high resolution timer functionality
- dynamic ticks
hrtimer base infrastructure
---------------------------
The hrtimer base infrastructure was merged into the 2.6.16 kernel. Details of
the base implementation are covered in Documentation/hrtimer/hrtimer.txt. See
also figure #2 (OLS slides p. 15).
The main differences from the timer wheel, which holds the armed timer_list
type timers, are:
- time ordered enqueueing into a rb-tree
- independent of ticks (the processing is based on nanoseconds)
timeofday and clock source management
-------------------------------------
John Stultz's Generic Time Of Day (GTOD) framework moves a large portion of
code out of the architecture-specific areas into a generic management
framework, as illustrated in figure #3 (OLS slides p. 18). The architecture
specific portion is reduced to the low level hardware details of the clock
sources, which are registered in the framework and selected on the basis of a
quality rating. The low level code provides hardware setup and readout routines and
initializes data structures, which are used by the generic time keeping code to
convert the clock ticks to nanosecond based time values. All other time keeping
related functionality is moved into the generic code. The GTOD base patch got
merged into the 2.6.18 kernel.
Further information about the Generic Time Of Day framework is available in the
OLS 2005 Proceedings Volume 1:
http://www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf
The paper "We Are Not Getting Any Younger: A New Approach to Time and
Timers" was written by J. Stultz, D.V. Hart, & N. Aravamudan.
Figure #3 (OLS slides p.18) illustrates the transformation.
clock event management
----------------------
While clock sources provide read access to the monotonically increasing time
value, clock event devices are used to schedule the next event
interrupt(s). The next event is currently defined to be periodic, with its
period defined at compile time. The setup and selection of the event device
for various event driven functionalities is hardwired into the architecture
dependent code. This results in duplicated code across all architectures and
makes it extremely difficult to change the configuration of the system to use
event interrupt devices other than those already built into the
architecture. Another implication of the current design is that it is necessary
to touch all the architecture-specific implementations in order to provide new
functionality like high resolution timers or dynamic ticks.
The clock events subsystem tries to address this problem by providing a generic
solution to manage clock event devices and their usage for the various clock
event driven kernel functionalities. The goal of the clock event subsystem is
to minimize the clock event related architecture dependent code to the pure
hardware related handling and to allow easy addition and utilization of new
clock event devices. It also minimizes the duplicated code across the
architectures as it provides generic functionality down to the interrupt
service handler, which is almost inherently hardware dependent.
Clock event devices are registered either by the architecture dependent boot
code or at module insertion time. Each clock event device fills a data
structure with clock-specific property parameters and callback functions. The
clock event management decides, by using the specified property parameters, the
set of system functions a clock event device will be used to support. This
includes the distinction of per-CPU and per-system global event devices.
System-level global event devices are used for the Linux periodic tick. Per-CPU
event devices are used to provide local CPU functionality such as process
accounting, profiling, and high resolution timers.
The management layer assigns one or more of the following functions to a clock
event device:
- system global periodic tick (jiffies update)
- cpu local update_process_times
- cpu local profiling
- cpu local next event interrupt (non periodic mode)
The clock event device delegates the selection of those timer interrupt related
functions completely to the management layer. The clock management layer stores
a function pointer in the device description structure, which has to be called
from the hardware level handler. This removes a lot of duplicated code from the
architecture specific timer interrupt handlers and hands the control over the
clock event devices and the assignment of timer interrupt related functionality
to the core code.
The clock event layer API is rather small. Aside from the clock event device
registration interface it provides functions to schedule the next event
interrupt, clock event device notification service and support for suspend and
resume.
The framework adds about 700 lines of code which results in a 2KB increase of
the kernel binary size. The conversion of i386 removes about 100 lines of
code. The binary size decrease is in the range of 400 bytes. We believe that the
increase of flexibility and the avoidance of duplicated code across
architectures justifies the slight increase of the binary size.
The conversion of an architecture has no functional impact, but allows the
high resolution and dynamic tick functionalities to be utilized without any
change to the clock event device and timer interrupt code. After the
conversion, enabling high resolution timers and dynamic ticks is simply a
matter of adding the kernel/time/Kconfig file to the architecture specific
Kconfig and adding the dynamic tick specific calls to the idle routine (a
total of 3 lines added to the idle function and the Kconfig file).
Figure #4 (OLS slides p.20) illustrates the transformation.
high resolution timer functionality
-----------------------------------
During system boot it is not possible to use the high resolution timer
functionality, while making it possible would be difficult and would serve no
useful function. The initialization of the clock event device framework, the
clock source framework (GTOD) and hrtimers itself has to be done and
appropriate clock sources and clock event devices have to be registered before
the high resolution functionality can work. Up to the point where hrtimers are
initialized, the system works in the usual low resolution periodic mode. The
clock source and the clock event device layers provide notification functions
which inform hrtimers about availability of new hardware. hrtimers validates
the usability of the registered clock sources and clock event devices before
switching to high resolution mode. This also ensures that a kernel which is
configured for high resolution timers can run on a system which lacks the
necessary hardware support.
The high resolution timer code does not support SMP machines which have only
global clock event devices. The support of such hardware would involve IPI
calls when an interrupt happens. The overhead would be much larger than the
benefit. This is the reason why we currently disable high resolution and
dynamic ticks on i386 SMP systems which stop the local APIC in C3 power
state. A workaround is available as an idea, but the problem has not been
tackled yet.
The time ordered insertion of timers provides all the infrastructure to decide
whether the event device has to be reprogrammed when a timer is added. The
decision is made per timer base and synchronized across per-cpu timer bases in
a support function. The design allows the system to utilize separate per-CPU
clock event devices for the per-CPU timer bases, but currently only one
reprogrammable clock event device per CPU is utilized.
When the timer interrupt happens, the next event interrupt handler is called
from the clock event distribution code and moves expired timers from the
red-black tree to a separate double linked list and invokes the softirq
handler. An additional mode field in the hrtimer structure allows the system to
execute callback functions directly from the next event interrupt handler. This
is restricted to code which can safely be executed in the hard interrupt
context. This applies, for example, to the common case of a wakeup function as
used by nanosleep. The advantage of executing the handler in the interrupt
context is the avoidance of up to two context switches - from the interrupted
context to the softirq and to the task which is woken up by the expired
timer.
Once a system has switched to high resolution mode, the periodic tick is
switched off. This disables the per system global periodic clock event device -
e.g. the PIT on i386 SMP systems.
The periodic tick functionality is provided by a per-cpu hrtimer. The callback
function is executed in the next event interrupt context, updates jiffies and
calls update_process_times and profiling. The implementation of the hrtimer
based periodic tick is designed to be extended with dynamic tick functionality.
This allows a single clock event device to be used to schedule both high
resolution timer and periodic events (jiffies tick, profiling, process
accounting) on UP systems. This has been proven to work with the PIT on i386
and the Incrementer on PPC.
The softirq for running the hrtimer queues and executing the callbacks has been
separated from the tick bound timer softirq to allow accurate delivery of high
resolution timer signals which are used by itimer and POSIX interval
timers. The execution of this softirq can still be delayed by other softirqs,
but the overall latencies have been significantly improved by this separation.
Figure #5 (OLS slides p.22) illustrates the transformation.
dynamic ticks
-------------
Dynamic ticks are the logical consequence of the hrtimer based periodic tick
replacement (sched_tick). The functionality of the sched_tick hrtimer is
extended by three functions:
- hrtimer_stop_sched_tick
- hrtimer_restart_sched_tick
- hrtimer_update_jiffies
hrtimer_stop_sched_tick() is called when a CPU goes into idle state. The code
evaluates the next scheduled timer event (from both hrtimers and the timer
wheel) and, if the next event is further away than the next tick, it
reprograms the sched_tick to this future event, allowing longer idle sleeps
without needless interruption by the periodic tick. The function is also
called when an interrupt happens during the idle period which does not cause a
reschedule. The call is necessary as the interrupt handler might have armed a
new timer whose expiry time is before the time which was identified as the
nearest event in the previous call to hrtimer_stop_sched_tick().
hrtimer_restart_sched_tick() is called when the CPU leaves the idle state before
it calls schedule(). hrtimer_restart_sched_tick() resumes the periodic tick,
which is kept active until the next call to hrtimer_stop_sched_tick().
hrtimer_update_jiffies() is called from irq_enter() when an interrupt happens
in the idle period, to make sure that jiffies is up to date and the interrupt
handler does not have to deal with a possibly stale jiffies value.
The dynamic tick feature provides statistical values which are exported to
userspace via /proc/stat and can be made available for enhanced power
management control.
The implementation leaves room for further development like full tickless
systems, where the time slice is controlled by the scheduler, variable
frequency profiling, and a complete removal of jiffies in the future.
Aside from the current initial submission of i386 support, the patch set has
already been extended to x86_64 and ARM. Initial (work in progress) support is also
available for MIPS and PowerPC.
Thomas, Ingo
--
On Sun, 01 Oct 2006 22:59:01 -0000, Thomas Gleixner said:
> the following patch series is an update in response to your review.
This complains if you try to compile with -Werror-implicit-function-declaration
and rightly so, as we're missing a #include to define IS_ERR_VALUE().
Patch attached.
Signed-off-by: Valdis Kletnieks <[email protected]>
--- linux-2.6.18-mm2/kernel/hrtimer.c.buggy 2006-10-02 00:46:50.000000000 -0400
+++ linux-2.6.18-mm2/kernel/hrtimer.c 2006-10-02 01:02:55.000000000 -0400
@@ -43,6 +43,7 @@
#include <linux/clockchips.h>
#include <linux/profile.h>
#include <linux/seq_file.h>
+#include <linux/err.h>
#include <asm/uaccess.h>
--- linux-2.6.18-mm2/kernel/time/clockevents.c.buggy 2006-10-02 00:46:50.000000000 -0400
+++ linux-2.6.18-mm2/kernel/time/clockevents.c 2006-10-02 01:04:22.000000000 -0400
@@ -33,6 +33,7 @@
#include <linux/profile.h>
#include <linux/sysdev.h>
#include <linux/hrtimer.h>
+#include <linux/err.h>
#define MAX_CLOCK_EVENTS 4
#define GLOBAL_CLOCK_EVENT MAX_CLOCK_EVENTS
On Sun, 01 Oct 2006 22:59:01 -0000, Thomas Gleixner said:
> the following patch series is an update in response to your review.
First runtime results - no lockups or other severe badness in a half-hour or so
of running.
-mm2-hrt-dynticks5 shows severe clock drift issues if you run 'cpuspeed'.
Using speedstep-ich as a kernel built-in, and cpuspeed is invoked as:
cpuspeed -d -n -i 10 -p 10 50 -a /proc/acpi/ac_adapter/*/state
If cpuspeed drops the CPU speed from the default 1.6GHz down to 1.2GHz (the
only 2 speeds available on this core), the system clock proceeds to lose
about 15 seconds a minute. I haven't dug further into why yet. (If the system
is busy so cpuspeed keeps the processor at 1.6GHz, the clock doesn't drift
as much - so it looks like a "when speed is 1.2GHz" issue...)
I'm also seeing gkrellm reporting about 25% CPU use when "near-idle" (X is up
but not much is going on) when that's usually down around 5-6%. I need to
collect some oprofile numbers and investigate that as well.
On Mon, 2006-10-02 at 09:02 -0400, [email protected] wrote:
> On Sun, 01 Oct 2006 22:59:01 -0000, Thomas Gleixner said:
> > the following patch series is an update in response to your review.
>
> First runtime results - no lockups or other severe badness in a half-hour or so
> of running.
>
> -mm2-hrt-dynticks5 shows severe clock drift issues if you run 'cpuspeed'.
>
> Using speedstep-ich as a kernel built-in, and cpuspeed is invoked as:
>
> cpuspeed -d -n -i 10 -p 10 50 -a /proc/acpi/ac_adapter/*/state
>
> If cpuspeed drops the CPU speed from the default 1.6Ghz down to 1.2Ghz (the
> only 2 speeds available on this core), the system clock proceeds to lose
> about 15 seconds a minute. I haven't dug further into why yet. (If the system
> is busy so cpuspeed keeps the processor at 1.6Ghz, the clock doesn't drift
> as much - so it looks like a "when speed is 1.2Ghz" issue...)
Can you please send me the bootlog and further dmesg output (especially
when related to timers / cpufreq).
> I'm also seeing gkrellm reporting about 25% CPU use when "near-idle" (X is up
> but not much is going on) when that's usually down around 5-6%. I need to
> collect some oprofile numbers and investigate that as well.
I'll look into the accounting fixups again.
tglx
(Sorry for the size of the note, there's some 50K of logs attached)
On Mon, 02 Oct 2006 15:43:02 +0200, Thomas Gleixner said:
> Can you please send me the bootlog and further dmesg output (especially
> when related to timers / cpufreq).
I booted the box to single-user both times, and then started cpuspeed.
I then did a cat of /proc/interrupts, /proc/uptime, and a date command,
waited 60 seconds according to my watch, and repeated. I then dumped
the dmesg. The -dyntick kernel moved 'uptime' almost exactly 45 seconds
(almost certainly a by-product of running at 1.2Ghz rather than 1.6Ghz).
Does the dyntick code make any unwritten assumptions about a jiffy or
bogomips remaining constant?
Attached - config diff, date and /proc dumps from both -mm2 and -mm2-dyntick,
and the dmesg's from both boots.
Yell if you have any other questions/suggestions/etc..
> > I'm also seeing gkrellm reporting about 25% CPU use when "near-idle" (X is up
> > but not much is going on) when that's usually down around 5-6%. I need to
> > collect some oprofile numbers and investigate that as well.
>
> I'll look into the accounting fixups again.
I still need to get oprofile runs of this and see what's going on.
On Mon, 2006-10-02 at 14:25 -0400, [email protected] wrote:
> (Sorry for the size of the note, there's some 50K of logs attached)
>
> On Mon, 02 Oct 2006 15:43:02 +0200, Thomas Gleixner said:
> > Can you please send me the bootlog and further dmesg output (especially
> > when related to timers / cpufreq).
>
> I booted the box to single-user both times, and then started cpuspeed.
> I then did a cat of /proc/interrupts, /proc/uptime, and a date command,
> waited 60 seconds according to my watch, and repeated. I then dumped
> the dmesg. The -dyntick kernel moved 'uptime' almost exactly 45 seconds
> (almost certainly a by-product of running at 1.2Ghz rather than 1.6Ghz).
> Does the dyntick code make any unwritten assumptions about a jiffie or
> bogomips remaining constant?
>
> Attached - config diff, date and /proc dumps from both -mm2 and -mm2-dyntick,
> and the dmesg's from both boots.
>
> Yell if you have any other questions/suggestions/etc..
Hmmm. So w/ -mm2 we're seeing the TSC get detected as running too slowly
(and it's replaced w/ the ACPI PM), but for some reason that doesn't
happen w/ the dynticks patch.
Now, how is cpuspeed changing the cpufreq? Is it using the /sys
interface? I've got hooks in so when the cpufreq changes we should mark
it unstable and fall back to ACPI PM, but maybe I missed whatever hook
cpuspeed is using.
thanks
-john
On Mon, 2006-10-02 at 14:25 -0400, [email protected] wrote:
> > > I'm also seeing gkrellm reporting about 25% CPU use when "near-idle" (X is up
> > > but not much is going on) when that's usually down around 5-6%. I need to
> > > collect some oprofile numbers and investigate that as well.
> >
> > I'll look into the accounting fixups again.
> I still need to get oprofile runs of this and see what's going on.
The patch below fixes the accounting weirdness.
tglx
----------------->
Subject: dynticks core: Fix idle time accounting
From: Thomas Gleixner <[email protected]>
The extended sleeps during idle must be accounted to the idle thread.
The original accounting fixup was too naive. The time must be accounted
when the idle thread is interrupted and the jiffies update code has
forwarded jiffies. Otherwise the accounting is done on random targets.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
--
kernel/hrtimer.c | 34 +++++++++++++++++++---------------
1 file changed, 19 insertions(+), 15 deletions(-)
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c 2006-10-02 19:22:14.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c 2006-10-02 19:22:14.000000000 +0200
@@ -44,6 +44,7 @@
#include <linux/profile.h>
#include <linux/seq_file.h>
#include <linux/err.h>
+#include <linux/kernel_stat.h>
#include <asm/uaccess.h>
@@ -447,10 +448,11 @@ static const ktime_t nsec_per_hz = { .tv
* want to wake up a complete idle cpu just to update jiffies, so we need
* something more intellegent than a mere "do this only on CPUx".
*/
-static void update_jiffies64(ktime_t now)
+static unsigned long update_jiffies64(ktime_t now)
{
unsigned long seq;
ktime_t delta;
+ unsigned long ticks = 0;
/* Preevaluate to avoid lock contention */
do {
@@ -459,14 +461,13 @@ static void update_jiffies64(ktime_t now
} while (read_seqretry(&xtime_lock, seq));
if (delta.tv64 < nsec_per_hz.tv64)
- return;
+ return 0;
/* Reevalute with xtime_lock held */
write_seqlock(&xtime_lock);
delta = ktime_sub(now, last_jiffies_update);
if (delta.tv64 >= nsec_per_hz.tv64) {
- unsigned long ticks = 1;
delta = ktime_sub(delta, nsec_per_hz);
last_jiffies_update = ktime_add(last_jiffies_update,
@@ -480,11 +481,13 @@ static void update_jiffies64(ktime_t now
last_jiffies_update = ktime_add_ns(last_jiffies_update,
incr * ticks);
- ticks++;
}
+ ticks++;
do_timer(ticks);
}
write_sequnlock(&xtime_lock);
+
+ return ticks;
}
#ifdef CONFIG_NO_HZ
@@ -500,7 +503,7 @@ static void update_jiffies64(ktime_t now
void hrtimer_update_jiffies(void)
{
struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
- unsigned long flags;
+ unsigned long flags, ticks;
ktime_t now;
if (!cpu_base->tick_stopped || !cpu_base->hres_active)
@@ -509,7 +512,17 @@ void hrtimer_update_jiffies(void)
now = ktime_get();
local_irq_save(flags);
- update_jiffies64(now);
+ ticks = update_jiffies64(now);
+ if (ticks) {
+ /*
+ * We stopped the tick in idle and this function got called to
+ * update jiffies. Update process times would randomly account
+ * the time we slept to whatever the context of the next sched
+ * tick is. Enforce that this is accounted to idle !
+ */
+ account_system_time(current, HARDIRQ_OFFSET,
+ jiffies_to_cputime(ticks));
+ }
local_irq_restore(flags);
}
@@ -604,15 +617,6 @@ void hrtimer_restart_sched_tick(void)
local_irq_disable();
update_jiffies64(now);
- /*
- * Update process times would randomly account the time we slept to
- * whatever the context of the next sched tick is. Enforce that this
- * is accounted to idle !
- */
- add_preempt_count(HARDIRQ_OFFSET);
- update_process_times(0);
- sub_preempt_count(HARDIRQ_OFFSET);
-
/* Account the idle time */
delta = ktime_sub(now, cpu_base->idle_entrytime);
cpu_base->idle_sleeptime = ktime_add(cpu_base->idle_sleeptime, delta);
On Mon, 02 Oct 2006 11:38:36 PDT, john stultz said:
> Hmmm. So w/ -mm2 we're seeing the TSC get detected as running too slowly
> (and its replaced w/ the ACPI PM), but for some reason that doesn't
> happen w/ the dynticks patch.
It's been switching to ACPI PM for somewhere near forever, I never bothered
to check into that because the PM timer provides a reasonably stable clock
source (it drifts at about 24 ppm and NTP is happy with it, and I haven't
gotten annoyed at the fact the PM timer is slow to read...)
I wonder if the TSC has been broken for forever on this box, and I'm just
seeing it because dynticks doesn't fall over to PM timer..
> Now, how is cpuspeed changing the cpufreq? Is it using the /sys
> interface? I've got hooks in so when the cpufreq changes we should mark
> it unstable and fall back to ACPI PM, but maybe I missed whatever hook
> cpuspeed is using.
Looking at the source, it appears to do this:
const char SYSFS_CURRENT_SPEED_FILE[] =
"/sys/devices/system/cpu/cpu%u/cpufreq/scaling_setspeed";
// set the current CPU speed
void set_speed(unsigned value)
{
#ifdef DEBUG
fprintf(stderr, "[cpu%u] Setting speed to: %uKHz\n", cpu, value);
#endif
write_line(CURRENT_SPEED_FILE, "%u\n", value);
// give CPU / chipset voltage time to settle down
usleep(10000);
}
On Mon, 2006-10-02 at 15:08 -0400, [email protected] wrote:
> On Mon, 02 Oct 2006 11:38:36 PDT, john stultz said:
>
> > Hmmm. So w/ -mm2 we're seeing the TSC get detected as running too slowly
> > (and its replaced w/ the ACPI PM), but for some reason that doesn't
> > happen w/ the dynticks patch.
>
> It's been switching to ACPI PM for somewhere near forever, I never bothered
> to check into that because the PM timer provides a reasonably stable clock
> source (it drifts at about 24 ppm and NTP is happy with it, and I haven't
> gotten annoyed at the fact the PM timer is slow to read...)
>
> I wonder if the TSC has been broken for forever on this box, and I'm just
> seeing it because dynticks doesn't fall over to PM timer..
This is what I suspect is the issue. Probably due to the new jiffies
accounting now being time based, and one of the TSC unstable checks (the
one you're tripping) being jiffies based. A tad bit circular :). I'm
working w/ tglx to see what we can do here.
> > Now, how is cpuspeed changing the cpufreq? Is it using the /sys
> > interface? I've got hooks in so when the cpufreq changes we should mark
> > it unstable and fall back to ACPI PM, but maybe I missed whatever hook
> > cpuspeed is using.
>
> Looking at the source, it appears to do this:
>
> const char SYSFS_CURRENT_SPEED_FILE[] =
> "/sys/devices/system/cpu/cpu%u/cpufreq/scaling_setspeed";
>
> // set the current CPU speed
> void set_speed(unsigned value)
> {
> #ifdef DEBUG
> fprintf(stderr, "[cpu%u] Setting speed to: %uKHz\n", cpu, value);
> #endif
> write_line(CURRENT_SPEED_FILE, "%u\n", value);
> // give CPU / chipset voltage time to settle down
> usleep(10000);
> }
I'll also take a peek there and see if I cannot add an extra hook, so we
don't have to rely on the jiffies stability check.
thanks
-john
On Mon, 02 Oct 2006 20:43:26 +0200, Thomas Gleixner said:
>
> The patch below fixes the accounting weirdness.
I think it's still slightly defective, or at least suffering from a
disjoint between what is going on - the numbers reported in /proc/stats
add up to the total number of timer interrupts, but that's not necessarily
representative of what happened...
% cat /proc/stat;sleep 15;cat /proc/stat
cpu 27634 0 7762 20470 881 331 252 0
cpu0 27634 0 7762 20470 881 331 252 0
intr 812332 631476 2960 0 4 4 12667 3 14 1 1 4 142891 114 0 22193 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 2187603
btime 1159817297
processes 4028
procs_running 1
procs_blocked 0
nohz total I:397276 S:379955 T:1187.393123 A:0.003125 E: 629447
cpu 27753 0 7818 20739 881 332 253 0
cpu0 27753 0 7818 20739 881 332 253 0
intr 819027 636542 2969 0 4 4 12801 3 14 1 1 4 144371 114 0 22199 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 2209881
btime 1159817297
processes 4033
procs_running 1
procs_blocked 0
nohz total I:401991 S:384494 T:1200.732924 A:0.003122 E: 634513
And the deltas between the sums for cpu0 are equal to the difference of
the first intr (where the timer is) - ticking along at about 446/sec over
that 15 second timeframe. And sure enough, the 'user' field is about 1/3 of
the total interrupts.
The breakage is that userspace tools like gkrellm and vmstat and top are quite
happy to say "oh, we averaged 446 ticks/sec over the last N seconds? That's
odd, but I can deal..." but unfortunately treat all of the ticks as the same
"width" - and the idle ones are probably twice as wide if not wider. I'm not
sure how to fix that.
(The "thought experiment" for this - imagine over a 10 second period, an idle
machine takes 100 short timeslices for a running process, and 100 very long
sleeps 10 times as long as the first 100. What should /proc/stats report at
that point?)
On Mon, 2006-10-02 at 16:17 -0400, [email protected] wrote:
> cpu 27634 0 7762 20470 881 331 252 0
> cpu0 27634 0 7762 20470 881 331 252 0
> intr 812332 631476 2960 0 4 4 12667 3 14 1 1 4 142891 114 0 22193 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> ctxt 2187603
> btime 1159817297
> processes 4028
> procs_running 1
> procs_blocked 0
> nohz total I:397276 S:379955 T:1187.393123 A:0.003125 E: 629447
> cpu 27753 0 7818 20739 881 332 253 0
> cpu0 27753 0 7818 20739 881 332 253 0
> intr 819027 636542 2969 0 4 4 12801 3 14 1 1 4 144371 114 0 22199 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> ctxt 2209881
> btime 1159817297
> processes 4033
> procs_running 1
> procs_blocked 0
> nohz total I:401991 S:384494 T:1200.732924 A:0.003122 E: 634513
Strange.
/me digs deeper
tglx
On Mon, 02 Oct 2006 23:22:38 +0200, Thomas Gleixner said:
> On Mon, 2006-10-02 at 16:17 -0400, [email protected] wrote:
> > cpu 27634 0 7762 20470 881 331 252 0
> > cpu0 27634 0 7762 20470 881 331 252 0
> > intr 812332 631476 2960 0 4 4 12667 3 14 1 1 4 142891 114 0 22193 [run of zeros trimmed]
> > ctxt 2187603
> > btime 1159817297
> > processes 4028
> > procs_running 1
> > procs_blocked 0
> > nohz total I:397276 S:379955 T:1187.393123 A:0.003125 E: 629447
> > cpu 27753 0 7818 20739 881 332 253 0
> > cpu0 27753 0 7818 20739 881 332 253 0
> > intr 819027 636542 2969 0 4 4 12801 3 14 1 1 4 144371 114 0 22199 [run of zeros trimmed]
> > ctxt 2209881
> > btime 1159817297
> > processes 4033
> > procs_running 1
> > procs_blocked 0
> > nohz total I:401991 S:384494 T:1200.732924 A:0.003122 E: 634513
>
> Strange.
>
> /me digs deeper
Not really strange at all - between code inspection and checking other stuff,
I'm now convinced the *counts* of "was the previous tick user/nice/system/idle"
reported in the cpu0 lines are accurate and report the relative counts
correctly. The problem is that userspace tools are assuming that all the ticks
reported are created equal. "We had 200 ticks, total, 100 were user and 100
were idle, so we were at 50/50 user/idle" - but in reality we had 100 10-ms
user ticks and 100 100-ms idle ticks, and were only about 10% busy.....
We could "pump up" the relative counts - if one no-hz tick was 5 ms long,
increment the count by 5 rather than 1 (for an alleged 1 kHz tick).
However, when we do that, we break the property that the sum of the ticks
in the 'cpu0' line is equal to the number of timer interrupts reported in the
'intr' line.
Like I said - unclear how to fix this....
On Sun, 01 Oct 2006 23:00:45 -0000
Thomas Gleixner <[email protected]> wrote:
> We did not address the GTOD patches, as we want to wait for John's input on
> your comments.
I note that the default CONFIG_HIGH_RES_RESOLUTION is still 1000 (one
microsecond), which is far higher resolution than you actually recommend.
I did query that last time around. I'd prefer not to have to go back and
re-review it all, please...
These patches make my Vaio run really really slowly. Maybe a quarter of
the normal speed or lower. Bisection shows that the bug is introduced by
clockevents-drivers-for-i386.patch+clockevents-drivers-for-i386-fix.patch
With all patches applied, the slowdown happens with
CONFIG_HIGH_RES_TIMERS=n and also with CONFIG_HIGH_RES_TIMERS=y &&
CONFIG_NO_HZ=y. So something got collaterally damaged.
I put various helpful stuff at http://userweb.kernel.org/~akpm/x/
I uploaded all the patches I was using to
http://userweb.kernel.org/~akpm/x/patches/
It doesn't seem to be a cpufreq thing: cpuinfo_min_freq=800kHz,
cpuinfo_max_freq=2GHz and cpuinfo_cur_freq goes up to 2GHz under load.
Wall time is increasing at one second per second.
On Mon, 2006-10-02 at 21:00 -0700, Andrew Morton wrote:
> These patches make my Vaio run really really slowly. Maybe a quarter of
> the normal speed or lower. Bisection shows that the bug is introduced by
> clockevents-drivers-for-i386.patch+clockevents-drivers-for-i386-fix.patch
>
> With all patches applied, the slowdown happens with
> CONFIG_HIGH_RES_TIMERS=n and also with CONFIG_HIGH_RES_TIMERS=y &&
> CONFIG_NO_HZ=y. So something got collaterally damaged.
>
> I put various helpful stuff at http://userweb.kernel.org/~akpm/x/
> I uploaded all the patches I was using to
> http://userweb.kernel.org/~akpm/x/patches/
That's basically the same set I have here +/- the fixups
> It doesn't seem to be a cpufreq thing: cpuinfo_min_freq=800kHz,
> cpuinfo_max_freq=2GHz and cpuinfo_cur_freq goes up to 2GHz under load.
> Wall time is increasing at one second per second.
I'll retest on my Vaio.
tglx
* Andrew Morton <[email protected]> wrote:
> These patches make my Vaio run really really slowly. Maybe a quarter
> of the normal speed or lower. Bisection shows that the bug is
> introduced by
> clockevents-drivers-for-i386.patch+clockevents-drivers-for-i386-fix.patch
>
> With all patches applied, the slowdown happens with
> CONFIG_HIGH_RES_TIMERS=n and also with CONFIG_HIGH_RES_TIMERS=y &&
> CONFIG_NO_HZ=y. So something got collaterally damaged.
yeah, i suspect it works again if you disable:
CONFIG_X86_UP_APIC=y
CONFIG_X86_UP_IOAPIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
as the slowdown has the feeling of a runaway lapic timer irq.
from code review so far we can only see an udelay(10) difference in the
initialization sequence of the PIT - we'll send a fix for that but i
dont think that's the cause of the bug.
investigating it.
Ingo
* Ingo Molnar <[email protected]> wrote:
> yeah, i suspect it works again if you disable:
>
> CONFIG_X86_UP_APIC=y
> CONFIG_X86_UP_IOAPIC=y
> CONFIG_X86_LOCAL_APIC=y
> CONFIG_X86_IO_APIC=y
>
> as the slowdown has the feeling of a runaway lapic timer irq.
>
> from code review so far we can only see an udelay(10) difference in
> the initialization sequence of the PIT - we'll send a fix for that but
> i dont think that's the cause of the bug.
the patch below fixes that particular bug. But ... the symptoms you are
describing have the feeling of being apic related.
Ingo
-------------------->
Subject: clockevents: drivers for i386, fix #2
From: Ingo Molnar <[email protected]>
add back a mistakenly removed udelay(10) to the PIT initialization
sequence.
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/i386/kernel/i8253.c | 1 +
1 file changed, 1 insertion(+)
Index: linux/arch/i386/kernel/i8253.c
===================================================================
--- linux.orig/arch/i386/kernel/i8253.c
+++ linux/arch/i386/kernel/i8253.c
@@ -45,6 +45,7 @@ static void init_pit_timer(enum clock_ev
outb_p(0x34, PIT_MODE);
udelay(10);
outb_p(LATCH & 0xff , PIT_CH0); /* LSB */
+ udelay(10);
outb(LATCH >> 8 , PIT_CH0); /* MSB */
break;
On Mon, 2006-10-02 at 17:35 -0400, [email protected] wrote:
> We could "pump up" the relative counts - if one no-hz tick was 5 ms long,
> increment the count by 5 rather than 1 (for an alleged 1 kHz tick).
> However, when we do that, we break the property that the sum of the ticks
> in the 'cpu0' line is equal to the number of timer interrupts reported in the
> 'intr' line.
I found a way to fix my thinkos. I put up a queue with all fixes to:
http://www.tglx.de/projects/hrtimers/2.6.18-mm3/patch-2.6.18-mm3-hrt-dyntick1.patches.tar.bz2
Can you please verify if it makes your problem go away ?
tglx
On Tue, 2006-10-03 at 22:02 +0200, Thomas Gleixner wrote:
> I found a way to fix my thinkos. I put up a queue with all fixes to:
>
> http://www.tglx.de/projects/hrtimers/2.6.18-mm3/patch-2.6.18-mm3-hrt-dyntick1.patches.tar.bz2
>
> Can you please verify if it makes your problem go away ?
Please use dyntick2, as #1 is missing a fix. Sorry.
tglx
On Tue, 03 Oct 2006 22:02:30 +0200, Thomas Gleixner said:
> I found a way to fix my thinkos. I put up a queue with all fixes to:
>
> http://www.tglx.de/projects/hrtimers/2.6.18-mm3/patch-2.6.18-mm3-hrt-dyntick1.patches.tar.bz2
>
> Can you please verify if it makes your problem go away ?
Was -dyntick3 by the time I got there.
The user/system/idle/wait numbers now look sane, with one caveat:
static const ktime_t nsec_per_hz = { .tv64 = NSEC_PER_SEC / HZ };
...
if (unlikely(delta.tv64 >= nsec_per_hz.tv64)) {
s64 incr = ktime_to_ns(nsec_per_hz);
ticks = ktime_divns(delta, incr);
Even though I have CONFIG_HZ=1000, this ends up generating a synthetic
count that works out to 100 per second. gkrellm and vmstat are happy with
that state of affairs, but I'm not sure why it came out to 100/sec rather
than 1000/sec.
% cat /proc/stat /proc/uptime
cpu 28224 4688 9159 168290 9143 283 256 0
cpu0 28224 4688 9159 168290 9143 283 256 0
intr 818891 627337 3466 0 4 4 6459 3 7 1 1 4 160328 115 0 21162 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 971441
btime 1159926408
processes 4544
procs_running 1
procs_blocked 0
nohz total I:367986 S:302473 T:1737.640072 A:0.005744 E: 625327
2176.02 1775.11
Also, it still disagrees with speedstep - it isn't noticing that the TSC
has gone slow and dropping back to the PM timer.
All in all, we're making progress. ;)
On Tue, 3 Oct 2006 12:35:03 +0200
Ingo Molnar <[email protected]> wrote:
>
> add back a mistakenly removed udelay(10) to the PIT initialization
> sequence.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/i386/kernel/i8253.c | 1 +
> 1 file changed, 1 insertion(+)
>
> Index: linux/arch/i386/kernel/i8253.c
> ===================================================================
> --- linux.orig/arch/i386/kernel/i8253.c
> +++ linux/arch/i386/kernel/i8253.c
> @@ -45,6 +45,7 @@ static void init_pit_timer(enum clock_ev
> outb_p(0x34, PIT_MODE);
> udelay(10);
> outb_p(LATCH & 0xff , PIT_CH0); /* LSB */
> + udelay(10);
> outb(LATCH >> 8 , PIT_CH0); /* MSB */
> break;
>
Doesn't help.
> * Ingo Molnar <[email protected]> wrote:
>
> > yeah, i suspect it works again if you disable:
> >
> > CONFIG_X86_UP_APIC=y
> > CONFIG_X86_UP_IOAPIC=y
> > CONFIG_X86_LOCAL_APIC=y
> > CONFIG_X86_IO_APIC=y
> >
> > as the slowdown has the feeling of a runaway lapic timer irq.
> >
Disabling IO_APIC doesn't fix the slowdown.
Disabling LOCAL_APIC does fix it.
* Andrew Morton <[email protected]> wrote:
> Disabling LOCAL_APIC does fix it.
thanks, that narrows it down quite a bit. (We've double-checked the
lapic path and it seemed all our changes were NOPs, but obviously they
aren't, and we'll check it all again.)
(if you have that kernel still booted by any chance then do you see the
'LOC' IRQ count in /proc/interrupts or any other count in /proc/stats
increasing at an alarming rate? That would narrow it down to lapic timer
misprogramming.)
Ingo
On Wed, 4 Oct 2006 08:46:20 +0200
Ingo Molnar <[email protected]> wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > Disabling LOCAL_APIC does fix it.
>
> thanks, that narrows it down quite a bit. (We've double-checked the
> lapic path and it seemed all our changes were NOPs, but obviously they
> aren't, and we'll check it all again.)
>
> (if you have that kernel still booted by any chance then do you see the
> 'LOC' IRQ count in /proc/interrupts or any other count in /proc/stats
> increasing at an alarming rate? That would narrow it down to lapic timer
> misprogramming.)
>
None of the interrupts are doing anything wrong. oprofile shows nothing
alarming.
Disabling cpufreq in config doesn't fix it.
Userspace can count to a billion in 3.9 seconds when this problem is
present, which is the same time as it takes on a non-slow kernel.
`sleep 5' takes 5 seconds.
Yet initscripts take a long time (especially applying the ipfilter firewall
rules for some reason), and `startx' takes a long time, etc.  This kernel
takes 112 seconds to boot to a login prompt - other kernels take 56 seconds
(interesting ratio..)
Weird.
* Andrew Morton <[email protected]> wrote:
> Yet initscripts take a long time (especially applying the ipfilter
> firewall rules for some reason), and `startx' takes a long time, etc.
> This kernel takes 112 seconds to boot to a login prompt - other
> kernels take 56 seconds (interesting ratio..)
you are still using the non-hres config, correct? (so this is still
collateral damage on vanilla kernel functionality)
Ingo
On Wed, 4 Oct 2006 09:41:42 +0200
Ingo Molnar <[email protected]> wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > Yet initscripts take a long time (especially applying the ipfilter
> > firewall rules for some reason), and `startx' takes a long time, etc.
> > This kernel takes 112 seconds to boot to a login prompt - other
> > kernels take 56 seconds (interesting ratio..)
>
> you are still using the non-hres config, correct? (so this is still
> collateral damage on vanilla kernel functionality)
>
yup.
* Andrew Morton <[email protected]> wrote:
> None of the interrupts are doing anything wrong. oprofile shows
> nothing alarming.
>
> Disabling cpufreq in config doesn't fix it.
>
> Userspace can count to a billion in 3.9 seconds when this problem is
> present, which is the same time as it takes on a non-slow kernel.
>
> `sleep 5' takes 5 seconds.
>
> Yet initscripts take a long time (especially applying the ipfilter
> firewall rules for some reason), and `startx' takes a long time, etc.
> This kernel takes 112 seconds to boot to a login prompt - other
> kernels take 56 seconds (interesting ratio..)
hm, do you have the NMI watchdog enabled by any chance? [in particular,
do you have nmi_watchdog=2?] Although your bootlog does not show it.
Ingo
* [email protected] <[email protected]> wrote:
> Even though I have CONFIG_HZ=1000, this ends up generating a synthetic
> count that works out to 100 per second. gkrellm and vmstat are happy
> with that state of affairs, but I'm not sure why it came out to
> 100/sec rather than 1000/sec.
that's how it worked for quite some time: all userspace APIs are
HZ-independent and depend on USER_HZ (which is 100 even if HZ is 1000).
Ingo
On Wed, 4 Oct 2006 09:55:40 +0200
Ingo Molnar <[email protected]> wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > None of the interrupts are doing anything wrong. oprofile shows
> > nothing alarming.
> >
> > Disabling cpufreq in config doesn't fix it.
> >
> > Userspace can count to a billion in 3.9 seconds when this problem is
> > present, which is the same time as it takes on a non-slow kernel.
> >
> > `sleep 5' takes 5 seconds.
> >
> > Yet initscripts take a long time (especially applying the ipfilter
> > firewall rules for some reason), and `startx' takes a long time, etc.
> > This kernel takes 112 seconds to boot to a login prompt - other
> > kernels take 56 seconds (interesting ratio..)
>
> hm, do you have the NMI watchdog enabled by any chance? [in particular,
> do you have nmi_watchdog=2?] Although your bootlog does not show it.
>
There's no nmi_watchdog setting in the kernel boot command line and the
NMI counter isn't incrementing.
On Wed, 04 Oct 2006 09:56:57 +0200, Ingo Molnar said:
>
> * [email protected] <[email protected]> wrote:
>
> > Even though I have CONFIG_HZ=1000, this ends up generating a synthetic
> > count that works out to 100 per second. gkrellm and vmstat are happy
> > with that state of affairs, but I'm not sure why it came out to
> > 100/sec rather than 1000/sec.
>
> that's how it worked for quite some time: all userspace APIs are
> HZ-independent and depend on USER_HZ (which is 100 even if HZ is 1000).
Nevermind - I missed where fs/proc/proc_misc.c applied jiffies_64_to_clock_t()
to the number before handing it to userspace. So the numbers *were* being
kept in terms of HZ (as my reading of the code indicated), they just didn't
manage to escape to userspace that way....
* Andrew Morton <[email protected]> wrote:
> > hm, do you have the NMI watchdog enabled by any chance? [in
> > particular, do you have nmi_watchdog=2?] Although your bootlog does
> > not show it.
> >
>
> There's no nmi_watchdog setting in the kernel boot command line and
> the NMI counter isn't incrementing.
there's one material difference we just found: in the !hres case we'll
do the timer IRQ handling mostly from the lapic vector - while in
mainline we do it from the irq0 vector. So, what does your
/proc/interrupts look like? How frequently does LOC increase, and how
frequently does IRQ 0 increase?
(meanwhile we'll fix and restore things so that it matches mainline
behavior.)
Ingo
On Wed, 2006-10-04 at 12:53 +0200, Ingo Molnar wrote:
> there's one material difference we just found: in the !hres case we'll
> do the timer IRQ handling mostly from the lapic vector - while in
> mainline we do it from the irq0 vector. So, what does your
> /proc/interrupts look like? How frequently does LOC increase, and how
> frequently does IRQ 0 increase?
>
> (meanwhile we'll fix and restore things so that it matches mainline
> behavior.)
Andrew, does the patch below fix your problem ?
You should see the same weird behaviour when you run a plain -mm3 with
CONFIG_SMP=y on that box. This moves update_process_times() to the lapic
too.
tglx
Index: linux-2.6.18-mm3/arch/i386/kernel/apic.c
===================================================================
--- linux-2.6.18-mm3.orig/arch/i386/kernel/apic.c 2006-10-04 13:02:35.000000000 +0200
+++ linux-2.6.18-mm3/arch/i386/kernel/apic.c 2006-10-04 12:59:06.000000000 +0200
@@ -84,7 +84,9 @@ static void lapic_timer_setup(enum clock
static struct clock_event_device lapic_clockevent = {
.name = "lapic",
.capabilities = CLOCK_CAP_NEXTEVT | CLOCK_CAP_PROFILE
+#ifdef CONFIG_SMP
| CLOCK_CAP_UPDATE,
+#endif
.shift = 32,
.set_mode = lapic_timer_setup,
.set_next_event = lapic_next_event,
On Wed, 04 Oct 2006 13:19:35 +0200
Thomas Gleixner <[email protected]> wrote:
> On Wed, 2006-10-04 at 12:53 +0200, Ingo Molnar wrote:
> > there's one material difference we just found: in the !hres case we'll
> > do the timer IRQ handling mostly from the lapic vector - while in
> > mainline we do it from the irq0 vector. So, what does your
> > /proc/interrupts look like? How frequently does LOC increase, and how
> > frequently does IRQ 0 increase?
sony:/home/akpm> cat /proc/interrupts ; sleep 1 ; cat /proc/interrupts
CPU0
0: 39256 IO-APIC-edge timer
1: 8 IO-APIC-edge i8042
8: 1 IO-APIC-edge rtc
9: 160 IO-APIC-fasteoi acpi
11: 3 IO-APIC-edge sonypi
12: 107 IO-APIC-edge i8042
14: 5 IO-APIC-edge libata
15: 0 IO-APIC-edge libata
16: 1 IO-APIC-fasteoi yenta, uhci_hcd:usb4
17: 246 IO-APIC-fasteoi ohci1394, eth0
18: 5759 IO-APIC-fasteoi libata
19: 3 IO-APIC-fasteoi ipw2200
20: 710 IO-APIC-fasteoi HDA Intel, uhci_hcd:usb3
21: 2 IO-APIC-fasteoi ehci_hcd:usb1
22: 0 IO-APIC-fasteoi uhci_hcd:usb2, uhci_hcd:usb5
NMI: 0
LOC: 3131
ERR: 0
MIS: 0
CPU0
0: 39519 IO-APIC-edge timer
1: 8 IO-APIC-edge i8042
8: 1 IO-APIC-edge rtc
9: 160 IO-APIC-fasteoi acpi
11: 3 IO-APIC-edge sonypi
12: 107 IO-APIC-edge i8042
14: 5 IO-APIC-edge libata
15: 0 IO-APIC-edge libata
16: 1 IO-APIC-fasteoi yenta, uhci_hcd:usb4
17: 248 IO-APIC-fasteoi ohci1394, eth0
18: 5759 IO-APIC-fasteoi libata
19: 3 IO-APIC-fasteoi ipw2200
20: 715 IO-APIC-fasteoi HDA Intel, uhci_hcd:usb3
21: 2 IO-APIC-fasteoi ehci_hcd:usb1
22: 0 IO-APIC-fasteoi uhci_hcd:usb2, uhci_hcd:usb5
NMI: 0
LOC: 3134
ERR: 0
MIS: 0
> > (meanwhile we'll fix and restore things so that it matches mainline
> > behavior.)
>
> Andrew, does the patch below fix your problem ?
>
> You should see the same weird behaviour when you run a plain -mm3 with
> CONFIG_SMP=y on that box. This moves update_process_times() to the lapic
> too.
> tglx
>
>
> Index: linux-2.6.18-mm3/arch/i386/kernel/apic.c
> ===================================================================
> --- linux-2.6.18-mm3.orig/arch/i386/kernel/apic.c 2006-10-04 13:02:35.000000000 +0200
> +++ linux-2.6.18-mm3/arch/i386/kernel/apic.c 2006-10-04 12:59:06.000000000 +0200
> @@ -84,7 +84,9 @@ static void lapic_timer_setup(enum clock
> static struct clock_event_device lapic_clockevent = {
> .name = "lapic",
> .capabilities = CLOCK_CAP_NEXTEVT | CLOCK_CAP_PROFILE
> +#ifdef CONFIG_SMP
> | CLOCK_CAP_UPDATE,
> +#endif
> .shift = 32,
> .set_mode = lapic_timer_setup,
> .set_next_event = lapic_next_event,
that (after a tweak to make it compile) fixes it. What's it do?
On Wed, 2006-10-04 at 09:02 -0700, Andrew Morton wrote:
> On Wed, 04 Oct 2006 13:19:35 +0200
> Thomas Gleixner <[email protected]> wrote:
>
> > On Wed, 2006-10-04 at 12:53 +0200, Ingo Molnar wrote:
> > > there's one material difference we just found: in the !hres case we'll
> > > do the timer IRQ handling mostly from the lapic vector - while in
> > > mainline we do it from the irq0 vector. So, what does your
> > > /proc/interrupts look like? How frequently does LOC increase, and how
> > > frequently does IRQ 0 increase?
>
> sony:/home/akpm> cat /proc/interrupts ; sleep 1 ; cat /proc/interrupts
> CPU0
> 0: 39256 IO-APIC-edge timer
> LOC: 3131
> 0: 39519 IO-APIC-edge timer
> LOC: 3134
delta IRQ == 263
delta LOC == 3
That explains the problem. The lapic frequency seems to be way off. I
have no good idea offhand how to detect such lapic brokenness.
> > static struct clock_event_device lapic_clockevent = {
> > .name = "lapic",
> > .capabilities = CLOCK_CAP_NEXTEVT | CLOCK_CAP_PROFILE
> > +#ifdef CONFIG_SMP
> > | CLOCK_CAP_UPDATE,
> > +#endif
> > .shift = 32,
> > .set_mode = lapic_timer_setup,
> > .set_next_event = lapic_next_event,
>
> that (after a tweak to make it compile) fixes it. What's it do?
It brings update_process_times() back into IRQ0. On systems with a
working lapic it would not matter. SMP moves update_process_times() to
the lapic too; that's why I asked whether an SMP=y kernel has the same
problems on this box.
tglx
* Andrew Morton <[email protected]> wrote:
> > .name = "lapic",
> > .capabilities = CLOCK_CAP_NEXTEVT | CLOCK_CAP_PROFILE
> > +#ifdef CONFIG_SMP
> > | CLOCK_CAP_UPDATE,
> > +#endif
> > .shift = 32,
> > .set_mode = lapic_timer_setup,
> > .set_next_event = lapic_next_event,
>
> that (after a tweak to make it compile) fixes it. [...]
cool!
the vanilla SMP kernel will likely show similar effects on your laptop.
We'll figure out a safe way to detect this quirk, and to work it around
or turn off the lapic timer driver in that case.
(Btw., this bug was collateral damage from cleanups. Many people are
running -rt on laptops and i think we'd have noticed.)
Ingo