2003-08-13 02:31:59

by timothy parkinson

Subject: 2.6.0-test3 "loosing ticks"


hey all,

the 2.6 kernels have been losing time on my box, and i just noticed this
message at the very bottom of dmesg that i think may relate:

Loosing too many ticks!
TSC cannot be used as a timesource. (Are you running with SpeedStep?)
Falling back to a sane timesource.

i'm running test-3 right now, but i've been seeing the same problem back into
the 2.5 kernels. any ideas? it's a dual PIII coppermine, relatively close
to default slackware 9, i'll attach the config in case that helps. thanks in
advance!

tim



2003-08-13 16:55:57

by john stultz

Subject: Re: 2.6.0-test3 "loosing ticks"

On Tue, 2003-08-12 at 18:47, timothy parkinson wrote:
> the 2.6 kernels have been losing time on my box, and i just noticed this
> message at the very bottom of dmesg that i think may relate:
>
> Loosing too many ticks!
> TSC cannot be used as a timesource. (Are you running with SpeedStep?)
> Falling back to a sane timesource.
>
> i'm running test-3 right now, but i've been seeing the same problem back into
> the 2.5 kernels. any ideas? it's a dual PIII coppermine, relatively close
> to default slackware 9, i'll attach the config in case that helps. thanks in
> advance!

Sounds like either your PIT is running slowly or something is
consistently keeping the timer interrupt from being handled. In 2.4 do
you have any time related issues at all? Does the "Loosing too many
ticks!" message correlate to any event on the system (boot, heavy load)?

Also listing system type, motherboard, any sort of funky devices you've
got might be helpful.

thanks
-john


2003-08-14 17:17:28

by Jamie Lokier

Subject: Re: 2.6.0-test3 "loosing ticks"

john stultz wrote:
> Sounds like either your PIT is running slowly or something is
> consistently keeping the timer interrupt from being handled. In 2.4 do
> you have any time related issues at all? Does the "Loosing too many
> ticks!" message correlate to any event on the system (boot, heavy load)?
>
> Also listing system type, motherboard, any sort of funky devices you've
> got might be helpful.

I am seeing something similar on my dual Athlon MP 1800 box.

It is running NTP to synchronise with another machine over the LAN,
but ntpdc reports that it develops a larger and larger offset relative
to the server - ntpd clearly is not managing to regulate the clock.

It does not have this problem with 2.4 - the time synchronises perfectly.

-- Jamie


2003-08-14 17:29:41

by john stultz

Subject: Re: 2.6.0-test3 "loosing ticks"

On Thu, 2003-08-14 at 10:17, Jamie Lokier wrote:
> john stultz wrote:
> > Sounds like either your PIT is running slowly or something is
> > consistently keeping the timer interrupt from being handled. In 2.4 do
> > you have any time related issues at all? Does the "Loosing too many
> > ticks!" message correlate to any event on the system (boot, heavy load)?
> >
> > Also listing system type, motherboard, any sort of funky devices you've
> > got might be helpful.
>
> I am seeing something similar on my dual Athlon MP 1800 box.
>
> It is running NTP to synchronise with another machine over the LAN,
> but ntpdc reports that it develops a larger and larger offset relative
> to the server - ntpd clearly is not managing to regulate the clock.

Approximately at what rate does it skew? Does ntpdate -b <server> set it
properly?

Are you also seeing the "Loosing too many ticks!" message?

thanks
-john


2003-08-14 22:00:07

by Jamie Lokier

Subject: Re: 2.6.0-test3 "loosing ticks"

john stultz wrote:
> > I am seeing something similar on my dual Athlon MP 1800 box.
> >
> > It is running NTP to synchronise with another machine over the LAN,
> > but ntpdc reports that it develops a larger and larger offset relative
> > to the server - ntpd clearly is not managing to regulate the clock.
>
> Approximately at what rate does it skew? Does ntpdate -b <server> set it
> properly?

I'll keep a note. It's not very fast, but enough to reach several
tens of seconds after a day's work - enough to break Make over NFS,
that's why I noticed.
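
For a sense of scale, "several tens of seconds after a day's work" can be turned into a drift rate. The sketch below assumes 30 seconds over 8 hours; both figures are illustrative guesses, not measurements from this box. It compares the result with the 500 ppm frequency error that ntpd is able to correct by slewing.

```python
# Rough drift-rate arithmetic. The 30 s offset and the 8 h interval are
# illustrative guesses, not measurements from this thread.

NTPD_MAX_SLEW_PPM = 500.0  # largest frequency error ntpd can slew away


def drift_ppm(offset_seconds: float, elapsed_seconds: float) -> float:
    """Return the average clock drift in parts per million."""
    return offset_seconds / elapsed_seconds * 1e6


if __name__ == "__main__":
    rate = drift_ppm(offset_seconds=30.0, elapsed_seconds=8 * 3600)
    print(f"drift: {rate:.0f} ppm")
    if rate <= NTPD_MAX_SLEW_PPM:
        print("within the range ntpd can slew away")
    else:
        print("beyond what ntpd can slew away")
```

Under those assumed numbers the drift is about twice what ntpd can absorb, which would be consistent with an offset that keeps growing.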

It might stop showing up now, as I am now running 2.5.75 on the server too :)

> Are you also seeing the "Loosing too many ticks!" message?

No.

-- Jamie

2003-08-15 00:22:36

by Charles Lepple

Subject: PIT, TSC and power management [was: Re: 2.6.0-test3 "loosing ticks"]

john stultz wrote:
> On Thu, 2003-08-14 at 10:17, Jamie Lokier wrote:
>
>>john stultz wrote:
>>
>>>Sounds like either your PIT is running slowly or something is
>>>consistently keeping the timer interrupt from being handled. In 2.4 do
>>>you have any time related issues at all? Does the "Loosing too many
>>>ticks!" message correlate to any event on the system (boot, heavy load)?
>>>
>>>Also listing system type, motherboard, any sort of funky devices you've
>>>got might be helpful.
>>
>>I am seeing something similar on my dual Athlon MP 1800 box.
>>
>>It is running NTP to synchronise with another machine over the LAN,
>>but ntpdc reports that it develops a larger and larger offset relative
>>to the server - ntpd clearly is not managing to regulate the clock.

I also see the time offset problem (Athlon MP 2000+ x2, Tyan S2460 m/b,
2.6.0-test{1,2,3}) but it is most noticeable when I have amd76x_pm
installed (it's not in 2.6.x yet, but a late 2.5.x patch was posted to
LKML a little while back).

amd76x_pm is roughly equivalent to ACPI C2 idling, but since my BIOS
doesn't export any C-state functionality to the kernel ACPI code, I am
stuck with letting amd76x_pm frob the chipset registers. A quick look at
AMD's datasheets does not indicate that a return from C2 should cause
much delay at all-- if I understand the timing requirements correctly,
it would have to sit for more than 1 ms to miss more than one interrupt.
That said, I don't see any missing interrupts indicated in
/proc/interrupts, nor do any such messages appear in the kernel logs.

Brings up another question: does the "try HZ=100" suggestion still apply
for these faster machines? I would think that if HZ=1000 is too fast,
then at least an occasional lost interrupt would be logged.

When using the TSC for time-of-day, I generally have to set tick to
10200 or somewhere thereabouts. ntpd usually gives up after a few hours,
though, so I presume that this value for tick is only good for a certain
combination of processor load and planetary alignment.

I booted with clock=pit to test that, and now I need tick=9963
(according to adjtimex's configuration routine). However, that makes the
clock jump all over the place, with ntpd making step adjustments +/- 2
seconds every 5 minutes.
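
To put the two tick values in perspective, here is a minimal sketch that converts an adjtimex tick setting into a frequency correction. It assumes the nominal tick is 10000 microseconds (100 user-visible ticks per second); the function names are made up for this example.

```python
# Convert adjtimex "tick" values into a frequency correction.
# Assumption: the nominal tick is 10000 microseconds (100 ticks per second).

NOMINAL_TICK_US = 10_000
SECONDS_PER_DAY = 86_400


def tick_to_ppm(tick_us: int) -> float:
    """Correction relative to the nominal tick, in parts per million."""
    return (tick_us - NOMINAL_TICK_US) / NOMINAL_TICK_US * 1e6


def ppm_to_seconds_per_day(ppm: float) -> float:
    """Seconds added (+) or removed (-) per day by the given correction."""
    return ppm * 1e-6 * SECONDS_PER_DAY


if __name__ == "__main__":
    for label, tick in (("TSC", 10_200), ("clock=pit", 9_963)):
        ppm = tick_to_ppm(tick)
        print(f"{label}: tick={tick} -> {ppm:+.0f} ppm, "
              f"{ppm_to_seconds_per_day(ppm):+.1f} s/day")
```

Both corrections are far larger than the 500 ppm that ntpd can handle on its own, so ntpd giving up or stepping repeatedly is not surprising if the underlying rate keeps moving.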

> Approximately at what rate does it skew?

Well, it's not constant, and I don't trust the tick values given above,
since they don't seem to hold true for long.

> Does ntpdate -b <server> set it properly?

I'm confused. Are there cases where a step time adjustment would fail?
Is there a possibility that the kernel is rejecting ntpd's step
adjustments? (I presume that these use the same as 'ntpdate -b';
specifically, the time is not slewed.)

> Are you also seeing the "Loosing too many ticks!" message?

Never seen it.

Other miscellaneous info:

dmesg:
> Enabling APIC mode: Flat. Using 1 I/O APICs
...
> CPU: CLK_CTL MSR was 6003d22f. Reprogramming to 2003d22f

(does this have anything to do with the TSC?)

> Using local APIC timer interrupts.
> calibrating APIC timer ...
> ..... CPU clock speed is 1666.0503 MHz.
> ..... host bus clock speed is 266.0640 MHz.
> checking TSC synchronization across 2 CPUs: passed.

(note this still appears when using clock=pit)

lspci:

00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P]
System Controller (rev 11)
00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P]
AGP Bridge
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-766 [ViperPlus] ISA
(rev 02)
00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-766 [ViperPlus]
IDE (rev 01)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-766 [ViperPlus] ACPI
(rev 01)

CPU-selection portions of .config:

CONFIG_MK7=y
[...]
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_USE_3DNOW=y
CONFIG_SMP=y
CONFIG_NR_CPUS=2
CONFIG_PREEMPT=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_TSC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_NONFATAL=y

(rest available on request)

I am open to suggestions for testing.

Also, how much has the kernel changed with respect to the PLL used by ntpd?

thanks,

--
Charles Lepple <ghz.cc!clepple>

2003-08-15 12:14:55

by Jamie Lokier

Subject: Re: PIT, TSC and power management [was: Re: 2.6.0-test3 "loosing ticks"]

Charles Lepple wrote:
> amd76x_pm is roughly equivalent to ACPI C2 idling, but since my BIOS
> doesn't export any C-state functionality to the kernel ACPI code, I am
> stuck with letting amd76x_pm frob the chipset registers.

Same here.

-- Jamie

2003-08-15 17:50:56

by john stultz

Subject: Re: PIT, TSC and power management [was: Re: 2.6.0-test3 "loosing ticks"]

On Thu, 2003-08-14 at 17:19, Charles Lepple wrote:
> I also see the time offset problem (Athlon MP 2000+ x2, Tyan S2460 m/b,
> 2.6.0-test{1,2,3}) but it is most noticeable when I have amd76x_pm
> installed (it's not in 2.6.x yet, but a late 2.5.x patch was posted to
> LKML a little while back).
>
> amd76x_pm is roughly equivalent to ACPI C2 idling, but since my BIOS
> doesn't export any C-state functionality to the kernel ACPI code, I am
> stuck with letting amd76x_pm frob the chipset registers. A quick look at
> AMD's datasheets does not indicate that a return from C2 should cause
> much delay at all-- if I understand the timing requirements correctly,
> it would have to sit for more than 1 ms to miss more than one interrupt.
> That said, I don't see any missing interrupts indicated in
> /proc/interrupts, nor do any such messages appear in the kernel logs.

In this case you're throttling the cpu frequency. This affects the
frequency the TSC updates, which makes it very hard to use the TSC as a
timesource (the cpu_freq notifier tries to compensate by changing the
tsc multiplier but my systems don't have cpu_freq drivers, so I've not
seen it work).
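
A toy model of why this hurts, as a sketch only: elapsed time is derived by dividing counted cycles by the frequency calibrated at boot, so any period in which the counter runs slower or not at all shows up as missing time. The assumption that the TSC stops while the CPU clock is stopped is for illustration, as are the function names; the calibrated frequency is the one from the dmesg output quoted earlier in the thread.

```python
# Toy model of TSC-based timekeeping. The premise that the TSC does not
# advance while the CPU clock is stopped is an assumption for illustration.

CALIBRATED_HZ = 1_666_050_300  # "CPU clock speed is 1666.0503 MHz" from dmesg


def cycles_counted(wall_seconds: float, stopped_fraction: float) -> int:
    """Cycles the TSC accumulates if it is stopped for part of the interval."""
    return int(wall_seconds * (1.0 - stopped_fraction) * CALIBRATED_HZ)


def apparent_seconds(cycles: int) -> float:
    """Elapsed time inferred from the cycle count and boot-time calibration."""
    return cycles / CALIBRATED_HZ


if __name__ == "__main__":
    wall = 3600.0
    for stopped in (0.0, 0.02, 0.30):
        seen = apparent_seconds(cycles_counted(wall, stopped))
        print(f"stopped {stopped:.0%}: clock advances {seen:.1f} s "
              f"in {wall:.0f} s of real time")
```

The real code is less direct than this, since the TSC is used to interpolate between timer ticks and to detect lost ticks rather than as the only counter, but the dependence on a stable cycle rate is the same.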

> Brings up another question: does the "try HZ=100" suggestion still apply
> for these faster machines? I would think that if HZ=1000 is too fast,
> then at least an occasional lost interrupt would be logged.

If you're losing interrupts and the lost-tick detection code is not
compensating, shifting back to HZ=100 just tries to minimize the
problem.
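
As a rough sketch of why a lower HZ helps, assume that one timer interrupt can stay pending while interrupts are held off and that every further tick in that window is lost. That model, and the helper names, are simplifications made up for this example; the exact count also depends on where in the tick period the blocking starts.

```python
# Simplified lost-tick model: one timer interrupt can remain pending while
# interrupts are blocked; any further ticks in that window are dropped.
# This is an illustrative assumption, not the kernel's actual accounting.


def tick_period_us(hz: int) -> int:
    """Length of one timer tick in microseconds."""
    return 1_000_000 // hz


def ticks_lost(blocked_us: int, hz: int) -> int:
    """Ticks dropped when timer interrupts are held off for blocked_us."""
    fired = blocked_us // tick_period_us(hz)
    return max(0, fired - 1)


def time_lost_us(blocked_us: int, hz: int) -> int:
    """Wall-clock time that goes unaccounted if lost ticks are not detected."""
    return ticks_lost(blocked_us, hz) * tick_period_us(hz)


if __name__ == "__main__":
    for blocked in (900, 2_500, 25_000):
        for hz in (1000, 100):
            print(f"blocked {blocked} us at HZ={hz}: "
                  f"{ticks_lost(blocked, hz)} ticks lost, "
                  f"{time_lost_us(blocked, hz)} us unaccounted")
```

In this model a 2.5 ms stall drops a tick at HZ=1000 and none at HZ=100, so the lower rate hides short stalls without fixing whatever causes them.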


> When using the TSC for time-of-day, I generally have to set tick to
> 10200 or somewhere thereabouts. ntpd usually gives up after a few hours,
> though, so I presume that this value for tick is only good for a certain
> combination of processor load and planetary alignment.
>
> I booted with clock=pit to test that, and now I need tick=9963
> (according to adjtimex's configuration routine). However, that makes the
> clock jump all over the place, with ntpd making step adjustments +/- 2
> seconds every 5 minutes.
>
> > Approximately at what rate does it skew?
>
> Well, it's not constant, and I don't trust the tick values given above,
> since they don't seem to hold true for long.


Do these problems still show when you're not using the amd76x_pm?


> > Does ntpdate -b <server> set it properly?
>
> I'm confused. Are there cases where a step time adjustment would fail?
> Is there a possibility that the kernel is rejecting ntpd's step
> adjustments? (I presume that these use the same as 'ntpdate -b';
> specifically, the time is not slewed.)

Well, depending on how ntp is compiled, it could use stime, rather than
settimeofday. This causes ntp to set the time on average .5 seconds off
the desired time. Since .5 is outside the .128 sec slew boundary, ntp
will do another step adjustment which has the same poor accuracy. This
results in ntp just hopping back and forth around the desired time.
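
The arithmetic behind that can be sketched as follows. stime() takes a time in whole seconds, so the sub-second part of the target is dropped; the sketch assumes truncation and compares the leftover error with the 0.128 s boundary. The function names and the sample target are invented for the example.

```python
# Residual error when a clock is stepped with whole-second resolution.
# Assumption: the fractional part of the target time is simply truncated.

import math

STEP_THRESHOLD_S = 0.128  # beyond this offset ntp steps instead of slewing


def set_with_whole_seconds(target: float) -> float:
    """Time actually set when only whole seconds can be passed."""
    return float(math.floor(target))


def residual_error(target: float) -> float:
    """Offset remaining right after the step."""
    return target - set_with_whole_seconds(target)


if __name__ == "__main__":
    # Sweep the sub-second part of the target evenly across one second.
    errors = [residual_error(10.0 + i / 1000) for i in range(1000)]
    mean = sum(errors) / len(errors)
    outside = sum(e > STEP_THRESHOLD_S for e in errors) / len(errors)
    print(f"mean residual error: {mean:.3f} s")
    print(f"share of steps still outside the slew boundary: {outside:.0%}")
```

With the error spread evenly over a second, most steps leave the clock outside the boundary, so the next correction is another step.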

thanks
-john


2003-08-15 18:56:25

by Charles Lepple

Subject: Re: PIT, TSC and power management [was: Re: 2.6.0-test3 "loosing ticks"]

john stultz said:
> On Thu, 2003-08-14 at 17:19, Charles Lepple wrote:
>> I also see the time offset problem (Athlon MP 2000+ x2, Tyan S2460 m/b,
>> 2.6.0-test{1,2,3}) but it is most noticeable when I have amd76x_pm
>> installed (it's not in 2.6.x yet, but a late 2.5.x patch was posted to
>> LKML a little while back).
>>
>> amd76x_pm is roughly equivalent to ACPI C2 idling, but since my BIOS
>> doesn't export any C-state functionality to the kernel ACPI code, I am
>> stuck with letting amd76x_pm frob the chipset registers. A quick look at
>> AMD's datasheets does not indicate that a return from C2 should cause
>> much delay at all-- if I understand the timing requirements correctly,
>> it would have to sit for more than 1 ms to miss more than one interrupt.
>> That said, I don't see any missing interrupts indicated in
>> /proc/interrupts, nor do any such messages appear in the kernel logs.
>
> In this case you're throttling the cpu frequency. This affects the
> frequency the TSC updates, which makes it very hard to use the TSC as a
> timesource (the cpu_freq notifier tries to compensate by changing the
> tsc multiplier but my systems don't have cpu_freq drivers, so I've not
> seen it work).

I'm not familiar with the cpu_freq code, or how true ACPI throttling is
implemented, but it sounds like the amd76x_pm driver is doing something a
little different than throttling. I tried the regular ACPI code on an IBM
desktop, and its throttling support appears to offer several distinct
throttling percentages, which would seem to be much easier to compensate
for. The amd76x_pm idle routine simply sets a bit in one of the bridge
chips, but I think that it turns the clocks back on at some indeterminate
time in the future (probably triggered by an interrupt) which would be
hard to measure if clocks are stopped.

>> Brings up another question: does the "try HZ=100" suggestion still apply
>> for these faster machines? I would think that if HZ=1000 is too fast,
>> then at least an occasional lost interrupt would be logged.
>
> If you're losing interrupts and the lost-tick detection code is not
> compensating, shifting back to HZ=100 just tries to minimize the
> problem.

OK. Well, I'm optimistic, so I'll try that and see if that pulls the error
down to a point where ntpd can manage to control the clock.

>> When using the TSC for time-of-day, I generally have to set tick to
>> 10200 or somewhere thereabouts. ntpd usually gives up after a few hours,
>> though, so I presume that this value for tick is only good for a certain
>> combination of processor load and planetary alignment.
>>
>> I booted with clock=pit to test that, and now I need tick=9963
>> (according to adjtimex's configuration routine). However, that makes the
>> clock jump all over the place, with ntpd making step adjustments +/- 2
>> seconds every 5 minutes.
>>
>> > Approximately at what rate does it skew?
>>
>> Well, it's not constant, and I don't trust the tick values given above,
>> since they don't seem to hold true for long.
>
> Do these problems still show when you're not using the amd76x_pm?

I'll have to try again next week-- I think the last time that I tried to
remove amd76x_pm, I didn't reset the ntp adjustments (drift, plus the
adjtimex variables) so it was fighting the old power-management values.

>> > Does ntpdate -b <server> set it properly?
>>
>> I'm confused. Are there cases where a step time adjustment would fail?
>> Is there a possibility that the kernel is rejecting ntpd's step
>> adjustments? (I presume that these use the same as 'ntpdate -b';
>> specifically, the time is not slewed.)
>
> Well, depending on how ntp is compiled, it could use stime, rather than
> settimeofday. This causes ntp to set the time on average .5 seconds off
> the desired time. Since .5 is outside the .128 sec slew boundary, ntp
> will do another step adjustment which has the same poor accuracy. This
> results in ntp just hopping back and forth around the desired time.

I'll have to check with strace (I'm using ntpd from Debian sid).

thanks,

--
Charles Lepple <[email protected]>
http://www.ghz.cc/charles/

2003-08-15 23:13:03

by Jamie Lokier

Subject: Re: PIT, TSC and power management [was: Re: 2.6.0-test3 "loosing ticks"]

john stultz wrote:
> Well, depending on how ntp is compiled, it could use stime, rather than
> settimeofday. This causes ntp to set the time on average .5 seconds off
> the desired time. Since .5 is outside the .128 sec slew boundary, ntp
> will do another step adjustment which has the same poor accuracy. This
> results in ntp just hopping back and forth around the desired time.

On my more-or-less Red Hat 9 system, it would be quite surprising if
the ntpd which works with 2.4 suddenly stopped working...

Though it would be less of a surprise if ntpd had always had this
problem in this box, and I just didn't notice with 2.4.

-- Jamie

2003-08-15 23:27:12

by john stultz

Subject: Re: PIT, TSC and power management [was: Re: 2.6.0-test3 "loosing ticks"]

On Fri, 2003-08-15 at 16:12, Jamie Lokier wrote:
> john stultz wrote:
> > Well, depending on how ntp is compiled, it could use stime, rather than
> > settimeofday. This causes ntp to set the time on average .5 seconds off
> > the desired time. Since .5 is outside the .128 sec slew boundary, ntp
> > will do another step adjustment which has the same poor accuracy. This
> > results in ntp just hopping back and forth around the desired time.
>
> On my more-or-less Red Hat 9 system, it would be quite surprising if
> the ntpd which works with 2.4 suddenly stopped working...

Yea, I don't think this is the issue. RH9 doesn't have this problem. I
was just explaining why I asked if ntpdate -b <server> set the time
properly on his box.

Really I think the amd76x_pm module is the cause, as it seems to change the
cpu frequency and I suspect it doesn't use the cpu_freq notifiers.
I'd be quite interested to see if the issue still appears when you're
not running that module.

thanks
-john