2005-12-21 15:38:59

by Gene Heskett

[permalink] [raw]
Subject: Another casualty of -rc6

Greetings;

I just noted my toolbar clock was off by about 45 minutes, with
an uptime on -rc6 or 27hrs & change.

A service ntpd restart resets it, but thats temporary.
Apic is on, and a cat of /proc/interrupts returns this:
CPU0
0: 96565780 IO-APIC-edge timer
1: 92669 IO-APIC-edge i8042
3: 87 IO-APIC-edge serial
4: 537421 IO-APIC-edge serial
8: 53487 IO-APIC-edge rtc
9: 0 IO-APIC-level acpi
14: 2362430 IO-APIC-edge ide0
15: 238356 IO-APIC-edge ide1
16: 12702364 IO-APIC-level ohci_hcd:usb3, eth0
17: 42635 IO-APIC-level ehci_hcd:usb1, NVidia nForce2
18: 22 IO-APIC-level ohci_hcd:usb2
19: 6338513 IO-APIC-level ohci1394, radeon@pci:0000:02:00.0
20: 0 IO-APIC-level cx88[0], cx88[0]
NMI: 0
LOC: 93879684
ERR: 0
MIS: 0

But when I looked at the log, it was synched to LOCAL. I have a script
to enhance the logging, runs at 30 minute intervals, and it was logging
this:
---------------
remote refid st t when poll reach delay offset jitter
==============================================================================
ntp2.usv.ro 80.96.120.253 2 u 59 64 377 180.742 -260483 5112.35
210.118.170.59 211.115.194.21 3 u 24 64 377 248.620 -260423 6261.15
ns1.dns.pciwest 204.123.2.5 2 u 48 64 377 126.289 -261067 4127.39
*LOCAL(0) LOCAL(0) 10 l 57 64 377 0.000 0.000 0.001
drift=3.720
remote refid st t when poll reach delay offset jitter
==============================================================================
ntp2.usv.ro 80.96.120.253 2 u 52 64 377 183.230 -264401 6453.20
210.118.170.59 211.115.194.21 3 u 18 64 377 249.206 -265488 6445.08
ns1.dns.pciwest 204.123.2.5 2 u 42 64 377 126.446 -264423 6491.55
*LOCAL(0) LOCAL(0) 10 l 56 64 377 0.000 0.000 0.001
drift=3.720
-------------
That drift is slightly lower than normal, its usually in the 4.x's.

This is with the newest bios update, which fixed this problem for -rc5.
Looks like reboot time, to -rc5. :(

--

Cheers, Gene
People having trouble with vz bouncing email to me should add the word
'online' between the 'verizon', and the dot which bypasses vz's
stupid bounce rules. I do use spamassassin too. :-)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2005 by Maurice Eugene Heskett, all rights reserved.


2005-12-21 15:50:24

by Russell King

[permalink] [raw]
Subject: Re: Another casualty of -rc6

On Wed, Dec 21, 2005 at 10:39:25AM -0500, Gene Heskett wrote:
> remote refid st t when poll reach delay offset jitter
> ==============================================================================
> ntp2.usv.ro 80.96.120.253 2 u 59 64 377 180.742 -260483 5112.35
> 210.118.170.59 211.115.194.21 3 u 24 64 377 248.620 -260423 6261.15
> ns1.dns.pciwest 204.123.2.5 2 u 48 64 377 126.289 -261067 4127.39
> *LOCAL(0) LOCAL(0) 10 l 57 64 377 0.000 0.000 0.001
> drift=3.720

You have:

server 127.127.1.0 # local clock
fudge 127.127.1.0 stratum 10

in your ntp.conf, right? I bet if you comment out those two lines ntp
will start behaving.

(I had to comment them out long ago, and I have been lead to believe that
it is wrong to include them in a configuration which is only supposed to
synchronnise to external time sources.)

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-12-21 20:43:54

by Gene Heskett

[permalink] [raw]
Subject: Re: Another casualty of -rc6

On Wednesday 21 December 2005 10:50, Russell King wrote:
>On Wed, Dec 21, 2005 at 10:39:25AM -0500, Gene Heskett wrote:
>> remote refid st t when poll reach delay
>> offset jitter
>> ===================================================================
>>=========== ntp2.usv.ro 80.96.120.253 2 u 59 64 377
>> 180.742 -260483 5112.35 210.118.170.59 211.115.194.21 3 u 24
>> 64 377 248.620 -260423 6261.15 ns1.dns.pciwest 204.123.2.5
>> 2 u 48 64 377 126.289 -261067 4127.39 *LOCAL(0)
>> LOCAL(0) 10 l 57 64 377 0.000 0.000 0.001
>> drift=3.720
>
>You have:
>
>server 127.127.1.0 # local clock
>fudge 127.127.1.0 stratum 10
>
>in your ntp.conf, right? I bet if you comment out those two lines
> ntp will start behaving.
>
They are, or were, in there, and its running just fine under -rc5,
its -rc6 that blew up and got off by 45 minutes in 27 hours of uptime.
Thats way an gone many times my normal clock drift here Russell.
That (the drift file) stays around 4 ppm, as shown here, after a
reboot to 2.6.15-rc5:
----------------
==============================================================================
+152.Red-80-33-1 150.214.94.5 2 u 437 1024 377 205.716 7.214 0.178
+mx1.gs.washingt 140.142.1.8 3 u 477 1024 377 115.473 -0.795 0.475
*ntp3.usv.ro 192.53.103.103 2 u 491 1024 377 181.308 2.771 1.007
LOCAL(0) LOCAL(0) 10 l 25 64 377 0.000 0.000 0.001
drift=3.712
remote refid st t when poll reach delay offset jitter
==============================================================================
+152.Red-80-33-1 150.214.94.5 2 u 183 1024 377 206.155 6.608 0.438
+mx1.gs.washingt 140.142.1.8 3 u 225 1024 377 116.734 -0.497 0.091
*ntp3.usv.ro 192.53.103.103 2 u 238 1024 377 179.751 2.590 1.465
LOCAL(0) LOCAL(0) 10 l 18 64 377 0.000 0.000 0.001
drift=3.712

>(I had to comment them out long ago, and I have been lead to believe
> that it is wrong to include them in a configuration which is only
> supposed to synchronnise to external time sources.)

it out now, and ntpd restarted. We'll see if it (ntpq -p) even bothers to
report LOCAL now. Nope. But the status is presently 0041, whatever that
indicates, from the ntp.log:
21 Dec 15:22:47 ntpd[9198]: logging to file /var/log/ntp.log
21 Dec 15:22:47 ntpd[9198]: ntpd [email protected] Fri Aug 26 04:27:20 EDT 2005 (1)
21 Dec 15:22:47 ntpd[9198]: precision = 1.000 usec
21 Dec 15:22:47 ntpd[9198]: Listening on interface wildcard, 0.0.0.0#123
21 Dec 15:22:47 ntpd[9198]: Listening on interface lo, 127.0.0.1#123
21 Dec 15:22:47 ntpd[9198]: Listening on interface eth0, 192.168.71.3#123
21 Dec 15:22:47 ntpd[9198]: kernel time sync status 0040
21 Dec 15:22:47 ntpd[9198]: frequency initialized 3.712 PPM from /var/lib/ntp/drift
21 Dec 15:27:06 ntpd[9198]: synchronized to 128.6.213.15, stratum 3
21 Dec 15:27:06 ntpd[9198]: kernel time sync disabled 0041
21 Dec 15:31:21 ntpd[9198]: synchronized to 192.36.143.151, stratum 1
21 Dec 15:32:29 ntpd[9198]: kernel time sync enabled 0001

Neither of those are in an ntpq -p report?

When I had -rc6 running it was spitting out some sort
of code at about 1 second intervals to the vt's:

Dec 21 10:29:23 coyote last message repeated 12 times
Dec 21 10:29:29 coyote kernel: set_rtc_mmss: can't update from 14 to 59

That was from /var/log/messages while it was running the small code
option kernel. Without that option set, it throws the same message
out on all vt's at ~1 second intervals.

So I think this one might be a legit problem. I haven't tested
it without the apic enabled in the bios though. Recall that this
didn't work with apic on at all in the previous bios flash. No
messages, but no work either. Nforce2 chipset FWTW.

--
Cheers, Gene
People having trouble with vz bouncing email to me should add the word
'online' between the 'verizon', and the dot which bypasses vz's
stupid bounce rules. I do use spamassassin too. :-)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2005 by Maurice Eugene Heskett, all rights reserved.

2005-12-21 21:19:21

by Russell King

[permalink] [raw]
Subject: Re: Another casualty of -rc6

On Wed, Dec 21, 2005 at 03:43:51PM -0500, Gene Heskett wrote:
> On Wednesday 21 December 2005 10:50, Russell King wrote:
> >in your ntp.conf, right? I bet if you comment out those two lines
> > ntp will start behaving.
> >
> They are, or were, in there, and its running just fine under -rc5,
> its -rc6 that blew up and got off by 45 minutes in 27 hours of uptime.

An early 2.6 kernel "blew up" in precisely the same manner as you're
describing here for me back in the early 2.6 days.

As I explained in my first reply, I was led to believe that it is
wrong to have the local clock in the configuration. Since then
I've been running ntp without the local clock line, and it's been
fine since.

I'm not saying that this is your problem, but I suspect that this
may not be helping the situation - especially since it appears that
ntpd has ruled out the other peers as possible synchronisation
sources.

> it out now, and ntpd restarted. We'll see if it (ntpq -p) even bothers to
> report LOCAL now. Nope.

It won't do. ntpq -p reports the _peers_ known to the server.
Obviously, if you remove the local peer, it won't be shown in that
output anymore. Moreover, think whether it is correct to setup a
peering between your local clock and your local clock.

> But the status is presently 0041, whatever that
> indicates, from the ntp.log:

0040, 0041 and 0001 are the values of the timex.status field. See
include/linux/timex.h for details. The relevant ones are:

#define STA_PLL 0x0001 /* enable PLL updates (rw) */
#define STA_UNSYNC 0x0040 /* clock unsynchronized (rw) */

> 21 Dec 15:22:47 ntpd[9198]: logging to file /var/log/ntp.log
> 21 Dec 15:22:47 ntpd[9198]: ntpd [email protected] Fri Aug 26 04:27:20 EDT 2005 (1)
> 21 Dec 15:22:47 ntpd[9198]: precision = 1.000 usec
> 21 Dec 15:22:47 ntpd[9198]: Listening on interface wildcard, 0.0.0.0#123
> 21 Dec 15:22:47 ntpd[9198]: Listening on interface lo, 127.0.0.1#123
> 21 Dec 15:22:47 ntpd[9198]: Listening on interface eth0, 192.168.71.3#123
> 21 Dec 15:22:47 ntpd[9198]: kernel time sync status 0040
> 21 Dec 15:22:47 ntpd[9198]: frequency initialized 3.712 PPM from /var/lib/ntp/drift
> 21 Dec 15:27:06 ntpd[9198]: synchronized to 128.6.213.15, stratum 3
> 21 Dec 15:27:06 ntpd[9198]: kernel time sync disabled 0041
> 21 Dec 15:31:21 ntpd[9198]: synchronized to 192.36.143.151, stratum 1
> 21 Dec 15:32:29 ntpd[9198]: kernel time sync enabled 0001

So the kernel's timekeeping transitioned from unsynchronised to
PLL mode, to PLL + synchronised. Great, ntpd has adjusted the
kernels timekeeping in an attempt to keep it synchronised.

> Neither of those are in an ntpq -p report?

Are you using the pooled servers? They resolve to _many_ addresses.

> When I had -rc6 running it was spitting out some sort
> of code at about 1 second intervals to the vt's:
>
> Dec 21 10:29:23 coyote last message repeated 12 times
> Dec 21 10:29:29 coyote kernel: set_rtc_mmss: can't update from 14 to 59

This is a different problem. This triggers:

if (abs(real_minutes - cmos_minutes) < 30) {
...
} else {
printk(KERN_WARNING
"set_rtc_mmss: can't update from %d to %d\n",
cmos_minutes, real_minutes);
retval = -1;
}

so the time is more than 30 minutes out. What happens if you ensure
that the correct time is set, then start ntpd with this configuration?

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-12-21 21:33:00

by Kyle Moffett

[permalink] [raw]
Subject: Re: Another casualty of -rc6

On Dec 21, 2005, at 16:19, Russell King wrote:
> As I explained in my first reply, I was led to believe that it is
> wrong to have the local clock in the configuration. Since then
> I've been running ntp without the local clock line, and it's been
> fine since.
>
> I'm not saying that this is your problem, but I suspect that this
> may not be helping the situation - especially since it appears that
> ntpd has ruled out the other peers as possible synchronisation
> sources.

ntpd _only_ considers peers of larger strata numbers after it has
considered all smaller strata peers to be erroneous using various
checks (such as too-high slew rate).

>> it out now, and ntpd restarted. We'll see if it (ntpq -p) even
>> bothers to report LOCAL now. Nope.
>
> It won't do. ntpq -p reports the _peers_ known to the server.
> Obviously, if you remove the local peer, it won't be shown in that
> output anymore. Moreover, think whether it is correct to setup a
> peering between your local clock and your local clock.

That's not why that line goes in there. As documented in the Debian
ntpd file, the extra local line is there for when you use ntpd as a
server for _other_ systems. Essentially, if you point all other
systems at that server, it will temporarily serve out its local time
as a strata 10 clock with that line. Without, it will refuse to
serve out _any_ time. The result is that people include that line so
that even during network failure, all their local clocks remain
synchronized to each other, even if not to an atomic clock.

>> 21 Dec 15:22:47 ntpd[9198]: logging to file /var/log/ntp.log
>> 21 Dec 15:22:47 ntpd[9198]: ntpd [email protected] Fri Aug 26
>> 04:27:20 EDT 2005 (1)
>> 21 Dec 15:22:47 ntpd[9198]: precision = 1.000 usec
>> 21 Dec 15:22:47 ntpd[9198]: Listening on interface wildcard,
>> 0.0.0.0#123
>> 21 Dec 15:22:47 ntpd[9198]: Listening on interface lo, 127.0.0.1#123
>> 21 Dec 15:22:47 ntpd[9198]: Listening on interface eth0,
>> 192.168.71.3#123
>> 21 Dec 15:22:47 ntpd[9198]: kernel time sync status 0040
>> 21 Dec 15:22:47 ntpd[9198]: frequency initialized 3.712 PPM from /
>> var/lib/ntp/drift
>> 21 Dec 15:27:06 ntpd[9198]: synchronized to 128.6.213.15, stratum 3
>> 21 Dec 15:27:06 ntpd[9198]: kernel time sync disabled 0041
>> 21 Dec 15:31:21 ntpd[9198]: synchronized to 192.36.143.151, stratum 1
>> 21 Dec 15:32:29 ntpd[9198]: kernel time sync enabled 0001
>
> So the kernel's timekeeping transitioned from unsynchronised to PLL
> mode, to PLL + synchronised. Great, ntpd has adjusted the kernels
> timekeeping in an attempt to keep it synchronised.

If the local clock slews too much, then ntpd basically gives up
(because it assumes that the server must be broken :-/). When it
does so, it flags the kernels clock as unsynched.


Cheers,
Kyle Moffett

--
When you go into court you either want a very, very, very bright line
or you want the stomach to outlast the other guy in trench warfare.
If both sides are reasonable, you try to stay _out_ of court in the
first place.
-- Rob Landley



2005-12-21 21:53:56

by Gene Heskett

[permalink] [raw]
Subject: Re: Another casualty of -rc6

On Wednesday 21 December 2005 16:19, Russell King wrote:
>On Wed, Dec 21, 2005 at 03:43:51PM -0500, Gene Heskett wrote:
>> On Wednesday 21 December 2005 10:50, Russell King wrote:
>> >in your ntp.conf, right? I bet if you comment out those two lines
>> > ntp will start behaving.
>>
>> They are, or were, in there, and its running just fine under -rc5,
>> its -rc6 that blew up and got off by 45 minutes in 27 hours of
>> uptime.
>
>An early 2.6 kernel "blew up" in precisely the same manner as you're
>describing here for me back in the early 2.6 days.
>
>As I explained in my first reply, I was led to believe that it is
>wrong to have the local clock in the configuration. Since then
>I've been running ntp without the local clock line, and it's been
>fine since.
>
>I'm not saying that this is your problem, but I suspect that this
>may not be helping the situation - especially since it appears that
>ntpd has ruled out the other peers as possible synchronisation
>sources.
>
>> it out now, and ntpd restarted. We'll see if it (ntpq -p) even
>> bothers to report LOCAL now. Nope.
>
>It won't do. ntpq -p reports the _peers_ known to the server.
>Obviously, if you remove the local peer, it won't be shown in that
>output anymore. Moreover, think whether it is correct to setup a
>peering between your local clock and your local clock.
>
>> But the status is presently 0041, whatever that
>> indicates, from the ntp.log:
>
>0040, 0041 and 0001 are the values of the timex.status field. See
>include/linux/timex.h for details. The relevant ones are:
>
>#define STA_PLL 0x0001 /* enable PLL updates (rw) */
>#define STA_UNSYNC 0x0040 /* clock unsynchronized (rw) */
>
>> 21 Dec 15:22:47 ntpd[9198]: logging to file /var/log/ntp.log
>> 21 Dec 15:22:47 ntpd[9198]: ntpd [email protected] Fri Aug 26
>> 04:27:20 EDT 2005 (1) 21 Dec 15:22:47 ntpd[9198]: precision = 1.000
>> usec
>> 21 Dec 15:22:47 ntpd[9198]: Listening on interface wildcard,
>> 0.0.0.0#123 21 Dec 15:22:47 ntpd[9198]: Listening on interface lo,
>> 127.0.0.1#123 21 Dec 15:22:47 ntpd[9198]: Listening on interface
>> eth0, 192.168.71.3#123 21 Dec 15:22:47 ntpd[9198]: kernel time sync
>> status 0040
>> 21 Dec 15:22:47 ntpd[9198]: frequency initialized 3.712 PPM from
>> /var/lib/ntp/drift 21 Dec 15:27:06 ntpd[9198]: synchronized to
>> 128.6.213.15, stratum 3 21 Dec 15:27:06 ntpd[9198]: kernel time
>> sync disabled 0041 21 Dec 15:31:21 ntpd[9198]: synchronized to
>> 192.36.143.151, stratum 1 21 Dec 15:32:29 ntpd[9198]: kernel time
>> sync enabled 0001
>
>So the kernel's timekeeping transitioned from unsynchronised to
>PLL mode, to PLL + synchronised. Great, ntpd has adjusted the
>kernels timekeeping in an attempt to keep it synchronised.
>
>> Neither of those are in an ntpq -p report?
>
>Are you using the pooled servers? They resolve to _many_ addresses.

I have several instances of pool.ntp.org scattered thru the timeservers list I use, so there is a fairly good chance it will catch at least one of them in the randomization choice routine of /etc/init.d/ntpd.

>> When I had -rc6 running it was spitting out some sort
>> of code at about 1 second intervals to the vt's:
>>
>> Dec 21 10:29:23 coyote last message repeated 12 times
>> Dec 21 10:29:29 coyote kernel: set_rtc_mmss: can't update from 14
>> to 59
>
>This is a different problem. This triggers:
>
> if (abs(real_minutes - cmos_minutes) < 30) {
>...
> } else {
> printk(KERN_WARNING
> "set_rtc_mmss: can't update from %d to %d\n",
> cmos_minutes, real_minutes);
> retval = -1;
> }
>
>so the time is more than 30 minutes out. What happens if you ensure
>that the correct time is set, then start ntpd with this
> configuration?

Its still there if I reboot to 2.6.15-rc6. And the initial
startup of ntpd does do a crash set to the correct time using
ntpdate, so it assureadly did start out in reasonable synch.
FWIW, I keep the hdwr clock on Grenwich time.

Let me go back into the log and see how long it had been booted
before those messages started.

During the reboot at about 6pm on the 19th, this is seen:

Dec 19 18:07:03 coyote kernel: kjournald starting. Commit interval 5 seconds
Dec 19 17:56:14 coyote ntpdate[1963]: step time server 128.208.109.7 offset -648.721376 sec

So it backed up 8 minutes there. Then a few seconds later:
Dec 19 17:56:35 coyote kernel: usbcore: registered new driver pl2303
Dec 19 17:56:35 coyote kernel: drivers/usb/serial/pl2303.c: Prolific PL2303 USB to serial adaptor driver
Dec 19 17:56:35 coyote kernel: set_rtc_mmss: can't update from 7 to 56

Then as rc.local was being executed:
Dec 19 17:56:39 coyote kernel: DVB: registering frontend 0 (Oren OR51132 VSB/QAM Frontend)...
Dec 19 17:56:40 coyote kernel: set_rtc_mmss: can't update from 7 to 56
Dec 19 17:56:41 coyote heyu_relay: relay setting up-
Dec 19 17:56:41 coyote kernel: set_rtc_mmss: can't update from 7 to 56
Dec 19 17:56:49 coyote last message repeated 4 times
Dec 19 17:56:52 coyote rc: Starting webmin: succeeded
Dec 19 17:56:53 coyote kernel: set_rtc_mmss: can't update from 7 to 56
Dec 19 17:56:56 coyote last message repeated 3 times

And it kept up till I rebooted to rc5 a short time later.
And I'd rebuilt -rc6 and tried it again, and it was about 10:25
this morning when I noted the clock was way fast and did a
service ntpd restart, which logged this:
The first line is a cron job that shuts off the nighttime
outside motion detectors. The system was quiet then until I found
the clock fubared. So the next line is my restart of ntpd.

Dec 21 06:30:02 coyote heyu_relay: relay setting up-
Dec 21 11:06:49 coyote ntpd: ntpd shutdown succeeded
Dec 21 10:22:14 coyote ntpdate[6357]: step time server 80.163.145.206 offset -2679.594276 sec
Dec 21 10:22:14 coyote ntpd: succeeded
Dec 21 10:22:14 coyote ntpd: logging to file /var/log/ntp.log
Dec 21 10:22:14 coyote ntpd: ntpd startup succeeded
Dec 21 10:22:16 coyote kernel: set_rtc_mmss: can't update from 6 to 52
Dec 21 10:22:18 coyote last message repeated 2 times
Dec 21 10:22:23 coyote kernel: set_rtc_mmss: can't update from 7 to 52
Dec 21 10:22:56 coyote last message repeated 15 times
Dec 21 10:22:59 coyote last message repeated 3 times
Dec 21 10:23:03 coyote kernel: set_rtc_mmss: can't update from 7 to 53
Dec 21 10:23:19 coyote last message repeated 9 times
Dec 21 10:23:23 coyote kernel: set_rtc_mmss: can't update from 8 to 53
Dec 21 10:23:57 coyote last message repeated 18 times

These last times are correct as I keep my watch on local time
plus or minus a couple of seconds.
And I eventually rebooted to -rc5, which was, and is now,
running fine. And now without a localhost entry in ntp.conf,
thanks for the hint.

--
Cheers, Gene
People having trouble with vz bouncing email to me should add the word
'online' between the 'verizon', and the dot which bypasses vz's
stupid bounce rules. I do use spamassassin too. :-)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2005 by Maurice Eugene Heskett, all rights reserved.