2005-11-02 23:03:58

by Jean-Christian de Rivaz

Subject: NTP broken with 2.6.14

Since I have installed the new kernel 2.6.14, ntpd is unable to
synchronize the time:

talla:~# ntpq
ntpq> pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10.0.0.1        129.132.2.21     3 u   25   64  377    0.871  -88310. 4885.09
ntpq> as

ind assID status  conf reach auth condition  last_event cnt
===========================================================
  1 14484  9014   yes   yes  none    reject   reachable  1
ntpq> rv 14484
assID=14484 status=9014 reach, conf, 1 event, event_reach,
srcadr=10.0.0.1, srcport=123, dstadr=10.0.33.10, dstport=123, leap=00,
stratum=3, precision=-17, rootdelay=37.842, rootdispersion=59.311,
refid=129.132.2.21, reach=377, unreach=0, hmode=3, pmode=4, hpoll=6,
ppoll=6, flash=00 ok, keyid=0, ttl=0, offset=-88310.312, delay=0.871,
dispersion=2.484, jitter=4885.096,
reftime=c713bf2b.ed424e59 Wed, Nov 2 2005 23:41:47.926,
org=c713c16b.0e8ee6b8 Wed, Nov 2 2005 23:51:23.056,
rec=c713c1c6.a420b3d4 Wed, Nov 2 2005 23:52:54.641,
xmt=c713c1c6.a3c7f77a Wed, Nov 2 2005 23:52:54.639,
filtdelay= 0.89 0.89 0.87 0.88 0.90 0.88 0.90 0.90,
filtoffset= -91583. -90207. -88310. -86973. -85104. -83843. -81507. -79682.,
filtdisp= 0.01 0.96 1.93 2.88 3.85 4.83 5.79 6.75
ntpq>
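
The reftime, org, rec and xmt fields above are NTP timestamps: a hex count of seconds since 1900-01-01 UTC, a dot, then a 32-bit binary fraction of a second. A minimal sketch of decoding one is shown below; the function name is made up for illustration. The result is in UTC, whereas the ntpq output above appears to be printed in local time (the sender's mail headers show +0100).

```python
from datetime import datetime, timezone

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_DELTA = 2_208_988_800

def decode_ntp_timestamp(text):
    """Convert a hex NTP timestamp such as 'c713bf2b.ed424e59' to UTC."""
    sec_hex, frac_hex = text.split(".")
    seconds = int(sec_hex, 16) - NTP_UNIX_DELTA   # whole seconds, Unix epoch
    fraction = int(frac_hex, 16) / 2**32          # 32-bit binary fraction
    return datetime.fromtimestamp(seconds + fraction, tz=timezone.utc)

# reftime value copied from the rv output above
print(decode_ntp_timestamp("c713bf2b.ed424e59").isoformat())
```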

The offset always grows without any correction from ntpd. This worked
perfectly with the previous 2.6.8 kernel on this machine, and the
10.0.0.1 router still works, since the other machines on the network
synchronize well at the same time:

craie:~# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.0.0.1        129.132.2.21     3 u  636 1024  377    1.170    2.184   0.369

From the /var/log/ntpstats/peerstats history, the offset started growing
exactly when I rebooted with the new 2.6.14 kernel. The ntpd is the
Debian Sarge version "ntpd 4.2.0a@1:4.2.0a+stable-2-r
Fri Aug 26 10:30:12 UTC 2005 (1)".
--
Jean-Christian de Rivaz


2005-11-02 23:22:05

by john stultz

Subject: Re: NTP broken with 2.6.14

On Thu, 2005-11-03 at 00:05 +0100, Jean-Christian de Rivaz wrote:
> Since I have installed the new kernel 2.6.14, ntpd is unable to
> synchronize the time:

I'm working to see if I can reproduce this. Is this with 2.6.14 vanilla,
or from Linus' git tree post 2.6.14?

thanks
-john




2005-11-02 23:24:40

by john stultz

Subject: Re: NTP broken with 2.6.14

On Thu, 2005-11-03 at 00:05 +0100, Jean-Christian de Rivaz wrote:
> Since I have installed the new kernel 2.6.14, ntpd is unable to
> synchronize the time:

Also, what arch are you using (i386, x86-64, ppc, etc)?

thanks
-john


2005-11-02 23:35:07

by Jean-Christian de Rivaz

Subject: Re: NTP broken with 2.6.14

john stultz wrote:
> On Thu, 2005-11-03 at 00:05 +0100, Jean-Christian de Rivaz wrote:
>
>>Since I have installed the new kernel 2.6.14, ntpd is unable to
>>synchronize the time:
>
>
> I'm working to see if I can reproduce this. Is this with 2.6.14 vanilla,
> or from Linus' git tree post 2.6.14?

This is a vanilla 2.6.14 kernel from Linus' git tree.
The architecture is i386:
Linux talla 2.6.14 #1 PREEMPT Tue Nov 1 17:27:04 CET 2005 i686 GNU/Linux

Thanks,
--
Jean-Christian de Rivaz

2005-11-03 00:15:38

by john stultz

Subject: Re: NTP broken with 2.6.14

On Thu, 2005-11-03 at 00:37 +0100, Jean-Christian de Rivaz wrote:
> john stultz wrote:
> > On Thu, 2005-11-03 at 00:05 +0100, Jean-Christian de Rivaz wrote:
> >
> >>Since I have installed the new kernel 2.6.14, ntpd is unable to
> >>synchronize the time:
> >
> >
> > I'm working to see if I can reproduce this. Is this with 2.6.14 vanilla,
> > or from Linus' git tree post 2.6.14?
>
> This is a vanilla 2.6.14 kernel from Linus git tree.
> The architecture is i386:
> Linux talla 2.6.14 #1 PREEMPT Tue Nov 1 17:27:04 CET 2005 i686 GNU/Linux

I can't seem to trivially reproduce this.


Your ntpq associations output looks suspicious, though.
ind assID status conf reach auth condition last_event cnt
===========================================================
1 14484 9014 yes yes none reject reachable 1

That reject condition seems odd.


What does running "ntpdate -uq <server>" produce?


Also, could you check 2.6.13, or even better do a binary search of
mainline releases since 2.6.8 to narrow down where this broke for you?


thanks
-john

2005-11-03 00:37:03

by Roman Zippel

Subject: Re: NTP broken with 2.6.14

Hi,

On Thu, 3 Nov 2005, Jean-Christian de Rivaz wrote:

> From the /var/log/ntpstats/peerstats history, the offset start growing
> exactly at the same time I rebooted with the new 2.6.14 kernel. The ntpd
> is the from the Debian Sarge version "ntpd 4.2.0a@1:4.2.0a+stable-2-r
> Fri Aug 26 10:30:12 UTC 2005 (1)".

Could you post a few lines from loopstats from before and after the
upgrade? Do you have adjtimex installed? If yes, what's in
/etc/default/adjtimex?

bye, Roman

2005-11-03 00:44:37

by Jean-Christian de Rivaz

Subject: Re: NTP broken with 2.6.14

john stultz wrote:
> On Thu, 2005-11-03 at 00:37 +0100, Jean-Christian de Rivaz wrote:
>
>>john stultz wrote:
>>
>>>On Thu, 2005-11-03 at 00:05 +0100, Jean-Christian de Rivaz wrote:
>>>
>>>
>>>>Since I have installed the new kernel 2.6.14, ntpd is unable to
>>>>synchronize the time:
>>>
>>>
>>>I'm working to see if I can reproduce this. Is this with 2.6.14 vanilla,
>>>or from Linus' git tree post 2.6.14?
>>
>>This is a vanilla 2.6.14 kernel from Linus git tree.
>>The architecture is i386:
>>Linux talla 2.6.14 #1 PREEMPT Tue Nov 1 17:27:04 CET 2005 i686 GNU/Linux
>
>
> I can't seem to trivially reproduce this.
>
>
> Your ntpq associations output looks suspicious, though.
> ind assID status conf reach auth condition last_event cnt
> ===========================================================
> 1 14484 9014 yes yes none reject reachable 1
>
> That reject condition seems odd.
>
>
> What does running "ntpdate -uq <server>" produce?

First, I rebooted with a new 2.6.14 kernel that has the patch
pointed out by Dean Gaudet; this doesn't change the problem.

On the machine with 2.6.14:

talla:~# uname -a
Linux talla 2.6.14-1 #2 PREEMPT Thu Nov 3 00:54:44 CET 2005 i686 GNU/Linux
talla:~# ntpdate -uq 10.0.0.1
server 10.0.0.1, stratum 3, offset -14.893095, delay 0.02644
3 Nov 01:31:59 ntpdate[8186]: step time server 10.0.0.1 offset
-14.893095 sec
talla:~# ntpdate -uq 129.132.2.21
server 129.132.2.21, stratum 2, offset -14.907672, delay 0.04263
3 Nov 01:32:00 ntpdate[8187]: step time server 129.132.2.21 offset
-14.907672 sec

>
> Also, could you check 2.6.13, or even better do a binary search of
> mainline releases since 2.6.8 to narrow down where this broke for you?

On the other machines using the same server:

craie:~# uname -a
Linux craie 2.4.27-pre2-7-k7 #1 lun mai 17 00:08:15 CEST 2004 i686 GNU/Linux
craie:~# ntpdate -uq 10.0.0.1
server 10.0.0.1, stratum 3, offset 0.000046, delay 0.02641
3 Nov 01:31:38 ntpdate[16783]: adjust time server 10.0.0.1 offset
0.000046 sec
craie:~# ntpdate -uq 129.132.2.21
server 129.132.2.21, stratum 2, offset -0.013689, delay 0.04294
3 Nov 01:31:39 ntpdate[16786]: adjust time server 129.132.2.21 offset
-0.013689 sec

citron:~# uname -a
Linux citron 2.6.12-nfs-1 #1 Fri Jun 24 18:23:39 CEST 2005 i686 GNU/Linux
citron:~# ntpdate -uq 10.0.0.1
server 10.0.0.1, stratum 3, offset 0.003676, delay 0.02647
3 Nov 01:32:06 ntpdate[13476]: adjust time server 10.0.0.1 offset
0.003676 sec
citron:~# ntpdate -uq 129.132.2.21
server 129.132.2.21, stratum 2, offset -0.010485, delay 0.04341
3 Nov 01:32:11 ntpdate[13477]: adjust time server 129.132.2.21 offset
-0.010485 sec

So this could be something introduced after 2.6.12. All machines run the
same version of ntpd and use the same configuration file.

Thanks,
--
Jean-Christian de Rivaz

2005-11-03 01:07:15

by john stultz

Subject: Re: NTP broken with 2.6.14

On Thu, 2005-11-03 at 01:45 +0100, Jean-Christian de Rivaz wrote:
> john stultz wrote:
> > On Thu, 2005-11-03 at 00:37 +0100, Jean-Christian de Rivaz wrote:
> >>john stultz wrote:
> >>>On Thu, 2005-11-03 at 00:05 +0100, Jean-Christian de Rivaz wrote:
> >>>>Since I have installed the new kernel 2.6.14, ntpd is unable to
> >>>>synchronize the time:
> >>>
> >>>I'm working to see if I can reproduce this. Is this with 2.6.14 vanilla,
> >>>or from Linus' git tree post 2.6.14?
> >>
> >>This is a vanilla 2.6.14 kernel from Linus git tree.
> >>The architecture is i386:
> >>Linux talla 2.6.14 #1 PREEMPT Tue Nov 1 17:27:04 CET 2005 i686 GNU/Linux
> >
> > I can't seem to trivially reproduce this.
> >
> >
> > Your ntpq associations output looks suspicious, though.
> > ind assID status conf reach auth condition last_event cnt
> > ===========================================================
> > 1 14484 9014 yes yes none reject reachable 1
> >
> > That reject condition seems odd.
> >
> > What does running "ntpdate -uq <server>" produce?
>
> First I have rebooted with a new kernel 2.6.14 that have the patch
> pointed out by Dean Gaudet, this don't change the problem.
>
> On the machine with 2.6.14:
>
> talla:~# uname -a
> Linux talla 2.6.14-1 #2 PREEMPT Thu Nov 3 00:54:44 CET 2005 i686 GNU/Linux
> talla:~# ntpdate -uq 10.0.0.1
> server 10.0.0.1, stratum 3, offset -14.893095, delay 0.02644
> 3 Nov 01:31:59 ntpdate[8186]: step time server 10.0.0.1 offset
> -14.893095 sec

Hmm. Ok, so network wise you seem to be communicating with the server
without an issue. The other reasons for a reject condition are sync-loop
(your NTP server isn't synced to your box I'd assume), or your host is
drifting too severely from the NTP server for ntpd to compensate.

Attached is a cruddy python script I wrote that should provide you with
your ppm drift from your server.
To run:
o Disable ntpd
o Run "./drift-test.py <server>"
o Let it run for 10 minutes to get a decent drift value.
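
For reference, the drift figure such a test reports is just the change in clock offset divided by the elapsed time, scaled to parts per million. The sketch below illustrates only that arithmetic; it is not the attached drift-test.py (whose contents are not shown here), and the function name and sample values are made up for illustration.

```python
def drift_ppm(start_offset, current_offset, elapsed_seconds):
    """Clock drift in parts per million.

    Offsets are in seconds (as printed by ntpdate), elapsed_seconds is the
    wall-clock time between the two measurements.
    """
    return (current_offset - start_offset) / elapsed_seconds * 1e6

# Hypothetical sample: the offset moved from 0 s to -0.0869 s over 600 s.
print(round(drift_ppm(0.0, -0.0869, 600), 1), "ppm")
```

A longer run averages out measurement noise in the individual offset readings, which is why a run of about 10 minutes gives a more trustworthy value than the first sample.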


> citron:~# uname -a
> Linux citron 2.6.12-nfs-1 #1 Fri Jun 24 18:23:39 CEST 2005 i686 GNU/Linux
> citron:~# ntpdate -uq 10.0.0.1
> server 10.0.0.1, stratum 3, offset 0.003676, delay 0.02647
> 3 Nov 01:32:06 ntpdate[13476]: adjust time server 10.0.0.1 offset
> 0.003676 sec
> citron:~# ntpdate -uq 129.132.2.21
> server 129.132.2.21, stratum 2, offset -0.010485, delay 0.04341
> 3 Nov 01:32:11 ntpdate[13477]: adjust time server 129.132.2.21 offset
> -0.010485 sec
>
> So this could to be something after the 2.6.12. All machines run the
> same version of ntpd and use the same configuration file.

Would you mind confirming 2.6.12 does not have the issue on the same
hardware?

thanks
-john


Attachments:
drift-test.py (1.17 kB)

2005-11-03 01:12:37

by Jean-Christian de Rivaz

Subject: Re: NTP broken with 2.6.14

Roman Zippel wrote:
> Hi,
>
> On Thu, 3 Nov 2005, Jean-Christian de Rivaz wrote:
>
>
>>From the /var/log/ntpstats/peerstats history, the offset start growing
>>exactly at the same time I rebooted with the new 2.6.14 kernel. The ntpd
>>is the from the Debian Sarge version "ntpd 4.2.0a@1:4.2.0a+stable-2-r
>>Fri Aug 26 10:30:12 UTC 2005 (1)".
>
>
> Could you post a few lines from loopstats from before and after the
> upgrade? Do you have adjtimex installed? If yes, what's in
> /etc/default/adjtimex?

Here is a visible transition in the loopstats history:

53675 46251.308 0.004081011 -194.259811 0.005871592 17.580735 10
53675 47277.997 -0.000660212 -194.262329 0.005610392 15.279835 10
53675 48303.646 0.002776830 -194.251724 0.005153706 14.303375 10
53675 49330.304 0.004138248 -194.235901 0.004514851 14.801189 10
53675 50353.973 0.003474513 -194.222672 0.003924034 14.497788 10
53675 51379.750 0.000253633 -194.221710 0.003760592 12.565096 10
53675 52406.302 0.003895862 -194.206818 0.003731353 13.287282 10
53675 53432.968 0.004171375 -194.190887 0.003234381 14.104549 10
53675 54456.672 0.000078349 -194.190598 0.003469025 12.215800 10
53675 55481.298 0.003641894 -194.176697 0.003492895 12.750438 10
53675 56504.967 0.002649463 -194.166611 0.003065365 12.190070 10
53675 57531.730 0.000912963 -194.163132 0.002793064 10.706129 10
53675 58558.296 0.003194148 -194.150925 0.002674295 11.181610 10
53675 59584.996 0.002605389 -194.140976 0.002334641 10.941553 10
53675 60611.694 0.004877513 -194.122314 0.002319170 13.456606 10
53675 61636.321 0.003967914 -194.107178 0.002059309 13.995452 10
53675 62660.051 0.007092572 -194.080124 0.002370957 18.405714 10
53675 63684.643 0.001312519 -194.075119 0.003545184 16.144477 10
53675 64710.293 -0.000216660 -194.075943 0.003163992 13.987890 10
53675 65735.018 0.002720645 -194.065567 0.003108870 13.227565 10
53675 66759.621 0.000030339 -194.065460 0.003009692 11.455537 10
53675 67784.291 -0.000155943 -194.066055 0.002608133 9.925464 10
53675 68426.784 0.000000000 -194.065994 0.000003304 0.000184 6
53675 68491.127 -0.013804111 -194.921677 0.006902056 27.381836 6
53675 68748.284 -0.003905727 -195.879013 0.007760366 38.740322 6
53675 68811.324 -0.003714981 -196.102203 0.006721351 34.301879 6
53675 69440.055 0.000000000 -194.065994 0.000003304 0.000184 6
53675 69503.047 0.000000000 -194.065994 0.000002861 0.000159 6
53675 69569.039 0.000000000 -194.065994 0.000002478 0.000138 6
53675 69635.029 0.000000000 -194.065994 0.000002146 0.000119 6
53675 69699.021 0.000000000 -194.065994 0.000001858 0.000103 6
53675 69762.012 0.000000000 -194.065994 0.000001609 0.000089 7
53675 69825.004 0.000000000 -194.065994 0.000001394 0.000077 7
53675 69889.996 0.000000000 -194.065994 0.000001207 0.000067 7
53675 69952.987 0.000000000 -194.065994 0.000001045 0.000058 7
53675 70017.979 0.000000000 -194.065994 0.000000905 0.000050 7
53675 70083.970 0.000000000 -194.065994 0.000000784 0.000044 8
53675 70149.961 0.000000000 -194.065994 0.000000679 0.000038 8
53675 70213.953 0.000000000 -194.065994 0.000000588 0.000033 8
53675 70276.945 0.000000000 -194.065994 0.000000509 0.000028 8
53675 70339.936 0.000000000 -194.065994 0.000000441 0.000025 9
53675 70403.928 0.000000000 -194.065994 0.000000382 0.000021 9
53675 70471.919 0.000000000 -194.065994 0.000000331 0.000018 9
53675 70536.910 0.000000000 -194.065994 0.000000286 0.000016 9
53675 70601.906 0.000000000 -194.065994 0.000000248 0.000014 10
53675 70665.893 0.000000000 -194.065994 0.000000215 0.000012 10
53675 70728.885 0.000000000 -194.065994 0.000000186 0.000010 10
53675 70791.878 0.000000000 -194.065994 0.000000161 0.000009 10
53675 70854.869 0.000000000 -194.065994 0.000000140 0.000008 10
53675 70918.860 0.000000000 -194.065994 0.000000121 0.000007 10
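
For readers unfamiliar with the format: each loopstats record is, as far as the ntpd monitoring documentation describes it, the Modified Julian Day, seconds past midnight UTC, clock offset (s), frequency offset (ppm), RMS jitter (s), frequency wander (ppm) and the loop time constant. A minimal sketch for parsing such lines and spotting where the time constant drops (which in the listing above coincides with the restarts) might look like this; the names are made up for illustration.

```python
from collections import namedtuple

# Field names are illustrative; the order follows the loopstats record layout.
LoopSample = namedtuple(
    "LoopSample",
    "mjd seconds offset_s freq_ppm jitter_s wander_ppm time_const",
)

def parse_loopstats(line):
    """Parse one whitespace-separated loopstats record."""
    f = line.split()
    return LoopSample(int(f[0]), float(f[1]), float(f[2]), float(f[3]),
                      float(f[4]), float(f[5]), int(f[6]))

def restart_indexes(samples):
    """Indexes where the time constant is lower than in the previous record."""
    return [i for i in range(1, len(samples))
            if samples[i].time_const < samples[i - 1].time_const]

# Three consecutive records copied from the listing above.
lines = [
    "53675 67784.291 -0.000155943 -194.066055 0.002608133 9.925464 10",
    "53675 68426.784 0.000000000 -194.065994 0.000003304 0.000184 6",
    "53675 68491.127 -0.013804111 -194.921677 0.006902056 27.381836 6",
]
samples = [parse_loopstats(l) for l in lines]
print(restart_indexes(samples))
```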

I did not have adjtimex installed. I just installed it to see what
happens. adjtimexconfig calculated the following values and wrote them to
the /etc/default/adjtimex file:
TICK=121
FREQ=3597582

When I run "/etc/init.d/adjtimex start" I get an error and a strange
value of USER_HZ, since the kernel is configured for a HZ of 250 (maybe
these two values are not related, I don't know):

talla:~# /etc/init.d/adjtimex start
Regulating system clock...adjtimex: Invalid argument
for this kernel:
USER_HZ = 100 (nominally 100 ticks per second)
9000 <= tick <= 11000
-33554432 <= frequency <= 33554432
done.

talla:~# zcat /proc/config.gz | egrep HZ
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_MACHZ_WDT=m

Thanks,
--
Jean-Christian de Rivaz

2005-11-03 02:26:08

by Jean-Christian de Rivaz

Subject: Re: NTP broken with 2.6.14

john stultz wrote:

>>talla:~# uname -a
>>Linux talla 2.6.14-1 #2 PREEMPT Thu Nov 3 00:54:44 CET 2005 i686 GNU/Linux
>>talla:~# ntpdate -uq 10.0.0.1
>>server 10.0.0.1, stratum 3, offset -14.893095, delay 0.02644
>> 3 Nov 01:31:59 ntpdate[8186]: step time server 10.0.0.1 offset
>>-14.893095 sec
>
>
> Hmm. Ok, so network wise you seem to be communicating with the server
> without an issue. The other reasons for a reject condition are sync-loop
> (your NTP server isn't synced to your box I'd assume), or your host is
> drifting too severely from the NTP server for ntpd to compensate.
>
> Attached is a cruddy python script I wrote that should provide you with
> your ppm drift from your server.
> To run:
> o Disable ntpd
> o Run "./drift-test.py <server>"
> o Let it run for 10 minutes to get a decent drift value.

Ok. First I purged the adjtimex installation (removing the config files
and binary). Then I rebooted with the old 2.6.8 kernel and watched the
first 5 polls of ntpd:

ntpq> pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.0.0.1        129.132.2.21     3 u    2   64   37    1.082  -45.761  24.670
ntpq> rv 54820
assID=54820 status=9614 reach, conf, sel_sys.peer, 1 event, event_reach,
srcadr=10.0.0.1, srcport=123, dstadr=10.0.33.10, dstport=123, leap=00,
stratum=3, precision=-17, rootdelay=35.065, rootdispersion=46.066,
refid=129.132.2.21, reach=037, unreach=0, hmode=3, pmode=4, hpoll=6,
ppoll=6, flash=00 ok, keyid=0, ttl=0, offset=-45.761, delay=1.082,
dispersion=438.994, jitter=24.670,
reftime=c713e731.c00053e2 Thu, Nov 3 2005 2:32:33.750,
org=c713e7b0.6cbe0ded Thu, Nov 3 2005 2:34:40.424,
rec=c713e7b0.7bdb5d89 Thu, Nov 3 2005 2:34:40.483,
xmt=c713e7b0.7b72606f Thu, Nov 3 2005 2:34:40.482,
filtdelay= 1.13 1.08 1.17 1.13 1.14 0.00 0.00 0.00,
filtoffset= -58.48 -45.76 -32.77 -20.31 -7.60 0.00 0.00 0.00,
filtdisp= 0.01 0.96 1.92 2.86 3.82 16000.0 16000.0 16000.0

So with 2.6.8 this machine has a working ntpd. Next I stopped ntpd and
used your script with the server 10.0.0.1:

03 Nov 02:36:32 offset: -6.9e-05 drift: -77.0 ppm
03 Nov 02:37:32 offset: -0.005162 drift: -84.7540983607 ppm
03 Nov 02:38:32 offset: -0.011573 drift: -95.7107438017 ppm
03 Nov 02:39:32 offset: -0.019045 drift: -105.26519337 ppm
03 Nov 02:40:32 offset: -0.02732 drift: -113.394190871 ppm
03 Nov 02:41:32 offset: -0.036287 drift: -120.581395349 ppm
03 Nov 02:42:32 offset: -0.045824 drift: -126.958448753 ppm
03 Nov 02:43:32 offset: -0.055755 drift: -132.45368171 ppm
03 Nov 02:44:33 offset: -0.065992 drift: -136.929460581 ppm
03 Nov 02:45:33 offset: -0.076472 drift: -141.10701107 ppm
03 Nov 02:46:33 offset: -0.087156 drift: -144.790697674 ppm

After, I rebooted the machine with the 2.6.14 kernel and watched the
first 5 polls of ntpd:

ntpq> pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10.0.0.1        129.132.2.21     3 u    2   64   37    1.106  -6989.1 3351.11
ntpq> rv 51860
assID=51860 status=9014 reach, conf, 1 event, event_reach,
srcadr=10.0.0.1, srcport=123, dstadr=10.0.33.10, dstport=123, leap=00,
stratum=3, precision=-17, rootdelay=36.804, rootdispersion=52.307,
refid=129.132.2.21, reach=037, unreach=0, hmode=3, pmode=4, hpoll=6,
ppoll=6, flash=00 ok, keyid=0, ttl=0, offset=-6989.157, delay=1.106,
dispersion=438.355, jitter=3351.111,
reftime=c713eb31.d4be1eb4 Thu, Nov 3 2005 2:49:37.831,
org=c713ec29.241b64e0 Thu, Nov 3 2005 2:53:45.141,
rec=c713ec30.21790752 Thu, Nov 3 2005 2:53:52.130,
xmt=c713ec30.21121251 Thu, Nov 3 2005 2:53:52.129,
filtdelay= 1.11 1.14 1.16 1.16 1.22 0.00 0.00 0.00,
filtoffset= -6989.1 -6239.9 -5482.4 -4133.2 -1164.0 0.00 0.00 0.00,
filtdisp= 0.01 0.97 1.95 2.91 3.88 16000.0 16000.0 16000.0

As before, the ntpd is not working properly with the 2.6.14 kernel. Now
I stopped ntpd and used your script with the 10.0.0.1 server:

03 Nov 02:54:56 offset: -0.008247 drift: -4236.0 ppm
03 Nov 02:55:56 offset: -0.828716 drift: -13519.7540984 ppm
03 Nov 02:56:57 offset: -1.593172 drift: -13025.9098361 ppm
03 Nov 02:57:57 offset: -2.817531 drift: -15458.9010989 ppm
03 Nov 02:58:57 offset: -3.442019 drift: -14206.6446281 ppm
03 Nov 02:59:57 offset: -4.070492 drift: -13465.1688742 ppm
03 Nov 03:00:57 offset: -4.658962 drift: -12858.980663 ppm
03 Nov 03:01:57 offset: -5.267374 drift: -12472.4241706 ppm
03 Nov 03:02:57 offset: -5.843858 drift: -12115.8651452 ppm
03 Nov 03:03:57 offset: -7.052287 drift: -13004.199262 ppm
03 Nov 03:04:58 offset: -7.564786 drift: -12538.5986733 ppm

Interesting! So if I understand correctly, the ntpd problem arises because
the 2.6.14 kernel shows a drift about 100 times bigger than the 2.6.8
kernel on the same hardware. For information, the mainboard is
the MSI K7N2 (nForce 2). Here is the cpuinfo in case it matters:

processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 10
model name : AMD Athlon(tm)
stepping : 0
cpu MHz : 2004.860
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips : 4013.69


>>citron:~# uname -a
>>Linux citron 2.6.12-nfs-1 #1 Fri Jun 24 18:23:39 CEST 2005 i686 GNU/Linux
>>citron:~# ntpdate -uq 10.0.0.1
>>server 10.0.0.1, stratum 3, offset 0.003676, delay 0.02647
>> 3 Nov 01:32:06 ntpdate[13476]: adjust time server 10.0.0.1 offset
>>0.003676 sec
>>citron:~# ntpdate -uq 129.132.2.21
>>server 129.132.2.21, stratum 2, offset -0.010485, delay 0.04341
>> 3 Nov 01:32:11 ntpdate[13477]: adjust time server 129.132.2.21 offset
>>-0.010485 sec
>>
>>So this could to be something after the 2.6.12. All machines run the
>>same version of ntpd and use the same configuration file.
>
>
> Would you mind confirming 2.6.12 does not have the issue on the same
> hardware?

The 2.6.12 kernel runs on different hardware and is not configured to
work on the hardware that has the problem with 2.6.14, so I can't
answer your question exactly yet. If you don't have any better idea, I
can try several kernel versions to find where the problem starts. But I
will do that after I sleep because, if you look at the time field in
the test results, it is very late now for me...

Thanks for your support, I hope we will quickly find a solution.
--
Jean-Christian de Rivaz

2005-11-03 09:42:34

by Roman Zippel

Subject: Re: NTP broken with 2.6.14

Hi,

On Thu, 3 Nov 2005, Jean-Christian de Rivaz wrote:

> talla:~# zcat /proc/config.gz | egrep HZ
> # CONFIG_HZ_100 is not set
> CONFIG_HZ_250=y
> # CONFIG_HZ_1000 is not set
> CONFIG_HZ=250

It's possible that it works again if you change this back to 1000, but
that would mean something is still wrong.
Could you compare the boot messages and check whether you find any
suspicious differences?

bye, Roman

2005-11-03 19:32:32

by john stultz

Subject: Re: NTP broken with 2.6.14

On Thu, 2005-11-03 at 03:26 +0100, Jean-Christian de Rivaz wrote:
> john stultz wrote:
> > Hmm. Ok, so network wise you seem to be communicating with the server
> > without an issue. The other reasons for a reject condition are sync-loop
> > (your NTP server isn't synced to your box I'd assume), or your host is
> > drifting too severely from the NTP server for ntpd to compensate.
> >
> > Attached is a cruddy python script I wrote that should provide you with
> > your ppm drift from your server.
> > To run:
> > o Disable ntpd
> > o Run "./drift-test.py <server>"
> > o Let it run for 10 minutes to get a decent drift value.

[snip]
> So with 2.6.8 this machine have a working ntpd. Now I stopped ntpd and
> used your script with the server 10.0.0.1:
>
> 03 Nov 02:36:32 offset: -6.9e-05 drift: -77.0 ppm
> 03 Nov 02:37:32 offset: -0.005162 drift: -84.7540983607 ppm
> 03 Nov 02:38:32 offset: -0.011573 drift: -95.7107438017 ppm
> 03 Nov 02:39:32 offset: -0.019045 drift: -105.26519337 ppm
> 03 Nov 02:40:32 offset: -0.02732 drift: -113.394190871 ppm
> 03 Nov 02:41:32 offset: -0.036287 drift: -120.581395349 ppm
> 03 Nov 02:42:32 offset: -0.045824 drift: -126.958448753 ppm
> 03 Nov 02:43:32 offset: -0.055755 drift: -132.45368171 ppm
> 03 Nov 02:44:33 offset: -0.065992 drift: -136.929460581 ppm
> 03 Nov 02:45:33 offset: -0.076472 drift: -141.10701107 ppm
> 03 Nov 02:46:33 offset: -0.087156 drift: -144.790697674 ppm

[snip]
> As before, the ntpd is not working properly with the 2.6.14 kernel. Now
> I stopped ntpd and used your script with the 10.0.0.1 server:
>
> 03 Nov 02:54:56 offset: -0.008247 drift: -4236.0 ppm
> 03 Nov 02:55:56 offset: -0.828716 drift: -13519.7540984 ppm
> 03 Nov 02:56:57 offset: -1.593172 drift: -13025.9098361 ppm
> 03 Nov 02:57:57 offset: -2.817531 drift: -15458.9010989 ppm
> 03 Nov 02:58:57 offset: -3.442019 drift: -14206.6446281 ppm
> 03 Nov 02:59:57 offset: -4.070492 drift: -13465.1688742 ppm
> 03 Nov 03:00:57 offset: -4.658962 drift: -12858.980663 ppm
> 03 Nov 03:01:57 offset: -5.267374 drift: -12472.4241706 ppm
> 03 Nov 03:02:57 offset: -5.843858 drift: -12115.8651452 ppm
> 03 Nov 03:03:57 offset: -7.052287 drift: -13004.199262 ppm
> 03 Nov 03:04:58 offset: -7.564786 drift: -12538.5986733 ppm
>
> Interresting! So if I understand correctly the ntpd problem is because
> the kernel 2.6.14 kernel show a drift about 100 time bigger than with
> the kernel 2.6.8 on the same hardware. For information the mainboard is
> the MSI K7N2 (nForce 2). Here is the cpuinfo in case that matter:

Yep. That's what I was guessing. For some reason time is running too
quickly on your system. Since it is more than +/-500ppm, NTP gives up and
won't sync.

Time running too fast can have a number of causes.

Could you open a bugme bug on this and tag me as the owner?
http://bugzilla.kernel.org

Also attach dmesg output and we'll see if that doesn't provide more
clues.

> > Would you mind confirming 2.6.12 does not have the issue on the same
> > hardware?
>
> The kernel 2.6.12 run on a different hardware and is not configured to
> work on the hardware that have the problem with 2.6.14, so I can't
> confime exactly your question yet. If you don't have any better idea, I
> can try several kernel version to find when the problem start. But I
> will make that after I sleep because I you look at the time field into
> the test result this is very late now for me...

When you get a chance, starting with a binary search of kernel versions
would help narrow down the issue (start w/ 2.6.8 vanilla to make sure
something in the distro tree isn't avoiding this issue).

> Thanks for you support, I hope we will quicky find to solution.

I really appreciate your immediate feedback and testing! I'm sure we can
resolve this soon, esp since you do have a working configuration to
compare against.

thanks
-john

2005-11-03 19:51:27

by Lennart Sorensen

Subject: Re: NTP broken with 2.6.14

On Thu, Nov 03, 2005 at 11:32:28AM -0800, john stultz wrote:
> Yep. Thats what I was guessing. For some reason time is running too
> quickly on your system. Since it is more then +/-500ppm NTP gives up and
> won't sync.
>
> Time running too fast can have a number of causes.
>
> Could you open a bugme bug on this and tag me as the owner?
> http://bugzilla.kernel.org
>
> Also attach dmesg output and we'll see if that doesn't provide more
> clues.

I have no idea if this is related at all, but I have had system time
speed issues on my machine for a while too.

With 2.6.8 it always ran fine, with 2.6.12 it seemed to gain around 10
minutes per hour, and so far today with 2.6.14 it seems to be gaining
about 3 or 4 minutes per hour. These are all Debian kernels, although I
hope they haven't added/removed anything that would affect this.

This is happening on an Asus A7N8X-E-DX with an Athlon XP 2800+. I have
acpi enabled, so who knows if that is what is breaking things. There
does seem to have been time keeping issues on ati chipsets big time in
recent kernels, and some other acpi issues at times, so it wouldn't
surprise me if a fix for one issue causes problems on another chipset.
The chipset on this board is the nforce2.

Len Sorensen

2005-11-03 20:11:18

by john stultz

Subject: Re: NTP broken with 2.6.14

On Thu, 2005-11-03 at 14:51 -0500, Lennart Sorensen wrote:
> On Thu, Nov 03, 2005 at 11:32:28AM -0800, john stultz wrote:
> > Yep. Thats what I was guessing. For some reason time is running too
> > quickly on your system. Since it is more then +/-500ppm NTP gives up and
> > won't sync.
> >
> > Time running too fast can have a number of causes.
> >
> > Could you open a bugme bug on this and tag me as the owner?
> > http://bugzilla.kernel.org
> >
> > Also attach dmesg output and we'll see if that doesn't provide more
> > clues.
>
> I have no idea if this is related at all, but I have had system time
> speed issues on my machine for a while too.
>
> With 2.6.8 it always ran fine, with 2.6.12 it seemed to gain around 10
> minutes per hour, and so far today with 2.6.14 it seems to be gaining
> about 3 or 4 minutes per hour. These are all Debian kernels, although I
> hope they haven't added/removed anything that would affect this.
>
> This is happening on an Asus A7N8X-E-DX with an Athlon XP 2800+. I have
> acpi enabled, so who knows if that is what is breaking things. There
> does seem to have been time keeping issues on ati chipsets big time in
> recent kernels, and some other acpi issues at times, so it wouldn't
> surprise me if a fix for one issue causes problems on another chipset.
> The chipset on this board is the nforce2.

Yea, we have some issues with a few specific chipsets, but those were
not regressions to my knowledge.

Hmm. Check bug #5038 to see if it sounds familiar.
http://bugzilla.kernel.org/show_bug.cgi?id=5038

thanks
-john




2005-11-03 20:48:09

by Lennart Sorensen

Subject: Re: NTP broken with 2.6.14

On Thu, Nov 03, 2005 at 12:11:10PM -0800, john stultz wrote:
> Yea, we have some issues with a few specific chipsets, but those were
> not regressions to my knowledge.

Well my nforce2 worked with 2.6.8 and earlier, and I believe it was even
fine with 2.6.10 and 2.6.11, but certainly 2.6.12 ran rather awful time
sync wise, and 2.6.14 appears so far to be running a little fast,
although I can't say for sure. It is much better than 2.6.12 was. It
may be that 2.6.14 is running correctly, but that the previous drift has
caused something to need to be realigned.

> Hmm. Check bug #5038 to see if sounds familiar.
> http://bugzilla.kernel.org/show_bug.cgi?id=5038

I was seeing WAY more drift than that with 2.6.12.

Len Sorensen

2005-11-03 21:00:31

by john stultz

Subject: Re: NTP broken with 2.6.14

On Thu, 2005-11-03 at 15:48 -0500, Lennart Sorensen wrote:
> On Thu, Nov 03, 2005 at 12:11:10PM -0800, john stultz wrote:
> > Yea, we have some issues with a few specific chipsets, but those were
> > not regressions to my knowledge.
>
> Well my nforce2 worked with 2.6.8 and earlier, and I believe it was even
> fine with 2.6.10 and 2.6.11, but certainly 2.6.12 ran rather awful time
> sync wise, and 2.6.14 appears so far to be running a little fast,
> although I can't say for sure. It is much better than 2.6.12 was. It
> may be that 2.6.14 is running correctly, but that the previous drift has
> caused something to need to be realigned.

You could try disabling NTP and running the python script I sent out
earlier in this thread to determine your system's ppm drift. Outside
+/-500ppm is definitely broken, outside of +/-250ppm is probably broken,
outside +/-100ppm isn't great but correctable and inside +/-100ppm is
(unfortunately) pretty average for most hardware.
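
These rules of thumb can be written down as a small helper. The sketch below is only an illustration of the thresholds just listed; the function name and the wording of the labels are made up.

```python
def classify_drift(ppm):
    """Rate a measured drift (in ppm) against the rule-of-thumb thresholds."""
    magnitude = abs(ppm)            # the sign only says fast vs. slow
    if magnitude > 500:
        return "definitely broken"
    if magnitude > 250:
        return "probably broken"
    if magnitude > 100:
        return "not great but correctable"
    return "about average"

# Final drift values reported earlier in the thread for 2.6.8 and 2.6.14.
for ppm in (-144.79, -12538.6):
    print(ppm, "->", classify_drift(ppm))
```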

>
> > Hmm. Check bug #5038 to see if sounds familiar.
> > http://bugzilla.kernel.org/show_bug.cgi?id=5038
>
> I was seeting WAY more drift than that with 2.6.12.

Ok, do you want to open your own bug on this and we'll mark them
duplicate as needed?

Please attach dmesg output to the bug as well.

thanks
-john


2005-11-03 21:12:47

by Jean-Christian de Rivaz

Subject: Re: NTP broken with 2.6.14

john stultz wrote:

>>So with 2.6.8 this machine have a working ntpd. Now I stopped ntpd and
>>used your script with the server 10.0.0.1:
>>
>>03 Nov 02:46:33 offset: -0.087156 drift: -144.790697674 ppm
>
>>As before, the ntpd is not working properly with the 2.6.14 kernel. Now
>>I stopped ntpd and used your script with the 10.0.0.1 server:
>>
>>03 Nov 03:04:58 offset: -7.564786 drift: -12538.5986733 ppm
>>
>>Interresting! So if I understand correctly the ntpd problem is because
>>the kernel 2.6.14 kernel show a drift about 100 time bigger than with
>>the kernel 2.6.8 on the same hardware. For information the mainboard is
>>the MSI K7N2 (nForce 2). Here is the cpuinfo in case that matter:
>
>
> Yep. Thats what I was guessing. For some reason time is running too
> quickly on your system. Since it is more then +/-500ppm NTP gives up and
> won't sync.
>
> Time running too fast can have a number of causes.
>
> Could you open a bugme bug on this and tag me as the owner?
> http://bugzilla.kernel.org
>
> Also attach dmesg output and we'll see if that doesn't provide more
> clues.

Ok, I will open a bug in a moment.

>>>Would you mind confirming 2.6.12 does not have the issue on the same
>>>hardware?
>>
>>The kernel 2.6.12 run on a different hardware and is not configured to
>>work on the hardware that have the problem with 2.6.14, so I can't
>>confime exactly your question yet. If you don't have any better idea, I
>>can try several kernel version to find when the problem start. But I
>>will make that after I sleep because I you look at the time field into
>>the test result this is very late now for me...
>
> When you get a chance starting with a binary search of kernel versions
> would help narrow down the issue (start w/ 2.6.8 vanilla to make sure
> something in the distro tree isn't avoiding this issue).

I have tested 7 different vanilla kernels on the same suspect hardware:

                2.6.8  : ntpd working : drift from -77ppm to -144ppm
                2.6.9  : ntpd working : drift from -99ppm to -231ppm
                2.6.10 : ntpd failed  : drift from -37825ppm to -29912ppm
                2.6.12 : ntpd failed  : drift from -43429ppm to -45251ppm
CONFIG_HZ=100   2.6.14 : ntpd failed  : drift from -7598ppm to -4410ppm
CONFIG_HZ=250   2.6.14 : ntpd failed  : drift from -13519ppm to -12538ppm
CONFIG_HZ=1000  2.6.14 : ntpd failed  : drift from -14497ppm to -19543ppm

So it seems that the problem starts with kernel 2.6.10. While testing,
I had the impression that the 2.6.14 kernel behaves a little differently
from the other kernels: the drift does not always increase in the same
direction as it does for the others, but oscillates around a central
(too high) value. I also tested different values of CONFIG_HZ with
2.6.14, as suggested by Roman Zippel, but this does not make ntpd work.
Interestingly, with CONFIG_HZ=100 the drift is halved and ntpd resets its
stats after every 5 polls! I didn't notice that with any other kernel.
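
To put these ppm figures in perspective, here is a small Python sketch
that converts a drift in ppm into seconds of error per day. The helper
name ppm_to_seconds_per_day is made up for this example; the conversion
itself is just the definition of parts per million applied to 86400
seconds.

```python
SECONDS_PER_DAY = 86400


def ppm_to_seconds_per_day(ppm):
    """Clock error accumulated over one day for a given drift in ppm."""
    return ppm * 1e-6 * SECONDS_PER_DAY


# A few of the drift values from the table above.
for ppm in (-144, -12538, -45251):
    print(f"{ppm} ppm is about {ppm_to_seconds_per_day(ppm):.1f} s/day")
```

So a drift in the low hundreds of ppm amounts to a few tens of seconds
per day, while the values measured on the failing kernels amount to
many minutes per day.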

>
>>Thanks for you support, I hope we will quicky find to solution.
>
> I really appreciate your immediate feedback and testing! I'm sure we can
> resolve this soon, esp since you do have a working configuration to
> compare against.

I have 2 other machines with kernel >= 2.6.10: one with kernel 2.6.12
and one with kernel 2.6.13, and they don't have any problem with ntpd.
So this seems to be specific to the hardware. The two machines without
the problem are based on VIA chipsets, one with a K7 and one with a K8.
The machine with the ntp problem is based on an nForce2 with a K7. I saw
another post saying that the same problem exists with another
nForce2-based board.

Hope this helps,
--
Jean-Christian de Rivaz

2005-11-03 21:12:46

by Lennart Sorensen

Subject: Re: NTP broken with 2.6.14

On Thu, Nov 03, 2005 at 01:00:27PM -0800, john stultz wrote:
> You could try disabling NTP and running the python script I sent out
> earlier in this thread to determine your systems ppm drift. Outside
> +/-500ppm is def broken, outside of +/-250ppm is probably broken,
> outside +/-100ppm isn't great but correctable and inside +/-100ppm is
> (unfortunately) pretty average for most hardware.

Well with no ntpd running on 2.6.14, it appears that my nforce2 board
matches that bug rather well. About a second gained every minute or
two.

> Ok, do you want to open your own bug on this and we'll mark them
> duplicate as needed?
>
> Please attach dmesg output to the bug as well.

Well it seems whatever was wrong with 2.6.12 for me isn't a problem in
2.6.14, as it is not gaining 10 to 15 seconds every minute now. It is
still gaining a bit though. I have not tried to run without the local
apic in case that makes a difference.

Len Sorensen

2005-11-03 21:28:35

by Jean-Christian de Rivaz

Subject: Re: NTP broken with 2.6.14

john stultz wrote:

>>This is happening on an Asus A7N8X-E-DX with an Athlon XP 2800+. I have
>>acpi enabled, so who knows if that is what is breaking things. There
>>does seem to have been time keeping issues on ati chipsets big time in
>>recent kernels, and some other acpi issues at times, so it wouldn't
>>surprise me if a fix for one issue causes problems on another chipset.
>>The chipset on this board is the nforce2.
>
>
> Yea, we have some issues with a few specific chipsets, but those were
> not regressions to my knowledge.
>
> Hmm. Check bug #5038 to see if sounds familiar.
> http://bugzilla.kernel.org/show_bug.cgi?id=5038

Interesting bug report. You should know that the machine (nForce2
based) that has the ntpd problem is NFS root and imports several mount
points via NFS. In fact this machine doesn't have any hard disk and does
everything over NFS. (This way, with passive water cooling and a passive
power supply, I enjoy an absolutely silent desktop.)

I have another machine (VIA based) with kernel 2.6.12 that is NFS
root the same way but doesn't have the ntpd problem.

--
Jean-Christian de Rivaz

2005-11-03 21:41:31

by john stultz

Subject: Re: NTP broken with 2.6.14

On Thu, 2005-11-03 at 22:12 +0100, Jean-Christian de Rivaz wrote:
> A have tested 7 differents vanilla kernel on the same suspect hardware:
>
> 2.6.8 : ntpd working : drift from -77ppm to -144ppm
> 2.6.9 : ntpd working : drift from -99ppm to -231ppm
> 2.6.10 : ntpd failed : drift from -37825ppm to -29912ppm
> 2.6.12 : ntpd failed : drift from -43429ppm to -45251ppm

Ok, that makes it pretty clear we have a regression w/ 2.6.10. I really
appreciate your helping narrow down this issue. If you have the time,
could you test the three 2.6.10-rcX patches?

You can find them here:
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/testing/

And they apply independently (not cumulatively) on top of 2.6.9.

thanks
-john



2005-11-03 22:10:38

by Jean-Christian de Rivaz

Subject: Re: NTP broken with 2.6.14

john stultz wrote:
> On Thu, 2005-11-03 at 22:12 +0100, Jean-Christian de Rivaz wrote:
>
>>A have tested 7 differents vanilla kernel on the same suspect hardware:
>>
>> 2.6.8 : ntpd working : drift from -77ppm to -144ppm
>> 2.6.9 : ntpd working : drift from -99ppm to -231ppm
>> 2.6.10 : ntpd failed : drift from -37825ppm to -29912ppm
>> 2.6.12 : ntpd failed : drift from -43429ppm to -45251ppm
>
>
> Ok, that makes it pretty clear we have a regression w/ 2.6.10. I really
> appreciate your helping narrow down this issue. If you have the time,
> could you test the three 2.6.10-rcX patches?
>
> You can find them here:
> ftp://ftp.kernel.org/pub/linux/kernel/v2.6/testing/
>
> And they apply independently (not cumulatively) ontop of 2.6.9

I will try, but compiling the kernels takes time even with 3 machines
(one per kernel version)...


I compared the dmesg logs of the different kernels, but since I don't
know what I should look for it's a little difficult. There are many
differences between the kernels. Despite that, I noticed this difference
between kernel 2.6.9 (ntpd working) and kernel 2.6.10 (ntpd failed):

--- linux-2.6.9.txt 2005-11-03 22:49:29.000000000 +0100
+++ linux-2.6.10.txt 2005-11-03 22:48:41.000000000 +0100
[...snip...]
@@ -67,16 +68,12 @@
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
ENABLING IO-APIC IRQs
- vector=0x31 pin1=2 pin2=-1
- 8254 timer not connected to IO-APIC
- ...trying to set up timer (IRQ0) through the 8259A ... failed.
- ...trying to set up timer as Virtual Wire IRQ... failed.
- ...trying to set up timer as ExtINT IRQ... works.
+ vector=0x31 pin1=0 pin2=-1
Registered protocol family 16
PCI BIOS revision 2.10 entry at 0xfbbb0, last bus=3
Using configuration type 1
[..snip...]

Maybe this is a lead worth following?

--
Jean-Christian de Rivaz

2005-11-03 22:54:46

by john stultz

Subject: Re: NTP broken with 2.6.14

On Thu, 2005-11-03 at 23:10 +0100, Jean-Christian de Rivaz wrote:
> john stultz wrote:
> > On Thu, 2005-11-03 at 22:12 +0100, Jean-Christian de Rivaz wrote:
> >
> >>A have tested 7 differents vanilla kernel on the same suspect hardware:
> >>
> >> 2.6.8 : ntpd working : drift from -77ppm to -144ppm
> >> 2.6.9 : ntpd working : drift from -99ppm to -231ppm
> >> 2.6.10 : ntpd failed : drift from -37825ppm to -29912ppm
> >> 2.6.12 : ntpd failed : drift from -43429ppm to -45251ppm
> >
> >
> > Ok, that makes it pretty clear we have a regression w/ 2.6.10. I really
> > appreciate your helping narrow down this issue. If you have the time,
> > could you test the three 2.6.10-rcX patches?
> >
> > You can find them here:
> > ftp://ftp.kernel.org/pub/linux/kernel/v2.6/testing/
> >
> > And they apply independently (not cumulatively) ontop of 2.6.9
>
> I will try, but compiling the kernels take time even with 3 machines
> (one per kernel version)...
>
>
> I compared the dmesg log of the different kernel, but since I don't know
> what I should find it's a little difficult. There is many differences
> between each kernels. Despit that, I noticed this difference between the
> kernel 2.6.9 (ntps working) and the kernel 2.6.10 (ntpd failed):
>
> --- linux-2.6.9.txt 2005-11-03 22:49:29.000000000 +0100
> +++ linux-2.6.10.txt 2005-11-03 22:48:41.000000000 +0100
> [...snip...]
> @@ -67,16 +68,12 @@
> Enabling unmasked SIMD FPU exception support... done.
> Checking 'hlt' instruction... OK.
> ENABLING IO-APIC IRQs
> - vector=0x31 pin1=2 pin2=-1
> - 8254 timer not connected to IO-APIC
> - ...trying to set up timer (IRQ0) through the 8259A ... failed.
> - ...trying to set up timer as Virtual Wire IRQ... failed.
> - ...trying to set up timer as ExtINT IRQ... works.
> + vector=0x31 pin1=0 pin2=-1
> Registered protocol family 16
> PCI BIOS revision 2.10 entry at 0xfbbb0, last bus=3
> Using configuration type 1
> [..snip...]
>
> Maybe a way to go ?


You might check booting w/ noapic to see if that changes the behaviour
in 2.6.10.

I know there were some pretty troubling issues w/ the nforce2 early in
the 2.6 cycle. See
http://atlas.et.tudelft.nl/verwei90/nforce2/index.html for some details.


Maciej: I noticed you had been involved with earlier nforce2 issues.
Does the above change in the ioapic pin1 value look familiar?


Digging around in the bkcvs git web between 2.6.9 and 2.6.10, I found
the following ioapic related changes:

Randy:
http://kernel.org/git/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commitdiff;h=0b517c442f66f9b1e280ca49d4b215cc3681d4e5;hp=60a7a584ad5a266afa5d7fde5f2828894e615c17

Len:
http://kernel.org/git/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commitdiff;h=eb3f18413cb759662b34230674fb6f07c9e16e56;hp=e87e2e7669129dc0e8b2959c656650d7ea5c066f

Any clues?


Jean-Christian: Since ACPI is involved, have you verified that you're
running the current BIOS for your system?

thanks
-john


2005-11-04 00:15:33

by Jean-Christian de Rivaz

Subject: Re: NTP broken with 2.6.14

john stultz wrote:
> You might check booting w/ noapic to see if that changes the behaviour
> in 2.6.10.

Yes! With a vanilla 2.6.10, the noapic option solves the problem and
ntpd is happy.


> Jean-Christian: Since it ACPI is involved, have you verified that you're
> running the current BIOS for your system?

More fun now: it looks like the BIOS actually used on this mainboard is
not designed for it, but for another board!!!

The board is exactly this one, the "K7N2 Delta-L":
http://www.msi.com.tw/program/support/download/dld/spt_dld_detail.php?UID=436&kind=1
And according to MSI it must use BIOS version 5.9. But when I enter
the BIOS setup the version info says "W6570MS V7.4 081203".

Here is the BIOS version history:
http://www.msi.com.tw/program/support/bios/bos/spt_bos_detail.php?UID=436&kind=1
The version 7.4 dated 2003-8-12 has a special note:

1. Only for K7N2 Delta-ILSR
2. This BIOS cannot be used on K7N2 Delta-L

Crazy. I have used this board without any issue for around two years and
only found the first problem when upgrading to kernel 2.6.14!


At least the situation is clearer now.
--
Jean-Christian de Rivaz

2005-11-04 00:40:50

by john stultz

Subject: Re: NTP broken with 2.6.14

On Fri, 2005-11-04 at 01:15 +0100, Jean-Christian de Rivaz wrote:
> john stultz wrote:
> > You might check booting w/ noapic to see if that changes the behaviour
> > in 2.6.10.
>
> Yes! With a vanilla 2.6.10, the noapic solve the problem and ntpd is happy.

Great. Glad you have a workaround now.

> > Jean-Christian: Since it ACPI is involved, have you verified that you're
> > running the current BIOS for your system?
>
> More fun now: it look like the BIOS actually used on this mainboard is
> not designed for it, but for an other board!!!
>
> The board is exactly this one "K7N2 Delta-L":
> http://www.msi.com.tw/program/support/download/dld/spt_dld_detail.php?UID=436&kind=1
> And according to MSI it must use a BIOS version 5.9. But when I enter
> into the BIOS setup the version info say "W6570MS V7.4 081203".
>
> Here is the BIOS version history:
> http://www.msi.com.tw/program/support/bios/bos/spt_bos_detail.php?UID=436&kind=1
> The version 7.4 dated 2003-8-12 has a special note:
>
> 1. Only for K7N2 Delta-ILSR
> 2. This BIOS cannot be used on K7N2 Delta-L
>
> Crasy. I use this board without any issue since around two years and
> only found the first problem when upgrading to the kernel 2.6.14!

Heh. Yea, I'm amazed you were able to flash it and still have the system
boot. I'd suggest making sure you have the proper and current BIOS, as
it's *very* difficult for folks to help debug problems on unofficial or
unsupported hardware/firmware configs.

I believe ioapic support should function correctly regardless (or be
blacklisted), and with others reporting problems, it might not be just
this BIOS issue. So after you get your BIOS sorted out, please let me
know if the problem still persists.

thanks
-john


2005-11-04 02:50:13

by Jean-Christian de Rivaz

Subject: Re: NTP broken with 2.6.14

I have tested the 2.6.10-rc* kernels too. Here is the full list:

2.6.8 : ntpd working : low drift
2.6.9 : ntpd working : low drift
2.6.10-rc1 : ntpd working : low drift
2.6.10-rc2 : ntpd working : low drift
2.6.10-rc3 : ntpd failed : high drift
2.6.10 : ntpd failed : high drift
2.6.12 : ntpd failed : high drift
2.6.14 : ntpd failed : high drift

The picture is very clear: the problem starts with the 2.6.10-rc3 kernel.
All the kernels on which ntpd fails can be fixed with the "noapic"
option, which makes ntpd work again.
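
For anyone repeating this kind of search on more versions, here is a
minimal Python sketch of a binary search for the first failing kernel.
It is only an illustration: the list literal restates the results above,
the function name first_bad is made up, and it assumes that every
version before the regression works and every version from it onward
fails.

```python
# (version, ntpd_works) pairs in release order, restating the list above.
RESULTS = [
    ("2.6.8", True),
    ("2.6.9", True),
    ("2.6.10-rc1", True),
    ("2.6.10-rc2", True),
    ("2.6.10-rc3", False),
    ("2.6.10", False),
    ("2.6.12", False),
    ("2.6.14", False),
]


def first_bad(results):
    """Return the first version whose test failed, or None if all pass.

    Assumes a single switch from working to failing along the list.
    """
    lo, hi = 0, len(results)
    while lo < hi:
        mid = (lo + hi) // 2
        if results[mid][1]:
            lo = mid + 1  # still working: the regression is later
        else:
            hi = mid      # failing: the regression is here or earlier
    return results[lo][0] if lo < len(results) else None


print(first_bad(RESULTS))
```

In a real search each lookup would be a kernel build and a drift
measurement, so halving the range at each step saves a lot of compile
time.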

The logs of kernels 2.6.8 to 2.6.10-rc2 say this:

kernel: ENABLING IO-APIC IRQs
kernel: ..TIMER: vector=0x31 pin1=2 pin2=-1
kernel: ..MP-BIOS bug: 8254 timer not connected to IO-APIC
kernel: ...trying to set up timer (IRQ0) through the 8259A ... failed.
kernel: ...trying to set up timer as Virtual Wire IRQ... failed.
kernel: ...trying to set up timer as ExtINT IRQ... works.

The logs of kernels 2.6.10-rc3 to 2.6.14 say this:

kernel: ENABLING IO-APIC IRQs
kernel: ..TIMER: vector=0x31 pin1=0 pin2=-1

I don't understand whether the problem is pin1 changing from 2 to 0, or
that the code working around the "MP-BIOS bug" is no longer executed
(maybe because it is not in the kernel anymore; I have not verified).
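
To compare this line across many saved logs, a small Python sketch like
the following could extract the values. The regular expression is
written against the two log excerpts above only; the function name
parse_timer_line is made up, and other kernels may format the line
differently.

```python
import re

# Pattern derived from the "..TIMER:" lines quoted above.
TIMER_RE = re.compile(
    r"\.\.TIMER: vector=(0x[0-9a-fA-F]+) pin1=(-?\d+) pin2=(-?\d+)"
)


def parse_timer_line(line):
    """Return vector, pin1 and pin2 as integers, or None if no match."""
    match = TIMER_RE.search(line)
    if match is None:
        return None
    return {
        "vector": int(match.group(1), 16),
        "pin1": int(match.group(2)),
        "pin2": int(match.group(3)),
    }


print(parse_timer_line("kernel: ..TIMER: vector=0x31 pin1=2 pin2=-1"))
print(parse_timer_line("kernel: ..TIMER: vector=0x31 pin1=0 pin2=-1"))
```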

>>More fun now: it look like the BIOS actually used on this mainboard is
>>not designed for it, but for an other board!!!
>>
>>The board is exactly this one "K7N2 Delta-L":
>>http://www.msi.com.tw/program/support/download/dld/spt_dld_detail.php?UID=436&kind=1
>>And according to MSI it must use a BIOS version 5.9. But when I enter
>>into the BIOS setup the version info say "W6570MS V7.4 081203".
>>
>>Here is the BIOS version history:
>>http://www.msi.com.tw/program/support/bios/bos/spt_bos_detail.php?UID=436&kind=1
>>The version 7.4 dated 2003-8-12 has a special note:
>>
>>1. Only for K7N2 Delta-ILSR
>>2. This BIOS cannot be used on K7N2 Delta-L
>>
>>Crasy. I use this board without any issue since around two years and
>>only found the first problem when upgrading to the kernel 2.6.14!
>
>
> Heh. Yea, I'm amazed you were able to flash it and still have the system
> boot. I'd suggest making sure you have the proper and current BIOS, as
> its *very* difficult for folks to help debug problems on unofficial or
> unsupported hardware/firmware configs.
>
> I believe ioapic support should function correctly regardless (or be
> blacklisted and with others reporting problems, its might not be just
> this BIOS issue), so after you get your BIOS sorted out, please let me
> know if the problem still persists.

After trying several times, I am unable to upgrade the BIOS of this
machine. The flash utility hangs the whole system at the very beginning
of the actual access to program the flash! This may be because I use a
FreeDOS image over pxelinux. I will try with a floppy and MS-DOS if I
can find such old stuff somewhere.

Thanks a lot for the support,
--
Jean-Christian de Rivaz

2005-11-04 03:46:03

by Brown, Len

Subject: RE: NTP broken with 2.6.14

NFORCE2 on an ACPI-enabled kernel should automatically invoke
the acpi_skip_timer_override BIOS workaround -- as
the NFORCE family of chip-sets have the timer interrupt
attached to pin-0, but some of them shipped with
a bogus BIOS over-ride telling Linux the timer is on pin-2.

This issue is quite old -- google NFORCE2 and acpi_skip_timer_override.
IIRC there are whole web-sites with NFORCE2
workarounds provided by its dedicated fans...

-Len

2005-11-04 04:07:19

by john stultz

Subject: RE: NTP broken with 2.6.14

On Thu, 2005-11-03 at 22:44 -0500, Brown, Len wrote:
> NFORCE2 on an ACPI-enabled kernel should automatically invoke
> the acpi_skip_timer_override BIOS workaround -- as
> the NFORCE family of chip-sets have the timer interrupt
> attached to pin-0, but some of them shipped with
> a bogus BIOS over-ride telling Linux the timer is on pin-2.
>
> This issue is quite old -- google NFORCE2 and acpi_skip_timer_override.
> IIR there are whole web-sites with NFORCE2
> workarounds provided by its dedicated fans...

Thanks for the info, Len. Although it's odd that Jean-Christian's
issue appears to show up around the time the fix you mention shows up.

Regardless, Jean-Christian has some severe BIOS problems, so until those
are resolved, I suggest he use the workaround (noapic) and ping us if
the issue persists once he arrives at a supportable configuration.

thanks
-john

2005-11-04 12:40:44

by Maciej W. Rozycki

Subject: Re: NTP broken with 2.6.14

On Thu, 3 Nov 2005, john stultz wrote:

> > I compared the dmesg log of the different kernel, but since I don't know
> > what I should find it's a little difficult. There is many differences
> > between each kernels. Despit that, I noticed this difference between the
> > kernel 2.6.9 (ntps working) and the kernel 2.6.10 (ntpd failed):
> >
> > --- linux-2.6.9.txt 2005-11-03 22:49:29.000000000 +0100
> > +++ linux-2.6.10.txt 2005-11-03 22:48:41.000000000 +0100
> > [...snip...]
> > @@ -67,16 +68,12 @@
> > Enabling unmasked SIMD FPU exception support... done.
> > Checking 'hlt' instruction... OK.
> > ENABLING IO-APIC IRQs
> > - vector=0x31 pin1=2 pin2=-1
> > - 8254 timer not connected to IO-APIC
> > - ...trying to set up timer (IRQ0) through the 8259A ... failed.
> > - ...trying to set up timer as Virtual Wire IRQ... failed.
> > - ...trying to set up timer as ExtINT IRQ... works.
> > + vector=0x31 pin1=0 pin2=-1
> > Registered protocol family 16
> > PCI BIOS revision 2.10 entry at 0xfbbb0, last bus=3
> > Using configuration type 1
> > [..snip...]
> >
> > Maybe a way to go ?
>
>
> You might check booting w/ noapic to see if that changes the behaviour
> in 2.6.10.
>
> I know there were some pretty troubling issues w/ the nforce2 early in
> the 2.6 cycle. See
> http://atlas.et.tudelft.nl/verwei90/nforce2/index.html for some details.
>
>
> Maciej: I noticed you had been involved with earlier nforce2 issues.
> Does the above change in the ioapic pin1 value look familiar?

Oh, absolutely -- the timer interrupt line of the nForce2 chipset is
known to suffer from glitches under certain circumstances. As APIC inputs
are truly edge-triggered, if thus configured, unlike ones of 8259A chips
which ignore such interrupts if not handled before deassertion, all
glitches are actually handled as real interrupts leading to a significant
time drift. I've thought nVidia had a workaround for that -- cc-ing their
contact; not sure if still active. I've had a brief look at my archives
and the suggestion was to disable the "Spread Spectrum" option if
available in the firmware setup (not sure what to do if there is none).
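
As a back-of-the-envelope illustration of how few glitches it would
take, here is a small Python sketch. It rests on a loud assumption that
is not established anywhere in this thread: that each glitch is counted
as exactly one extra timer tick and that all of the measured drift comes
from such glitches. The function name is made up; the sample figures are
the 2.6.14 drift values reported earlier in the thread.

```python
def extra_ticks_per_second(drift_ppm, hz):
    """Spurious ticks per second that would explain a given drift.

    Assumes (hypothetically) one extra tick per glitch and no other
    source of drift.
    """
    return hz * abs(drift_ppm) * 1e-6


# (CONFIG_HZ, reported drift in ppm) pairs from the 2.6.14 tests.
for hz, ppm in ((100, -7598), (250, -12538), (1000, -19543)):
    rate = extra_ticks_per_second(ppm, hz)
    print(f"HZ={hz}: roughly {rate:.2f} extra ticks per second")
```

Under that assumption, a handful of spurious interrupts per second is
already enough to push the clock far outside what ntpd can correct.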

Maciej

2005-11-04 13:51:44

by Lennart Sorensen

Subject: Re: NTP broken with 2.6.14

On Thu, Nov 03, 2005 at 08:07:13PM -0800, john stultz wrote:
> Thanks for the info, Len. Although its odd that the Jean-Christian's
> issue appears to show up around the time the fix you mention shows up.
>
> Regardless, Jean-Chistian has some sever BIOS problems, so until those
> are resolved, I suggest he use the workaround (noapic) and ping us if
> the issue persists once he arrives at a supportable configuration.

Well as an update, running 2.6.14 in the last 17 hours, my system gained
26 minutes. That seems to average gaining 1s every 44s. So while
better than 2.6.12 by a lot, there is certainly still something odd with
the nforce2 handling.
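
For comparison with the ppm figures used elsewhere in the thread, here
is a small Python sketch of the arithmetic. It takes the rounded figures
above (26 minutes gained over 17 hours) at face value, which is an
assumption since both are presumably approximate; the function name
gain_rate is made up for this example.

```python
def gain_rate(gained_seconds, elapsed_seconds):
    """Return (drift in ppm, elapsed seconds per second gained)."""
    ppm = gained_seconds / elapsed_seconds * 1e6
    interval = elapsed_seconds / gained_seconds
    return ppm, interval


ppm, interval = gain_rate(26 * 60, 17 * 3600)
print(f"about {ppm:.0f} ppm, or 1 s gained every {interval:.1f} s")
```

With these rounded inputs the interval comes out somewhat under the
44 s estimate but of the same order, and the drift is tens of thousands
of ppm, far beyond the +/-500ppm that ntpd can handle.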

I think I need to go grab that python script and run it to see how it is
gaining since it doesn't really appear to be at a constant rate.

Maybe I should check if my bios is the latest version.

And I could try with the apic override options.

I will try a few more things to see what is happening.

Len Sorensen

2005-11-04 16:40:06

by Jean-Christian de Rivaz

Subject: Re: NTP broken with 2.6.14

john stultz wrote:
> On Thu, 2005-11-03 at 22:44 -0500, Brown, Len wrote:
>
>>NFORCE2 on an ACPI-enabled kernel should automatically invoke
>>the acpi_skip_timer_override BIOS workaround -- as
>>the NFORCE family of chip-sets have the timer interrupt
>>attached to pin-0, but some of them shipped with
>>a bogus BIOS over-ride telling Linux the timer is on pin-2.
>>
>>This issue is quite old -- google NFORCE2 and acpi_skip_timer_override.
>>IIR there are whole web-sites with NFORCE2
>>workarounds provided by its dedicated fans...
>
>
> Thanks for the info, Len. Although its odd that the Jean-Christian's
> issue appears to show up around the time the fix you mention shows up.
>
> Regardless, Jean-Chistian has some sever BIOS problems, so until those
> are resolved, I suggest he use the workaround (noapic) and ping us if
> the issue persists once he arrives at a supportable configuration.

Well, I finally got evidence that my mainboard is a "K7N2 Delta-ILSR",
not a "K7N2 Delta-L" (from the shipping package, the invoice, and online
reviews of the two motherboards). So BIOS version 7.4 is a valid one.
I updated the BIOS to version 7.8. Now the drift is low and ntpd is
happy (me too), all without the "noapic" option.

Strangely, the kernel log is almost the same except for this little
difference:

--- kernel 2.6.14 BIOS V7.4
+++ kernel 2.6.14 BIOS V7.8
talla kernel: Normal zone: 225280 pages, LIFO batch:31
talla kernel: HighMem zone: 294896 pages, LIFO batch:31
talla kernel: DMI 2.3 present.
-talla kernel: ACPI: RSDP (v000 Nvidia ) @ 0x000f73b0
+talla kernel: ACPI: RSDP (v000 Nvidia ) @ 0x000f73d0
talla kernel: ACPI: RSDT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x7fff3000
talla kernel: ACPI: FADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x7fff3040
-talla kernel: ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x7fff7780
+talla kernel: ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x7fff77c0
talla kernel: ACPI: DSDT (v001 NVIDIA AWRDACPI 0x00001000 MSFT 0x0100000e) @ 0x00000000
talla kernel: ACPI: PM-Timer IO Port: 0x4008
talla kernel: ACPI: Local APIC address 0xfee00000

The following message is still here:

talla kernel: ENABLING IO-APIC IRQs
talla kernel: ..TIMER: vector=0x31 pin1=0 pin2=-1

But everything works fine with the latest BIOS.

For me the issue is solved. Many thanks to everyone who helped in this
thread, with a special mention for John Stultz. I hope that others who
have the same problem can solve it the same way.

Best regards,
--
Jean-Christian de Rivaz

2005-11-04 17:41:08

by Jean-Christian de Rivaz

Subject: Re: NTP broken with 2.6.14

Jean-Christian de Rivaz wrote:

> Well, I finally get evidence that my mainboard is a "K7N2 Delta-ILSR"
> not a "K7N2 Delta-L" (from the shipping package, the invoice, and online
> review of the two motherboards). So the BIOS version 7.4 is a valid one.
> I updated the BIOS to the version 7.8. Now the drift is low and ntpd
> happy (me too), all that without the "noapic" option.
>
> Strange is that the kernel log is almost the same but this little
> difference:
>
> --- kernel 2.6.14 BIOS V7.4
> +++ kernel 2.6.14 BIOS V7.8
> talla kernel: Normal zone: 225280 pages, LIFO batch:31
> talla kernel: HighMem zone: 294896 pages, LIFO batch:31
> talla kernel: DMI 2.3 present.
> -talla kernel: ACPI: RSDP (v000 Nvidia )
> @ 0x000f73b0
> +talla kernel: ACPI: RSDP (v000 Nvidia )
> @ 0x000f73d0
> talla kernel: ACPI: RSDT (v001 Nvidia AWRDACPI 0x42302e31 AWRD
> 0x00000000) @ 0x7fff3000
> talla kernel: ACPI: FADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD
> 0x00000000) @ 0x7fff3040
> -talla kernel: ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD
> 0x00000000) @ 0x7fff7780
> +talla kernel: ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD
> 0x00000000) @ 0x7fff77c0
> talla kernel: ACPI: DSDT (v001 NVIDIA AWRDACPI 0x00001000 MSFT
> 0x0100000e) @ 0x00000000
> talla kernel: ACPI: PM-Timer IO Port: 0x4008
> talla kernel: ACPI: Local APIC address 0xfee00000
>
> The following message is still here:
>
> talla kernel: ENABLING IO-APIC IRQs
> talla kernel: ..TIMER: vector=0x31 pin1=0 pin2=-1
>
> But all work fine with the latest BIOS.
>
> For me the issue is solved. A lot of thanks to all peoples that helped
> to this thread with special mention for John Stultz. I hope that the
> others that have the same problem can solve it the same way.

Just to show how well 2.6.14 behaves with the latest BIOS, here is the
drift according to John Stultz's python script:

04 Nov 17:45:23 offset: 1.7e-05 drift: 3.0 ppm
04 Nov 17:46:24 offset: 2e-05 drift: 0.0967741935484 ppm
04 Nov 17:47:24 offset: 2.7e-05 drift: 0.106557377049 ppm
04 Nov 17:48:24 offset: 4.8e-05 drift: 0.186813186813 ppm
04 Nov 17:49:24 offset: 6.5e-05 drift: 0.210743801653 ppm
04 Nov 17:50:24 offset: 9.4e-05 drift: 0.264900662252 ppm
04 Nov 17:51:24 offset: 0.000122 drift: 0.298342541436 ppm
04 Nov 17:52:24 offset: 0.000168 drift: 0.364928909953 ppm
04 Nov 17:53:25 offset: 0.0002 drift: 0.385093167702 ppm
04 Nov 17:54:25 offset: 0.000266 drift: 0.46408839779 ppm
04 Nov 17:55:25 offset: 0.000305 drift: 0.482587064677 ppm
04 Nov 17:56:25 offset: 0.000356 drift: 0.515837104072 ppm
04 Nov 17:57:25 offset: 0.000415 drift: 0.554633471646 ppm
04 Nov 17:58:25 offset: 0.000475 drift: 0.588761174968 ppm
04 Nov 17:59:25 offset: 0.000555 drift: 0.641755634638 ppm
04 Nov 18:00:25 offset: 0.000624 drift: 0.675526024363 ppm
04 Nov 18:01:25 offset: 0.000711 drift: 0.723779854621 ppm
04 Nov 18:02:26 offset: 0.00079 drift: 0.7578125 ppm
04 Nov 18:03:26 offset: 0.000868 drift: 0.787822878229 ppm
04 Nov 18:04:26 offset: 0.000954 drift: 0.821678321678 ppm
04 Nov 18:05:26 offset: 0.001029 drift: 0.843023255814 ppm
04 Nov 18:06:26 offset: 0.001114 drift: 0.870253164557 ppm
04 Nov 18:07:26 offset: 0.001196 drift: 0.892749244713 ppm
04 Nov 18:08:26 offset: 0.001274 drift: 0.910404624277 ppm
04 Nov 18:09:26 offset: 0.00136 drift: 0.932132963989 ppm
04 Nov 18:10:26 offset: 0.001445 drift: 0.951462765957 ppm
04 Nov 18:11:27 offset: 0.001537 drift: 0.973162939297 ppm
04 Nov 18:12:27 offset: 0.001627 drift: 0.992615384615 ppm

It's very very low, far lower than with the 2.6.9 kernel and the old
BIOS. MSI has obviously fixed a timer issue in the BIOS, but this is not
shown in the BIOS history.
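
The script itself is not reproduced in this part of the thread, so the
following Python sketch is only a guess at the arithmetic behind the
"drift" column. The log above is consistent with dividing the change in
offset since the start of the run by the elapsed time, given a starting
offset of about 1.4e-05 s taken one second before the first line. That
starting value is inferred from the numbers, not shown in the log, and
the function name drift_ppm is made up.

```python
def drift_ppm(start_offset, offset, elapsed_seconds):
    """Cumulative drift in ppm since the start of the measurement."""
    return (offset - start_offset) / elapsed_seconds * 1e6


START_OFFSET = 1.4e-05  # inferred, not present in the log above

# (elapsed seconds, offset) for the first three log lines.
for elapsed, offset in ((1, 1.7e-05), (62, 2e-05), (122, 2.7e-05)):
    print(f"offset: {offset} drift: {drift_ppm(START_OFFSET, offset, elapsed):.6f} ppm")
```

If the real script works this way, the drift column is an average over
the whole run, which would explain why it settles slowly rather than
jumping from sample to sample.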

--
Jean-Christian de Rivaz

2005-11-06 22:49:34

by Hans-Peter Jansen

Subject: Re: NTP broken with 2.6.14

Am Freitag, 4. November 2005 03:50 schrieb Jean-Christian de Rivaz:
>
> After trying several time, I am unable to upgrade the BIOS of this
> machine. The flash utility hang all the system at the very beginning
> of the real access to programm the flash! This is maybe because I use
> a freedos image over pxelinux. I will try with a floppy and a MSDOS
> if I found such olds stuffs somehere.

Could very well be the netboot stuff. I typically flash BIOS/firmware
via DOS network boot images, which provide at least two different ways
of disk emulation: a: and c:, but some flash tools just freeze the
system on load/image load either way. The most prominent example is the
Promise TX2/100 firmware update, but a couple of motherboard BIOS
flashers also behave that way (cannot remember which ones, though).

Pete

2005-11-07 21:44:48

by Jean-Christian de Rivaz

Subject: Re: NTP broken with 2.6.14

Hans-Peter Jansen wrote:
> Am Freitag, 4. November 2005 03:50 schrieb Jean-Christian de Rivaz:
>
>>After trying several time, I am unable to upgrade the BIOS of this
>>machine. The flash utility hang all the system at the very beginning
>>of the real access to programm the flash! This is maybe because I use
>>a freedos image over pxelinux. I will try with a floppy and a MSDOS
>>if I found such olds stuffs somehere.
>
>
> Could very well be the netboot stuff. I typically flash BIOS/firmware
> via DOS network boot images, which provides at least two different ways
> of disk emulation: a: and c:, but some flash tools just freeze the
> system on load/image load in both ways. Most prominently is the Promise
> TX2/100 firmware update, but also a couple of motherboards BIOS'
> flashers behave that way (cannot remember which ones, though).

As you can see in my two latest posts, I finally found that this was not
the right BIOS for the motherboard. Once I understood my mistake, I took
the right BIOS version and was able to flash it with FreeDOS over PXE
network boot. So the netboot stuff was not the problem in my case.

The funny part is that with my motherboard, if you try to flash the
wrong BIOS version you don't get any clear message against this
operation. Instead it ends randomly in one of the following ways:
1) system simply hangs.
2) "division by zero error".
3) "failed opcode <some hexadecimal bytes>".
4) immediate reboot.

I think that BIOS updating is an area where mainboard makers have
huge room to improve their products...

--
Jean-Christian de Rivaz