2009-03-02 09:54:06

by Jesper Krogh

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

john stultz wrote:
> Ok, so it seems ntp hasn't really had a chance to settle down, its only
> made a 10ppm adjustment so far. NTPd will stop corrections at ~
> +/-500ppm, so you're not at that bound yet, where things would be really
> broken.
>
> If the affected kernel isn't resetting in the logs anymore, I'd be
> interested in what the new ppm value is.

After 20 hours.. its still resetting.
Mar 2 10:43:24 quad12 ntpd[4416]: synchronized to 10.194.133.12, stratum 4
Mar 2 10:50:37 quad12 ntpd[4416]: time reset -1.103654 s
jk@quad12:~$ uptime
10:51:36 up 20:46, 1 user, load average: 0.00, 0.00, 0.00

And it hasn't shifted clocksource either.

jk@quad12:~$ cat
/sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

--
Jesper


2009-03-02 21:28:20

by john stultz

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

On Mon, 2009-03-02 at 10:53 +0100, Jesper Krogh wrote:
> john stultz wrote:
> > Ok, so it seems ntp hasn't really had a chance to settle down, its only
> > made a 10ppm adjustment so far. NTPd will stop corrections at ~
> > +/-500ppm, so you're not at that bound yet, where things would be really
> > broken.
> >
> > If the affected kernel isn't resetting in the logs anymore, I'd be
> > interested in what the new ppm value is.
>
> After 20 hours.. its still resetting.
> Mar 2 10:43:24 quad12 ntpd[4416]: synchronized to 10.194.133.12, stratum 4
> Mar 2 10:50:37 quad12 ntpd[4416]: time reset -1.103654 s

So what's the "ntpdc -c kerninfo" output now?

thanks
-john

2009-03-03 06:04:24

by Jesper Krogh

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

john stultz wrote:
> On Mon, 2009-03-02 at 10:53 +0100, Jesper Krogh wrote:
>> john stultz wrote:
>>> Ok, so it seems ntp hasn't really had a chance to settle down, its only
>>> made a 10ppm adjustment so far. NTPd will stop corrections at ~
>>> +/-500ppm, so you're not at that bound yet, where things would be really
>>> broken.
>>>
>>> If the affected kernel isn't resetting in the logs anymore, I'd be
>>> interested in what the new ppm value is.
>> After 20 hours.. its still resetting.
>> Mar 2 10:43:24 quad12 ntpd[4416]: synchronized to 10.194.133.12, stratum 4
>> Mar 2 10:50:37 quad12 ntpd[4416]: time reset -1.103654 s
>
> So what's the "ntpdc -c kerninfo" output now?

Mar 3 06:41:10 quad12 ntpd[4416]: time reset -0.813957 s
Mar 3 06:45:20 quad12 ntpd[4416]: synchronized to LOCAL(0), stratum 13
Mar 3 06:45:36 quad12 ntpd[4416]: synchronized to 10.194.133.12, stratum 4
Mar 3 06:51:57 quad12 ntpd[4416]: synchronized to 10.194.133.13, stratum 4
Mar 3 07:00:29 quad12 ntpd[4416]: time reset -0.783390 s
jk@quad12:~$ ntpdc -c kerninfo
pll offset: 0 s
pll frequency: -28.691 ppm
maximum error: 1.0433 s
estimated error: 0 s
status: 0001 pll
pll time constant: 4
precision: 1e-06 s
frequency tolerance: 500 ppm
jk@quad12:~$ w
07:03:17 up 1 day, 16:59, 1 user, load average: 0.00, 0.00, 0.00

2009-03-03 20:01:57

by john stultz

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

On Tue, 2009-03-03 at 07:04 +0100, Jesper Krogh wrote:
> john stultz wrote:
> > On Mon, 2009-03-02 at 10:53 +0100, Jesper Krogh wrote:
> >> john stultz wrote:
> >>> Ok, so it seems ntp hasn't really had a chance to settle down, its only
> >>> made a 10ppm adjustment so far. NTPd will stop corrections at ~
> >>> +/-500ppm, so you're not at that bound yet, where things would be really
> >>> broken.
> >>>
> >>> If the affected kernel isn't resetting in the logs anymore, I'd be
> >>> interested in what the new ppm value is.
> >> After 20 hours.. its still resetting.
> >> Mar 2 10:43:24 quad12 ntpd[4416]: synchronized to 10.194.133.12, stratum 4
> >> Mar 2 10:50:37 quad12 ntpd[4416]: time reset -1.103654 s
> >
> > So what's the "ntpdc -c kerninfo" output now?
>
> Mar 3 06:41:10 quad12 ntpd[4416]: time reset -0.813957 s
> Mar 3 06:45:20 quad12 ntpd[4416]: synchronized to LOCAL(0), stratum 13
> Mar 3 06:45:36 quad12 ntpd[4416]: synchronized to 10.194.133.12, stratum 4
> Mar 3 06:51:57 quad12 ntpd[4416]: synchronized to 10.194.133.13, stratum 4
> Mar 3 07:00:29 quad12 ntpd[4416]: time reset -0.783390 s
> jk@quad12:~$ ntpdc -c kerninfo
> pll offset: 0 s
> pll frequency: -28.691 ppm


This is baffling. You've only gone from -34.754ppm to -28.691ppm in over
a day? And you're still not syncing? If the calibration was so bad that
NTP couldn't sync, I'd expect the freq value to hit +/-500ppm before it
gave up. This just doesn't follow my expectations.

Could you provide:
/usr/sbin/ntpdc -c version

Do you see the same behavior if you drop all but one server (including
the local clock: 127.127.1.0)?

You might also add "minpoll 4 maxpoll 4" to the server line to speed up
testing.

Actually, if you could, I'd be interested if you could send your
ntp.conf

thanks
-john

2009-03-03 20:19:59

by Jesper Krogh

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

john stultz wrote:
> On Tue, 2009-03-03 at 07:04 +0100, Jesper Krogh wrote:
>> john stultz wrote:
>>> On Mon, 2009-03-02 at 10:53 +0100, Jesper Krogh wrote:
>>>> john stultz wrote:
>>>>> Ok, so it seems ntp hasn't really had a chance to settle down, its only
>>>>> made a 10ppm adjustment so far. NTPd will stop corrections at ~
>>>>> +/-500ppm, so you're not at that bound yet, where things would be really
>>>>> broken.
>>>>>
>>>>> If the affected kernel isn't resetting in the logs anymore, I'd be
>>>>> interested in what the new ppm value is.
>>>> After 20 hours.. its still resetting.
>>>> Mar 2 10:43:24 quad12 ntpd[4416]: synchronized to 10.194.133.12, stratum 4
>>>> Mar 2 10:50:37 quad12 ntpd[4416]: time reset -1.103654 s
>>> So what's the "ntpdc -c kerninfo" output now?
>> Mar 3 06:41:10 quad12 ntpd[4416]: time reset -0.813957 s
>> Mar 3 06:45:20 quad12 ntpd[4416]: synchronized to LOCAL(0), stratum 13
>> Mar 3 06:45:36 quad12 ntpd[4416]: synchronized to 10.194.133.12, stratum 4
>> Mar 3 06:51:57 quad12 ntpd[4416]: synchronized to 10.194.133.13, stratum 4
>> Mar 3 07:00:29 quad12 ntpd[4416]: time reset -0.783390 s
>> jk@quad12:~$ ntpdc -c kerninfo
>> pll offset: 0 s
>> pll frequency: -28.691 ppm
>
>
> This is baffling. You've only gone from -34.754ppm to -28.691ppm in over
> a day? And you're still not syncing? If the calibration was so bad that
> NTP couldn't sync, I'd expect the freq value to hit +/-500ppm before it
> gave up. This just doesn't follow my expectations.

It's resetting.. without deep knowledge about ntp, doesnt that mean
"start over again"? I believe it hits +/-500ppm

> Could you provide:
> /usr/sbin/ntpdc -c version

$ ntpdc -c version
ntpdc [email protected] Tue Jan 6 15:51:00 UTC 2009 (1)

> Do you see the same behavior if you drop all but one server (including
> the local clock: 127.127.1.0)?
>
> You might also add "minpoll 4 maxpoll 4" to the server line to speed up
> testing.

Will try those option while debugging.

> Actually, if you could, I'd be interested if you could send your
> ntp.conf

http://krogh.cc/~jesper/ntp.conf

But this seems to be a "regression". Since 2.6.27.19 doesn't misbehave.
Same NTP, same configuration, same hardware. only change is the kernel
version. Or am I missing some parameter here?

Would it make sense to try to bisect it?

Jesper

--
Jesper

2009-03-03 20:40:04

by Jesper Krogh

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

john stultz wrote:
> Do you see the same behavior if you drop all but one server (including
> the local clock: 127.127.1.0)?

Yes.
Mar 3 21:20:59 quad12 ntpd[2435]: ntpd [email protected] Tue Jan 6
15:50:55 UTC 2009 (1)
Mar 3 21:20:59 quad12 ntpd[2436]: precision = 1.000 usec
Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #0 wildcard,
0.0.0.0#123 Disabled
Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #1 wildcard,
::#123 Disabled
Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #2 lo, ::1#123
Enabled
Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #3 bond0,
fe80::21e:68ff:fe57:8169#123 Enabled
Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #4 lo,
127.0.0.1#123 Enabled
Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #5 bond0,
10.194.132.91#123 Enabled
Mar 3 21:20:59 quad12 ntpd[2436]: kernel time sync status 0040
Mar 3 21:20:59 quad12 ntpd[2436]: frequency initialized -29.286 PPM
from /var/lib/ntp/ntp.drift
Mar 3 21:21:58 quad12 ntpd[2436]: synchronized to 10.194.133.12, stratum 4
Mar 3 21:21:58 quad12 ntpd[2436]: time reset -6.148275 s
Mar 3 21:21:58 quad12 ntpd[2436]: kernel time sync status change 0001
Mar 3 21:25:01 quad12 ntpd[2436]: synchronized to 10.194.133.12, stratum 4
Mar 3 21:37:03 quad12 ntpd[2436]: time reset -0.664351 s

Only one server and the minpoll 4 maxpoll 4 options to the server line.

--
Jesper

2009-03-03 22:24:41

by john stultz

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

On Tue, 2009-03-03 at 21:39 +0100, Jesper Krogh wrote:
> john stultz wrote:
> > Do you see the same behavior if you drop all but one server (including
> > the local clock: 127.127.1.0)?
>
> Yes.
> Mar 3 21:20:59 quad12 ntpd[2435]: ntpd [email protected] Tue Jan 6
> 15:50:55 UTC 2009 (1)
> Mar 3 21:20:59 quad12 ntpd[2436]: precision = 1.000 usec
> Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #0 wildcard,
> 0.0.0.0#123 Disabled
> Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #1 wildcard,
> ::#123 Disabled
> Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #2 lo, ::1#123
> Enabled
> Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #3 bond0,
> fe80::21e:68ff:fe57:8169#123 Enabled
> Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #4 lo,
> 127.0.0.1#123 Enabled
> Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #5 bond0,
> 10.194.132.91#123 Enabled
> Mar 3 21:20:59 quad12 ntpd[2436]: kernel time sync status 0040
> Mar 3 21:20:59 quad12 ntpd[2436]: frequency initialized -29.286 PPM
> from /var/lib/ntp/ntp.drift
> Mar 3 21:21:58 quad12 ntpd[2436]: synchronized to 10.194.133.12, stratum 4
> Mar 3 21:21:58 quad12 ntpd[2436]: time reset -6.148275 s
> Mar 3 21:21:58 quad12 ntpd[2436]: kernel time sync status change 0001
> Mar 3 21:25:01 quad12 ntpd[2436]: synchronized to 10.194.133.12, stratum 4
> Mar 3 21:37:03 quad12 ntpd[2436]: time reset -0.664351 s
>
> Only one server and the minpoll 4 maxpoll 4 options to the server line.

Well, it may still need a few hours to settle. :) Again, those time
resets are seen when NTPd doesn't have a good drift ppm at startup, and
it has to find it.

thanks
-john

2009-03-03 22:25:00

by john stultz

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

On Tue, 2009-03-03 at 21:19 +0100, Jesper Krogh wrote:
> john stultz wrote:
> > On Tue, 2009-03-03 at 07:04 +0100, Jesper Krogh wrote:
> >> john stultz wrote:
> >>> On Mon, 2009-03-02 at 10:53 +0100, Jesper Krogh wrote:
> >>>> john stultz wrote:
> >>>>> Ok, so it seems ntp hasn't really had a chance to settle down, its only
> >>>>> made a 10ppm adjustment so far. NTPd will stop corrections at ~
> >>>>> +/-500ppm, so you're not at that bound yet, where things would be really
> >>>>> broken.
> >>>>>
> >>>>> If the affected kernel isn't resetting in the logs anymore, I'd be
> >>>>> interested in what the new ppm value is.
> >>>> After 20 hours.. its still resetting.
> >>>> Mar 2 10:43:24 quad12 ntpd[4416]: synchronized to 10.194.133.12, stratum 4
> >>>> Mar 2 10:50:37 quad12 ntpd[4416]: time reset -1.103654 s
> >>> So what's the "ntpdc -c kerninfo" output now?
> >> Mar 3 06:41:10 quad12 ntpd[4416]: time reset -0.813957 s
> >> Mar 3 06:45:20 quad12 ntpd[4416]: synchronized to LOCAL(0), stratum 13
> >> Mar 3 06:45:36 quad12 ntpd[4416]: synchronized to 10.194.133.12, stratum 4
> >> Mar 3 06:51:57 quad12 ntpd[4416]: synchronized to 10.194.133.13, stratum 4
> >> Mar 3 07:00:29 quad12 ntpd[4416]: time reset -0.783390 s
> >> jk@quad12:~$ ntpdc -c kerninfo
> >> pll offset: 0 s
> >> pll frequency: -28.691 ppm
> >
> >
> > This is baffling. You've only gone from -34.754ppm to -28.691ppm in over
> > a day? And you're still not syncing? If the calibration was so bad that
> > NTP couldn't sync, I'd expect the freq value to hit +/-500ppm before it
> > gave up. This just doesn't follow my expectations.
>
> It's resetting.. without deep knowledge about ntp, doesnt that mean
> "start over again"? I believe it hits +/-500ppm

No, the "time reset" message means that when the offset is larger
then .125sec (the slew boundary), NTPd has corrected it by calling
settimeofday instead of slewing the clock.

Here's some background about how NTP and the kernel interact:
Every time NTPd calls adjtimex(), its provides the current offset from
the tracked ntp server. The kernel takes this offset and applies a
temporary correction factor to the clocksource frequency to converge
that offset. It also takes the provided offset, dampens it, and then
uses the result to adjust the frequency value. Once the freq value hits
the max adjustment value (+/- 500ppm), then NTP will start throwing
error messages and give up.

The part that is so odd with your data, is that the freq value isn't
changing very much. After a time reset, I'd expect to see adjustments in
the 100us, then multiple ms, and only once we get above 100ms to see
another time reset. All the while, these adjustment values should be
tweaking the freq value, causing the clocks to converge.

The case I can think of that could cause this, is if the drift is
somehow jumping above the slew boundary before NTPd actually makes any
adjtimex calls, so we end up with minimal correction to the freq value,
but that still doesn't completely vibe with the data.


> > Could you provide:
> > /usr/sbin/ntpdc -c version
>
> $ ntpdc -c version
> ntpdc [email protected] Tue Jan 6 15:51:00 UTC 2009 (1)
>
> > Do you see the same behavior if you drop all but one server (including
> > the local clock: 127.127.1.0)?
> >
> > You might also add "minpoll 4 maxpoll 4" to the server line to speed up
> > testing.
>
> Will try those option while debugging.
>
> > Actually, if you could, I'd be interested if you could send your
> > ntp.conf
>
> http://krogh.cc/~jesper/ntp.conf

Cool, I see you're collecting stats already. Depending on the results of
the tests above I may want to check those out as well.

> But this seems to be a "regression". Since 2.6.27.19 doesn't misbehave.
> Same NTP, same configuration, same hardware. only change is the kernel
> version. Or am I missing some parameter here?
>
> Would it make sense to try to bisect it?

Well, I suspect you'll just bisect it to the fast-pit TSC calibration
causing a different correction freq to be needed for synchronization.
The odd part is that the userland NTPd isn't behaving as I'd expect if
the TSC calibration was really so bad that NTP couldn't handle it.

Bisection may be something worth trying just to verify or disprove that
theory, so if you have the time, it would be interesting to see. But if
the theory is true then we're back to the same spot.

I guess something to test my idea above (that the drift is bad enough
that NTPd isn't making slew adjustments via adjtimex offset) is to
remove NTPd from the init.d startup.

Then after rebooting (into 2.6.29), run the attached python script for
10 minutes or so to get an idea of the ppm drift. Then repeat with
2.6.26.

To run:
./drift-test.py <ntp server>

It will give some wild ppm numbers, but after a few minutes it should
settle down to the "natural drift" of the system.

thanks
-john


Attachments:
drift-test.py (1.49 kB)

2009-03-04 05:36:32

by Jesper Krogh

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

john stultz wrote:
> On Tue, 2009-03-03 at 21:39 +0100, Jesper Krogh wrote:
>> john stultz wrote:
>>> Do you see the same behavior if you drop all but one server (including
>>> the local clock: 127.127.1.0)?
>> Yes.
>> Mar 3 21:20:59 quad12 ntpd[2435]: ntpd [email protected] Tue Jan 6
>> 15:50:55 UTC 2009 (1)
>> Mar 3 21:20:59 quad12 ntpd[2436]: precision = 1.000 usec
>> Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #0 wildcard,
>> 0.0.0.0#123 Disabled
>> Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #1 wildcard,
>> ::#123 Disabled
>> Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #2 lo, ::1#123
>> Enabled
>> Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #3 bond0,
>> fe80::21e:68ff:fe57:8169#123 Enabled
>> Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #4 lo,
>> 127.0.0.1#123 Enabled
>> Mar 3 21:20:59 quad12 ntpd[2436]: Listening on interface #5 bond0,
>> 10.194.132.91#123 Enabled
>> Mar 3 21:20:59 quad12 ntpd[2436]: kernel time sync status 0040
>> Mar 3 21:20:59 quad12 ntpd[2436]: frequency initialized -29.286 PPM
>> from /var/lib/ntp/ntp.drift
>> Mar 3 21:21:58 quad12 ntpd[2436]: synchronized to 10.194.133.12, stratum 4
>> Mar 3 21:21:58 quad12 ntpd[2436]: time reset -6.148275 s
>> Mar 3 21:21:58 quad12 ntpd[2436]: kernel time sync status change 0001
>> Mar 3 21:25:01 quad12 ntpd[2436]: synchronized to 10.194.133.12, stratum 4
>> Mar 3 21:37:03 quad12 ntpd[2436]: time reset -0.664351 s
>>
>> Only one server and the minpoll 4 maxpoll 4 options to the server line.
>
> Well, it may still need a few hours to settle. :) Again, those time
> resets are seen when NTPd doesn't have a good drift ppm at startup, and
> it has to find it.

With one server and the maxpoll minpoll stuff, this on "settled" after a
bit more than 3 hours:
Mar 4 01:14:05 quad12 ntpd[2436]: time reset -0.381826 s
Mar 4 01:15:39 quad12 ntpd[2436]: synchronized to 10.194.133.12, stratum 4
jk@quad12:~$ uptime
06:35:40 up 15:55, 1 user, load average: 0.00, 0.00, 0.00
jk@quad12:~$ ntpq -c peers
remote refid st t when poll reach delay offset
jitter
==============================================================================
*bioinf.nzcorp.n 10.192.96.19 4 u 8 16 377 0.098 -80.184
0.673
jk@quad12:~$ ntpdc -c kerinfo
***Command `kerinfo' unknown
jk@quad12:~$ ntpdc -c kerninfo
pll offset: -0.06619 s
pll frequency: -500.000 ppm
maximum error: 0.130081 s
estimated error: 0.001201 s
status: 0001 pll
pll time constant: 4
precision: 1e-06 s
frequency tolerance: 500 ppm

--
Jesper

2009-03-04 15:31:28

by Jesper Krogh

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

john stultz wrote:
> I guess something to test my idea above (that the drift is bad enough
> that NTPd isn't making slew adjustments via adjtimex offset) is to
> remove NTPd from the init.d startup.
>
> Then after rebooting (into 2.6.29), run the attached python script for
> 10 minutes or so to get an idea of the ppm drift. Then repeat with
> 2.6.26.
>
> To run:
> ./drift-test.py <ntp server>
>
> It will give some wild ppm numbers, but after a few minutes it should
> settle down to the "natural drift" of the system.

Ok. I removed ntpd from the system... heres is from "non-working
2.6.28.7 kernel".
04 Mar 14:59:16 offset: -0.139829 drift: -656.0 ppm
04 Mar 15:00:16 offset: -0.175233 drift: -591.147540984 ppm
04 Mar 15:01:16 offset: -0.210637 drift: -590.611570248 ppm
04 Mar 15:02:16 offset: -0.246033 drift: -590.386740331 ppm
04 Mar 15:03:17 offset: -0.28144 drift: -587.880165289 ppm
04 Mar 15:04:17 offset: -0.31684 drift: -588.301324503 ppm
04 Mar 15:05:17 offset: -0.352247 drift: -588.602209945 ppm
04 Mar 15:06:17 offset: -0.387649 drift: -588.805687204 ppm
04 Mar 15:07:17 offset: -0.423046 drift: -588.94813278 ppm
04 Mar 15:08:17 offset: -0.458451 drift: -589.073800738 ppm
04 Mar 15:09:18 offset: -0.493856 drift: -588.1973466 ppm
04 Mar 15:10:18 offset: -0.529265 drift: -588.374057315 ppm
04 Mar 15:11:18 offset: -0.564661 drift: -588.503457815 ppm
04 Mar 15:12:18 offset: -0.600063 drift: -588.620689655 ppm
04 Mar 15:13:18 offset: -0.635458 drift: -588.712930012 ppm
04 Mar 15:14:18 offset: -0.040699 drift: 109.052048726 ppm
04 Mar 15:15:18 offset: -0.076098 drift: 65.4984423676 ppm
04 Mar 15:16:18 offset: -0.111495 drift: 27.0557184751 ppm
04 Mar 15:17:18 offset: -0.146885 drift: -7.12096029548 ppm
04 Mar 15:18:19 offset: -0.182285 drift: -37.6853146853 ppm
04 Mar 15:19:19 offset: -0.217688 drift: -65.2117940199 ppm
04 Mar 15:20:19 offset: -0.253085 drift: -90.1202531646 ppm
04 Mar 15:21:19 offset: -0.288479 drift: -112.768882175 ppm
04 Mar 15:22:19 offset: -0.323866 drift: -133.448699422 ppm
04 Mar 15:23:19 offset: -0.359259 drift: -152.414127424 ppm
04 Mar 15:24:20 offset: -0.394648 drift: -169.750830565 ppm
04 Mar 15:25:20 offset: -0.430047 drift: -185.861980831 ppm
04 Mar 15:26:20 offset: -0.46544 drift: -200.779692308 ppm
04 Mar 15:27:20 offset: -0.500835 drift: -214.63620178 ppm
04 Mar 15:28:20 offset: -0.536221 drift: -227.534670487 ppm
04 Mar 15:29:20 offset: -0.571605 drift: -239.574515235 ppm
04 Mar 15:30:21 offset: -0.606992 drift: -250.706859593 ppm
04 Mar 15:31:21 offset: -0.64241 drift: -261.286085151 ppm
04 Mar 15:32:21 offset: -0.677792 drift: -271.20795569 ppm
04 Mar 15:33:21 offset: -0.713187 drift: -280.554252199 ppm
04 Mar 15:34:21 offset: -0.040744 drift: 46.7374169041 ppm
04 Mar 15:35:21 offset: -0.076145 drift: 29.0987996307 ppm
04 Mar 15:36:21 offset: -0.111551 drift: 12.4088050314 ppm
04 Mar 15:37:21 offset: -0.146952 drift: -3.40288713911 ppm

And from working 2.6.27.19 kernel.

jk@quad12:~$ python drift-test.py 10.192.96.19
04 Mar 16:17:23 offset: -0.006929 drift: -62.0 ppm
04 Mar 16:18:24 offset: -0.010252 drift: -54.5967741935 ppm
04 Mar 16:19:24 offset: -0.013574 drift: -54.9754098361 ppm
04 Mar 16:20:24 offset: -0.016897 drift: -55.1098901099 ppm
04 Mar 16:21:24 offset: -0.020233 drift: -55.2314049587 ppm
04 Mar 16:22:24 offset: -0.023566 drift: -55.2947019868 ppm
04 Mar 16:23:24 offset: -0.026895 drift: -55.3259668508 ppm
04 Mar 16:24:24 offset: -0.030217 drift: -55.3317535545 ppm
04 Mar 16:25:24 offset: -0.033539 drift: -55.3360995851 ppm
04 Mar 16:26:24 offset: -0.036865 drift: -55.3468634686 ppm
04 Mar 16:27:25 offset: -0.038266 drift: -52.0713101161 ppm
04 Mar 16:28:25 offset: -0.039747 drift: -49.592760181 ppm
04 Mar 16:29:25 offset: -0.041331 drift: -47.6680497925 ppm


--
Jesper

2009-03-04 18:37:21

by Jesper Krogh

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

Jesper Krogh wrote:
> john stultz wrote:
>> I guess something to test my idea above (that the drift is bad enough
>> that NTPd isn't making slew adjustments via adjtimex offset) is to
>> remove NTPd from the init.d startup.
>>
>> Then after rebooting (into 2.6.29), run the attached python script for
>> 10 minutes or so to get an idea of the ppm drift. Then repeat with
>> 2.6.26.
>>
>> To run: ./drift-test.py <ntp server>
>>
>> It will give some wild ppm numbers, but after a few minutes it should
>> settle down to the "natural drift" of the system.
>
> Ok. I removed ntpd from the system... heres is from "non-working

Updated. I think I has NTPd running in the former "non-working" test. I
just tried to reproduce the numbers, and they look like this
(reproducible on 2.6.29-rc6).

jk@quad12:~$ python drift-test.py 10.192.96.19
04 Mar 19:27:10 offset: -0.157696 drift: -693.0 ppm
04 Mar 19:28:10 offset: -0.195134 drift: -625.098360656 ppm
04 Mar 19:29:10 offset: -0.232579 drift: -624.595041322 ppm
04 Mar 19:30:10 offset: -0.270021 drift: -624.408839779 ppm
04 Mar 19:31:11 offset: -0.307461 drift: -621.727272727 ppm
04 Mar 19:32:11 offset: -0.344903 drift: -622.185430464 ppm
04 Mar 19:33:11 offset: -0.382345 drift: -622.491712707 ppm
04 Mar 19:34:11 offset: -0.419794 drift: -622.727488152 ppm
04 Mar 19:35:11 offset: -0.457239 drift: -622.89626556 ppm

Still the same.

> And from working 2.6.27.19 kernel.
>
> jk@quad12:~$ python drift-test.py 10.192.96.19
> 04 Mar 16:17:23 offset: -0.006929 drift: -62.0 ppm
> 04 Mar 16:18:24 offset: -0.010252 drift: -54.5967741935 ppm
> 04 Mar 16:19:24 offset: -0.013574 drift: -54.9754098361 ppm
> 04 Mar 16:20:24 offset: -0.016897 drift: -55.1098901099 ppm
> 04 Mar 16:21:24 offset: -0.020233 drift: -55.2314049587 ppm
> 04 Mar 16:22:24 offset: -0.023566 drift: -55.2947019868 ppm
> 04 Mar 16:23:24 offset: -0.026895 drift: -55.3259668508 ppm
> 04 Mar 16:24:24 offset: -0.030217 drift: -55.3317535545 ppm
> 04 Mar 16:25:24 offset: -0.033539 drift: -55.3360995851 ppm
> 04 Mar 16:26:24 offset: -0.036865 drift: -55.3468634686 ppm
> 04 Mar 16:27:25 offset: -0.038266 drift: -52.0713101161 ppm
> 04 Mar 16:28:25 offset: -0.039747 drift: -49.592760181 ppm
> 04 Mar 16:29:25 offset: -0.041331 drift: -47.6680497925 ppm



--
Jesper

2009-03-04 18:59:44

by john stultz

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

On Wed, 2009-03-04 at 19:36 +0100, Jesper Krogh wrote:
> Jesper Krogh wrote:
> > john stultz wrote:
> >> I guess something to test my idea above (that the drift is bad enough
> >> that NTPd isn't making slew adjustments via adjtimex offset) is to
> >> remove NTPd from the init.d startup.
> >>
> >> Then after rebooting (into 2.6.29), run the attached python script for
> >> 10 minutes or so to get an idea of the ppm drift. Then repeat with
> >> 2.6.26.
> >>
> >> To run: ./drift-test.py <ntp server>
> >>
> >> It will give some wild ppm numbers, but after a few minutes it should
> >> settle down to the "natural drift" of the system.
> >
> > Ok. I removed ntpd from the system... heres is from "non-working
>
> Updated. I think I has NTPd running in the former "non-working" test. I
> just tried to reproduce the numbers, and they look like this
> (reproducible on 2.6.29-rc6).

Yea, the last numbers did look odd :)

> jk@quad12:~$ python drift-test.py 10.192.96.19
> 04 Mar 19:27:10 offset: -0.157696 drift: -693.0 ppm
> 04 Mar 19:28:10 offset: -0.195134 drift: -625.098360656 ppm
> 04 Mar 19:29:10 offset: -0.232579 drift: -624.595041322 ppm
> 04 Mar 19:30:10 offset: -0.270021 drift: -624.408839779 ppm
> 04 Mar 19:31:11 offset: -0.307461 drift: -621.727272727 ppm
> 04 Mar 19:32:11 offset: -0.344903 drift: -622.185430464 ppm
> 04 Mar 19:33:11 offset: -0.382345 drift: -622.491712707 ppm
> 04 Mar 19:34:11 offset: -0.419794 drift: -622.727488152 ppm
> 04 Mar 19:35:11 offset: -0.457239 drift: -622.89626556 ppm


Yea, so from this and the settled ntpdc -c kerninfo data before, we can
see that the drift is further out then the 500ppm NTP can handle.

So with that at least confirmed, we can focus back on to the fast-pit
tsc calibration code.

Ingo, Thomas: I'm missing a bit of the context to that patch, other then
just speeding up boot times, was there other rational for moving away
from the ACPI PM timer based calibration?

Could we maybe add a quick test that the pit reads actually take the
assumed 2us max? Doing this maybe via the HPET/ACPI PM?

thanks
-john

2009-03-05 02:39:33

by john stultz

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

On Wed, 2009-03-04 at 10:57 -0800, John Stultz wrote:
> On Wed, 2009-03-04 at 19:36 +0100, Jesper Krogh wrote:
> > jk@quad12:~$ python drift-test.py 10.192.96.19
> > 04 Mar 19:27:10 offset: -0.157696 drift: -693.0 ppm
> > 04 Mar 19:28:10 offset: -0.195134 drift: -625.098360656 ppm
> > 04 Mar 19:29:10 offset: -0.232579 drift: -624.595041322 ppm
> > 04 Mar 19:30:10 offset: -0.270021 drift: -624.408839779 ppm
> > 04 Mar 19:31:11 offset: -0.307461 drift: -621.727272727 ppm
> > 04 Mar 19:32:11 offset: -0.344903 drift: -622.185430464 ppm
> > 04 Mar 19:33:11 offset: -0.382345 drift: -622.491712707 ppm
> > 04 Mar 19:34:11 offset: -0.419794 drift: -622.727488152 ppm
> > 04 Mar 19:35:11 offset: -0.457239 drift: -622.89626556 ppm
>
>
> Yea, so from this and the settled ntpdc -c kerninfo data before, we can
> see that the drift is further out then the 500ppm NTP can handle.
>
> So with that at least confirmed, we can focus back on to the fast-pit
> tsc calibration code.
>
> Ingo, Thomas: I'm missing a bit of the context to that patch, other then
> just speeding up boot times, was there other rational for moving away
> from the ACPI PM timer based calibration?
>
> Could we maybe add a quick test that the pit reads actually take the
> assumed 2us max? Doing this maybe via the HPET/ACPI PM?

Hey Jesper,

Here's a very-hackish patch to see if the approach I'm considering
might fix the issue you're hitting. Could you apply it, boot the kernel
a few times and send me the following segments of the dmesg for each of
those boots (the example below is from my test box)?

tsc delta: 44418024
ref_freq: 3000100 pit_freq: 3000384
TSC: Fast PIT calibration matches PMTIMER.
TSC: PIT calibration matches PMTIMER. 1 loops
Detected 3000.045 MHz processor.

I'm trying to see how regular the mis-calculation is, as well as see how
well the alternate calibration method does to handle this on your
hardware.

Its likely the fat pit calibration can be better integrated with the
other calibration methods, so this probably isn't anything close to what
the actual fix will look like.

Ingo, Thomas: On the hardware I'm testing the fast-pit calibration only
triggers probably 80-90% of the time. About 10-20% of the time, the
initial check to pit_expect_msb(0xff) fails (count=0), so we may need to
look more at this approach.

thanks
-john



2009-03-05 02:52:26

by john stultz

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

On Wed, 2009-03-04 at 18:39 -0800, john stultz wrote:
> On Wed, 2009-03-04 at 10:57 -0800, John Stultz wrote:
> > On Wed, 2009-03-04 at 19:36 +0100, Jesper Krogh wrote:
> > > jk@quad12:~$ python drift-test.py 10.192.96.19
> > > 04 Mar 19:27:10 offset: -0.157696 drift: -693.0 ppm
> > > 04 Mar 19:28:10 offset: -0.195134 drift: -625.098360656 ppm
> > > 04 Mar 19:29:10 offset: -0.232579 drift: -624.595041322 ppm
> > > 04 Mar 19:30:10 offset: -0.270021 drift: -624.408839779 ppm
> > > 04 Mar 19:31:11 offset: -0.307461 drift: -621.727272727 ppm
> > > 04 Mar 19:32:11 offset: -0.344903 drift: -622.185430464 ppm
> > > 04 Mar 19:33:11 offset: -0.382345 drift: -622.491712707 ppm
> > > 04 Mar 19:34:11 offset: -0.419794 drift: -622.727488152 ppm
> > > 04 Mar 19:35:11 offset: -0.457239 drift: -622.89626556 ppm
> >
> >
> > Yea, so from this and the settled ntpdc -c kerninfo data before, we can
> > see that the drift is further out then the 500ppm NTP can handle.
> >
> > So with that at least confirmed, we can focus back on to the fast-pit
> > tsc calibration code.
> >
> > Ingo, Thomas: I'm missing a bit of the context to that patch, other then
> > just speeding up boot times, was there other rational for moving away
> > from the ACPI PM timer based calibration?
> >
> > Could we maybe add a quick test that the pit reads actually take the
> > assumed 2us max? Doing this maybe via the HPET/ACPI PM?
>
> Hey Jesper,
>
> Here's a very-hackish patch to see if the approach I'm considering
> might fix the issue you're hitting. Could you apply it, boot the kernel
> a few times and send me the following segments of the dmesg for each of
> those boots (the example below is from my test box)?
>
> tsc delta: 44418024
> ref_freq: 3000100 pit_freq: 3000384
> TSC: Fast PIT calibration matches PMTIMER.
> TSC: PIT calibration matches PMTIMER. 1 loops
> Detected 3000.045 MHz processor.
>
> I'm trying to see how regular the mis-calculation is, as well as see how
> well the alternate calibration method does to handle this on your
> hardware.
>
> Its likely the fat pit calibration can be better integrated with the
> other calibration methods, so this probably isn't anything close to what
> the actual fix will look like.
>
> Ingo, Thomas: On the hardware I'm testing the fast-pit calibration only
> triggers probably 80-90% of the time. About 10-20% of the time, the
> initial check to pit_expect_msb(0xff) fails (count=0), so we may need to
> look more at this approach.

Err. Sorry, hit send before I included the patch.

-john



Not for inclusion.

Signed-off-by: John Stultz <[email protected]>

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 599e581..2e16d30 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -317,15 +317,17 @@ static unsigned long quick_pit_calibrate(void)

if (pit_expect_msb(0xff)) {
int i;
- u64 t1, t2, delta;
+ u64 t1, t2, delta, ref1, ref2;
+ u64 ref_freq = 0, pit_freq = 0;
+ int hpet = is_hpet_enabled();
unsigned char expect = 0xfe;

- t1 = get_cycles();
+ t1 = tsc_read_refs(&ref1, hpet);
for (i = 0; i < QUICK_PIT_ITERATIONS; i++, expect--) {
if (!pit_expect_msb(expect))
goto failed;
}
- t2 = get_cycles();
+ t2 = tsc_read_refs(&ref2, hpet);

/*
* Make sure we can rely on the second TSC timestamp:
@@ -333,6 +335,13 @@ static unsigned long quick_pit_calibrate(void)
if (!pit_expect_msb(expect))
goto failed;

+
+ delta = (t2 - t1);
+ if (hpet)
+ ref_freq = calc_hpet_ref(delta*1000000LL, ref1, ref2);
+ else
+ ref_freq = calc_pmtimer_ref(delta*1000000LL, ref1, ref2);
+
/*
* Ok, if we get here, then we've seen the
* MSB of the PIT decrement QUICK_PIT_ITERATIONS
@@ -347,10 +356,32 @@ static unsigned long quick_pit_calibrate(void)
* kHz = (t2 - t1) / (QPI * 256 / PIT_TICK_RATE) / 1000
* kHz = ((t2 - t1) * PIT_TICK_RATE) / (QPI * 256 * 1000)
*/
- delta = (t2 - t1)*PIT_TICK_RATE;
- do_div(delta, QUICK_PIT_ITERATIONS*256*1000);
+ printk("tsc delta: %lld\n", t2-t1);
+
+ pit_freq = delta * PIT_TICK_RATE;
+ do_div(pit_freq, QUICK_PIT_ITERATIONS*256*1000);
+
+ printk("ref_freq: %lld pit_freq: %lld\n", ref_freq, pit_freq);
+
+ /* Check the reference deviation */
+ delta = ((u64) pit_freq) * 100;
+ do_div(delta, ref_freq);
+
+ /*
+ * If both calibration results are inside a 10% window
+ * then we can be sure, that the calibration
+ * succeeded. We break out of the loop right away. We
+ * use the reference value, as it is more precise.
+ */
+ if (delta >= 90 && delta <= 110) {
+ printk(KERN_INFO
+ "TSC: Fast PIT calibration matches %s.\n",
+ hpet ? "HPET" : "PMTIMER");
+ return ref_freq;
+ }
+
printk("Fast TSC calibration using PIT\n");
- return delta;
+ return pit_freq;
}
failed:
return 0;
@@ -375,7 +406,7 @@ unsigned long native_calibrate_tsc(void)
local_irq_save(flags);
fast_calibrate = quick_pit_calibrate();
local_irq_restore(flags);
- if (fast_calibrate)
+ if (0 && fast_calibrate)
return fast_calibrate;

/*

2009-03-05 08:44:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6


* john stultz <[email protected]> wrote:

> > Ingo, Thomas: On the hardware I'm testing the fast-pit
> > calibration only triggers probably 80-90% of the time. About
> > 10-20% of the time, the initial check to
> > pit_expect_msb(0xff) fails (count=0), so we may need to look
> > more at this approach.

We definitely need to improve calibration quality.

The question is - why does fast-calibration fail 10-20% of the
time on your test-system? Also, why exactly do we miscalibrate?
Could you please have a look at that?

One theory would be that the PIT readout is unreliable. Windows
does not make use of it, so it's not the most tested aspect of
the PIT. Is that what happens on your box?

Ingo

2009-03-06 03:13:59

by john stultz

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

On Thu, 2009-03-05 at 09:43 +0100, Ingo Molnar wrote:
> * john stultz <[email protected]> wrote:
>
> > > Ingo, Thomas: On the hardware I'm testing the fast-pit
> > > calibration only triggers probably 80-90% of the time. About
> > > 10-20% of the time, the initial check to
> > > pit_expect_msb(0xff) fails (count=0), so we may need to look
> > > more at this approach.
>
> We definitely need to improve calibration quality.
>
> The question is - why does fast-calibration fail 10-20% of the
> time on your test-system? Also, why exactly do we miscalibrate?
> Could you please have a look at that?

Working on it, I just wanted to let you know I was seeing some different
odd behavior then Jesper.

> One theory would be that the PIT readout is unreliable. Windows
> does not make use of it, so it's not the most tested aspect of
> the PIT. Is that what happens on your box?

Still looking into it, but from my initial debugging it seems that by
reading the PIT very quickly after setting it, we may be getting junk
values. If I re-read the PIT again, I see the expected 0xff value.

Its been somewhat of a heisenbug, as if I add any printk's or even just
a mb() after the outb it seems to make the problem go away (or just rare
enough I don't have the patience to reproduce it :)

So I don't know if a small delay is appropriate here (seems counter
productive to the whole fast-pit calibration ;) or if we should just try
to catch these bad reads and try again before failing?

Thoughts?

thanks
-john

2009-03-06 03:54:20

by john stultz

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

On Thu, 2009-03-05 at 19:13 -0800, john stultz wrote:
> On Thu, 2009-03-05 at 09:43 +0100, Ingo Molnar wrote:
> > * john stultz <[email protected]> wrote:
> >
> > > > Ingo, Thomas: On the hardware I'm testing the fast-pit
> > > > calibration only triggers probably 80-90% of the time. About
> > > > 10-20% of the time, the initial check to
> > > > pit_expect_msb(0xff) fails (count=0), so we may need to look
> > > > more at this approach.
> >
> > We definitely need to improve calibration quality.
> >
> > The question is - why does fast-calibration fail 10-20% of the
> > time on your test-system? Also, why exactly do we miscalibrate?
> > Could you please have a look at that?
>
> Working on it, I just wanted to let you know I was seeing some different
> odd behavior then Jesper.
>
> > One theory would be that the PIT readout is unreliable. Windows
> > does not make use of it, so it's not the most tested aspect of
> > the PIT. Is that what happens on your box?
>
> Still looking into it, but from my initial debugging it seems that by
> reading the PIT very quickly after setting it, we may be getting junk
> values. If I re-read the PIT again, I see the expected 0xff value.
>
> Its been somewhat of a heisenbug, as if I add any printk's or even just
> a mb() after the outb it seems to make the problem go away (or just rare
> enough I don't have the patience to reproduce it :)
>
> So I don't know if a small delay is appropriate here (seems counter
> productive to the whole fast-pit calibration ;) or if we should just try
> to catch these bad reads and try again before failing?

Maybe something like the following? (Not tested heavily yet!)

Again, just for clarity, as we've mixed a few issues here, this patch is
for a side issue and not related to the original regression reported by
Jesper. I'm still waiting on debug output from Jesper to further
diagnose whats going wrong with his TSC calibration.

thanks
-john


Apparently some hardware may occasionally return junk values if you try
to read the pit immediately after setting it. This causes the
pit_expect_msb() to occasionally fail (~10% of the time).

This patch tries to work around this issue by not failing if the first
read right after setting the PIT is not what we expect.

NOT FOR INCLUSION (yet!)

Signed-off-by: John Stultz <[email protected]>

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 599e581..2ca5ba4 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -280,8 +280,17 @@ static inline int pit_expect_msb(unsigned char val)
for (count = 0; count < 50000; count++) {
/* Ignore LSB */
inb(0x42);
- if (inb(0x42) != val)
+ if (inb(0x42) != val) {
+ /*
+ * If we're too fast, we may read
+ * junk values right after we set
+ * the PIT. So if this is the first
+ * read, try again
+ */
+ if (val == 0xff && count == 0)
+ continue;
break;
+ }
}
return count > 50;
}

2009-03-06 11:35:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6


* john stultz <[email protected]> wrote:

> On Thu, 2009-03-05 at 19:13 -0800, john stultz wrote:
> > On Thu, 2009-03-05 at 09:43 +0100, Ingo Molnar wrote:
> > > * john stultz <[email protected]> wrote:
> > >
> > > > > Ingo, Thomas: On the hardware I'm testing the fast-pit
> > > > > calibration only triggers probably 80-90% of the time. About
> > > > > 10-20% of the time, the initial check to
> > > > > pit_expect_msb(0xff) fails (count=0), so we may need to look
> > > > > more at this approach.
> > >
> > > We definitely need to improve calibration quality.
> > >
> > > The question is - why does fast-calibration fail 10-20% of the
> > > time on your test-system? Also, why exactly do we miscalibrate?
> > > Could you please have a look at that?
> >
> > Working on it, I just wanted to let you know I was seeing some different
> > odd behavior then Jesper.
> >
> > > One theory would be that the PIT readout is unreliable. Windows
> > > does not make use of it, so it's not the most tested aspect of
> > > the PIT. Is that what happens on your box?
> >
> > Still looking into it, but from my initial debugging it seems that by
> > reading the PIT very quickly after setting it, we may be getting junk
> > values. If I re-read the PIT again, I see the expected 0xff value.
> >
> > Its been somewhat of a heisenbug, as if I add any printk's or even just
> > a mb() after the outb it seems to make the problem go away (or just rare
> > enough I don't have the patience to reproduce it :)
> >
> > So I don't know if a small delay is appropriate here (seems counter
> > productive to the whole fast-pit calibration ;) or if we should just try
> > to catch these bad reads and try again before failing?
>
> Maybe something like the following? (Not tested heavily yet!)
>
> Again, just for clarity, as we've mixed a few issues here, this patch is
> for a side issue and not related to the original regression reported by
> Jesper. I'm still waiting on debug output from Jesper to further
> diagnose whats going wrong with his TSC calibration.
>
> thanks
> -john
>
>
> Apparently some hardware may occasionally return junk values if you try
> to read the pit immediately after setting it. This causes the
> pit_expect_msb() to occasionally fail (~10% of the time).
>
> This patch tries to work around this issue by not failing if the first
> read right after setting the PIT is not what we expect.
>
> NOT FOR INCLUSION (yet!)
>
> Signed-off-by: John Stultz <[email protected]>
>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 599e581..2ca5ba4 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -280,8 +280,17 @@ static inline int pit_expect_msb(unsigned char val)
> for (count = 0; count < 50000; count++) {
> /* Ignore LSB */
> inb(0x42);
> - if (inb(0x42) != val)
> + if (inb(0x42) != val) {
> + /*
> + * If we're too fast, we may read
> + * junk values right after we set
> + * the PIT. So if this is the first
> + * read, try again
> + */
> + if (val == 0xff && count == 0)
> + continue;
> break;

We could do something like that if it helps the end result. But
this special thing inside the loop should just be an
unconditional inb(0x42) outside the loop. It does not hurt
performance there, and we'll get simpler code that way.

But ... i really dont like how we rely on PIT readouts and how
we work around PIT readout artifacts. Only Linux does PIT
readouts while Windows does not - so we rely on a under-tested
aspect of PC hardware.

I think we should think about a fundamentally different, IRQ
driven way of calibration. For example we could program a 27
milliseconds PIT periodic interrupt with the maximum count and
measure its arrival timestamp in two subsequent interrupts.

We could do that with about 1-2 usecs precision realistically
(this early during bootup we are really quiescent) - and over a
27,000 usecs period that gives us an accuracy of 1:13500, or
about 75 ppm. That's still only about 50 milliseconds spent
calibrating, so very fast.

We can re-write the IRQ#0 vector with a special temporary
calibration interrupt handler to make this really single-purpose
and precise.

Hm?

Ingo

2009-03-09 20:42:55

by Jesper Krogh

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

john stultz wrote:
> On Wed, 2009-03-04 at 18:39 -0800, john stultz wrote:
>> On Wed, 2009-03-04 at 10:57 -0800, John Stultz wrote:
>>> On Wed, 2009-03-04 at 19:36 +0100, Jesper Krogh wrote:
>>>> jk@quad12:~$ python drift-test.py 10.192.96.19
>>>> 04 Mar 19:27:10 offset: -0.157696 drift: -693.0 ppm
>>>> 04 Mar 19:28:10 offset: -0.195134 drift: -625.098360656 ppm
>>>> 04 Mar 19:29:10 offset: -0.232579 drift: -624.595041322 ppm
>>>> 04 Mar 19:30:10 offset: -0.270021 drift: -624.408839779 ppm
>>>> 04 Mar 19:31:11 offset: -0.307461 drift: -621.727272727 ppm
>>>> 04 Mar 19:32:11 offset: -0.344903 drift: -622.185430464 ppm
>>>> 04 Mar 19:33:11 offset: -0.382345 drift: -622.491712707 ppm
>>>> 04 Mar 19:34:11 offset: -0.419794 drift: -622.727488152 ppm
>>>> 04 Mar 19:35:11 offset: -0.457239 drift: -622.89626556 ppm
>>>
>>> Yea, so from this and the settled ntpdc -c kerninfo data before, we can
>>> see that the drift is further out then the 500ppm NTP can handle.
>>>
>>> So with that at least confirmed, we can focus back on to the fast-pit
>>> tsc calibration code.
>>>
>>> Ingo, Thomas: I'm missing a bit of the context to that patch, other then
>>> just speeding up boot times, was there other rational for moving away
>>> from the ACPI PM timer based calibration?
>>>
>>> Could we maybe add a quick test that the pit reads actually take the
>>> assumed 2us max? Doing this maybe via the HPET/ACPI PM?
>> Hey Jesper,
>>
>> Here's a very-hackish patch to see if the approach I'm considering
>> might fix the issue you're hitting. Could you apply it, boot the kernel
>> a few times and send me the following segments of the dmesg for each of
>> those boots (the example below is from my test box)?
>>
>> tsc delta: 44418024
>> ref_freq: 3000100 pit_freq: 3000384
>> TSC: Fast PIT calibration matches PMTIMER.
>> TSC: PIT calibration matches PMTIMER. 1 loops
>> Detected 3000.045 MHz processor.

Hi John.

Patched into 2.6.28.7 ..

First boot.
[ 0.000000] tsc delta: 34203220
[ 0.000000] ref_freq: 2311825 pit_freq: 2310386
[ 0.000000] TSC: Fast PIT calibration matches PMTIMER.
[ 0.000000] TSC: PIT calibration matches PMTIMER. 2 loops
[ 0.000000] Detected 2311.877 MHz processor.
Second boot:
[ 0.000000] tsc delta: 34200313
[ 0.000000] ref_freq: 2311803 pit_freq: 2310190
[ 0.000000] TSC: Fast PIT calibration matches PMTIMER.
[ 0.000000] TSC: PIT calibration matches PMTIMER. 2 loops
[ 0.000000] Detected 2311.876 MHz processor.
Third boot:
[ 0.000000] tsc delta: 34198686
[ 0.000000] ref_freq: 2311824 pit_freq: 2310080
[ 0.000000] TSC: Fast PIT calibration matches PMTIMER.
[ 0.000000] TSC: PIT calibration matches PMTIMER. 1 loops
[ 0.000000] Detected 2311.872 MHz processor.
Fourth boot:
[ 0.000000] tsc delta: 34199433
[ 0.000000] ref_freq: 2311831 pit_freq: 2310130
[ 0.000000] TSC: Fast PIT calibration matches PMTIMER.
[ 0.000000] TSC: PIT calibration matches PMTIMER. 2 loops
[ 0.000000] Detected 2311.821 MHz processor.



--
Jesper

2009-03-10 04:27:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6



On Mon, 9 Mar 2009, Jesper Krogh wrote:
>
> First boot.
> [ 0.000000] ref_freq: 2311825 pit_freq: 2310386
> Second boot:
> [ 0.000000] ref_freq: 2311803 pit_freq: 2310190
> Third boot:
> [ 0.000000] ref_freq: 2311824 pit_freq: 2310080
> Fourth boot:
> [ 0.000000] ref_freq: 2311831 pit_freq: 2310130

It's really quite impressively stable, but the fast-PIT calibration
frequency is reliably about 3/4 of a promille low. Or, put another way,
the TSC difference over the pit calibration is just a _tad_ too small
compared to the value we'd expect if that loop of pit_expect_msb() would
really run at the expected delay of a 1.193182MHz clock divided by 256.

And it's stable in that it really always seems to be off by a very similar
amount. It's not moving around very much.

I also wonder why it seems to happen mainly just to _you_. There's
absolutely nothing odd in your system, neither a slow CPU or anything
else that would stand out.

Grr. Very annoyingly non-obvious.

Linus

2009-03-10 11:30:50

by Thomas Gleixner

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

On Mon, 9 Mar 2009, Linus Torvalds wrote:
> On Mon, 9 Mar 2009, Jesper Krogh wrote:
> >
> > First boot.
> > [ 0.000000] ref_freq: 2311825 pit_freq: 2310386
> > Second boot:
> > [ 0.000000] ref_freq: 2311803 pit_freq: 2310190
> > Third boot:
> > [ 0.000000] ref_freq: 2311824 pit_freq: 2310080
> > Fourth boot:
> > [ 0.000000] ref_freq: 2311831 pit_freq: 2310130
>
> It's really quite impressively stable, but the fast-PIT calibration
> frequency is reliably about 3/4 of a promille low. Or, put another way,
> the TSC difference over the pit calibration is just a _tad_ too small
> compared to the value we'd expect if that loop of pit_expect_msb() would
> really run at the expected delay of a 1.193182MHz clock divided by 256.
>
> And it's stable in that it really always seems to be off by a very similar
> amount. It's not moving around very much.
>
> I also wonder why it seems to happen mainly just to _you_. There's
> absolutely nothing odd in your system, neither a slow CPU or anything
> else that would stand out.
>
> Grr. Very annoyingly non-obvious.

Indeed. One hint is in the slow calibration path. 3 of 4 boots have:

> > [ 0.000000] TSC: PIT calibration matches PMTIMER. 2 loops

So the slow calibration path detects some disturbance.

Jesper, can you please apply the following patch instead of Johns and
provide the output for a couple of boots? The output is:

Fast TSC calibration using PIT
tsc 43425305 tscmin 624008 tscmax 632610

Thanks,

tglx

--- linux-2.6.orig/arch/x86/kernel/tsc.c
+++ linux-2.6/arch/x86/kernel/tsc.c
@@ -317,15 +317,22 @@ static unsigned long quick_pit_calibrate

if (pit_expect_msb(0xff)) {
int i;
- u64 t1, t2, delta;
+ u64 t1, t2, t3, delta;
unsigned char expect = 0xfe;
+ unsigned long tscmin = ULONG_MAX, tscmax = 0;

- t1 = get_cycles();
+ t1 = t2 = get_cycles();
for (i = 0; i < QUICK_PIT_ITERATIONS; i++, expect--) {
if (!pit_expect_msb(expect))
goto failed;
+ t3 = get_cycles();
+ delta = t3 - t2;
+ t2 = t3;
+ if ((unsigned long) delta < tscmin)
+ tscmin = (unsigned int) delta;
+ if ((unsigned long) delta > tscmax)
+ tscmax = (unsigned int) delta;
}
- t2 = get_cycles();

/*
* Make sure we can rely on the second TSC timestamp:
@@ -350,6 +357,8 @@ static unsigned long quick_pit_calibrate
delta = (t2 - t1)*PIT_TICK_RATE;
do_div(delta, QUICK_PIT_ITERATIONS*256*1000);
printk("Fast TSC calibration using PIT\n");
+ printk("tsc %ld tscmin %ld tscmax %ld\n",
+ (unsigned long) (t2 - t1), tscmin, tscmax);
return delta;
}
failed:

2009-03-10 19:43:16

by Jesper Krogh

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

Thomas Gleixner wrote:
> On Mon, 9 Mar 2009, Linus Torvalds wrote:
>> On Mon, 9 Mar 2009, Jesper Krogh wrote:
>>> First boot.
>>> [ 0.000000] ref_freq: 2311825 pit_freq: 2310386
>>> Second boot:
>>> [ 0.000000] ref_freq: 2311803 pit_freq: 2310190
>>> Third boot:
>>> [ 0.000000] ref_freq: 2311824 pit_freq: 2310080
>>> Fourth boot:
>>> [ 0.000000] ref_freq: 2311831 pit_freq: 2310130
>> It's really quite impressively stable, but the fast-PIT calibration
>> frequency is reliably about 3/4 of a promille low. Or, put another way,
>> the TSC difference over the pit calibration is just a _tad_ too small
>> compared to the value we'd expect if that loop of pit_expect_msb() would
>> really run at the expected delay of a 1.193182MHz clock divided by 256.
>>
>> And it's stable in that it really always seems to be off by a very similar
>> amount. It's not moving around very much.
>>
>> I also wonder why it seems to happen mainly just to _you_. There's
>> absolutely nothing odd in your system, neither a slow CPU or anything
>> else that would stand out.
>>
>> Grr. Very annoyingly non-obvious.
>
> Indeed. One hint is in the slow calibration path. 3 of 4 boots have:
>
>>> [ 0.000000] TSC: PIT calibration matches PMTIMER. 2 loops
>
> So the slow calibration path detects some disturbance.
>
> Jesper, can you please apply the following patch instead of Johns and
> provide the output for a couple of boots? The output is:
>
> Fast TSC calibration using PIT
> tsc 43425305 tscmin 624008 tscmax 632610

First boot:
[ 0.000000] Fast TSC calibration using PIT
[ 0.000000] tsc 34202223 tscmin 474069 tscmax 500664
Second boot:
Here I didnt get above messages.. http://krogh.cc/~jesper/dmesg-boot2.txt
Third boot:
[ 0.000000] Fast TSC calibration using PIT
[ 0.000000] tsc 34199856 tscmin 470321 tscmax 502182
Forth boot:
[ 0.000000] Fast TSC calibration using PIT
[ 0.000000] tsc 34202008 tscmin 475510 tscmax 501501

The second one is really strange.. is'nt it?

While booting up I saw this one on the serial console..
root@quad12:~# hwclock --systohc
Cannot access the Hardware Clock via any known method.
Use the --debug option to see the details of our search for an access
method.
root@quad12:~# hwclock --systohc --debug
hwclock from util-linux-ng 2.13.1
hwclock: Open of /dev/rtc failed, errno=2: No such file or directory.
No usable clock interface found.
Cannot access the Hardware Clock via any known method.

Jesper
--
Jesper

2009-03-10 22:23:19

by Thomas Gleixner

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

Jesper,

On Tue, 10 Mar 2009, Jesper Krogh wrote:
> First boot:
> [ 0.000000] Fast TSC calibration using PIT
> [ 0.000000] tsc 34202223 tscmin 474069 tscmax 500664
> Second boot:
> Here I didnt get above messages.. http://krogh.cc/~jesper/dmesg-boot2.txt
> Third boot:
> [ 0.000000] Fast TSC calibration using PIT
> [ 0.000000] tsc 34199856 tscmin 470321 tscmax 502182
> Forth boot:
> [ 0.000000] Fast TSC calibration using PIT
> [ 0.000000] tsc 34202008 tscmin 475510 tscmax 501501
>
> The second one is really strange.. is'nt it?

No, there simply the fast PIT calibration failed and it dropped into
the slow path:
[ 0.000000] TSC: PIT calibration matches PMTIMER. 1 loops
[ 0.000000] Detected 2311.878 MHz processor.

But the variance of the third run is interesting:

avg = tsc / loops = 495650
avg - tscmin = 25329 (~ 10.9 us)
tscmax - avg = 6532 (~ 2.8 us)

While this is in the range which the PIT calibration code accepts the
resulting CPU frequency of this run is 2310.159 MHz which is way off
the result of the slow path in the 2nd run. The 1st and the 4th run
have significant high variance as well.

I run the same patch on a couple of test machines and all have
deviations from avg in the range of +/- 2 us and the calibration
result is stable and correct.

I have no idea what might cause the problem with your machine. PIT via
SMM emulation comes to mind :)

But we can use the tscmin/max method to figure out whether the fast
PIT result is reliable. See patch below. It should drop out into the
slow calibration path on every boot on your machine.

(tscmax - tscmin) / avg = 0.064 (result from third run)

On my test machines I get values below 0.02

While it's statistically not really correct we still can use that info
to catch cases like we see on your machines.

> While booting up I saw this one on the serial console..
> root@quad12:~# hwclock --systohc
> Cannot access the Hardware Clock via any known method.
> Use the --debug option to see the details of our search for an access method.
> root@quad12:~# hwclock --systohc --debug
> hwclock from util-linux-ng 2.13.1
> hwclock: Open of /dev/rtc failed, errno=2: No such file or directory.
> No usable clock interface found.
> Cannot access the Hardware Clock via any known method.

Can you provide your .config file please ?

Thanks,

tglx

--------->

Subject: x86: make TSC fast calibration more robust
From: Thomas Gleixner <[email protected]>
Date: Tue, 10 Mar 2009 11:12:03 +0100

Check the min/max duration of each PIT loop against the resulting
average value and dismiss the fast calibration if it's larger than
2.5%. 2.5% is in the range of +/- 2us, which is a reasonable range
when we assume that a PIT read can easily take 1 us.

Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/kernel/tsc.c | 30 +++++++++++++++++++++++++++---
1 file changed, 27 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/x86/kernel/tsc.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/tsc.c
+++ linux-2.6/arch/x86/kernel/tsc.c
@@ -317,15 +317,22 @@ static unsigned long quick_pit_calibrate

if (pit_expect_msb(0xff)) {
int i;
- u64 t1, t2, delta;
+ u64 t1, t2, t3, delta;
unsigned char expect = 0xfe;
+ unsigned long tscmin = ULONG_MAX, tscmax = 0;

- t1 = get_cycles();
+ t1 = t2 = get_cycles();
for (i = 0; i < QUICK_PIT_ITERATIONS; i++, expect--) {
if (!pit_expect_msb(expect))
goto failed;
+ t3 = get_cycles();
+ delta = t3 - t2;
+ t2 = t3;
+ if ((unsigned long) delta < tscmin)
+ tscmin = (unsigned long) delta;
+ if ((unsigned long) delta > tscmax)
+ tscmax = (unsigned long) delta;
}
- t2 = get_cycles();

/*
* Make sure we can rely on the second TSC timestamp:
@@ -334,6 +341,23 @@ static unsigned long quick_pit_calibrate
goto failed;

/*
+ * Sanity check the min max values:
+ *
+ * We calculate the average tsc increment per loop
+ * step. Now we take the tscmin and tscmax value and
+ * check whether the deviation is inside an acceptable
+ * range.
+ */
+ delta = (t2 - t1);
+ do_div(delta, QUICK_PIT_ITERATIONS);
+ t3 = (unsigned long) delta;
+ delta = tscmax - tscmin;
+ delta *= 10000;
+ do_div(delta, t3);
+ /* Fail if the deviation is > 2.5 % */
+ if (delta > 250)
+ goto failed;
+ /*
* Ok, if we get here, then we've seen the
* MSB of the PIT decrement QUICK_PIT_ITERATIONS
* times, and each MSB had many hits, so we never

2009-03-15 01:22:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6



Jesper, here's a patch that actually tries to take teh TSC error really
into account, and which I suspect will result (on your machine) in failing
the fast PIT calibration.

It also has a few extra printk's for debugging, and to see just what the
values are on your machine.

The idea behind the patch is to just keep track of how big the difference
was in TSC values between two successive reads of the PIT timer. We only
really care about the difference when the MSB turns around, and we only
really care about the two end points. The maximum error in TSC estimation
will simply be the sum of the differences at those points (d1 and d2).

We can then compare the maximum error with the actual TSC differences
between those points, and see if the max error is within 500 ppm. That
_should_ mean that it all works - assuming that the PIT itself is running
at the correct frequency, of course!

Regardless of whether is succeeds or not, it will print out some debug
messages, which will be interesting to see.

What's nice about this is that it really should make that whole "yes, it's
really within 500ppm" assertion have some solid legs to stand on. Rather
than depend on us being able to read the PIT a certain number of times, we
can literally give an estimation of the max error.

Linus

---
arch/x86/kernel/tsc.c | 41 +++++++++++++++++++++++++++++------------
1 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 599e581..8e1db42 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -273,17 +273,26 @@ static unsigned long pit_calibrate_tsc(u32 latch, unsigned long ms, int loopmin)
* use the TSC value at the transitions to calculate a pretty
* good value for the TSC frequencty.
*/
-static inline int pit_expect_msb(unsigned char val)
+static unsigned long pit_expect_msb(unsigned char val, u64 *tscp, unsigned long *deltap)
{
- int count = 0;
+ int count;
+ u64 tsc = 0;

for (count = 0; count < 50000; count++) {
/* Ignore LSB */
inb(0x42);
if (inb(0x42) != val)
break;
+ tsc = get_cycles();
}
- return count > 50;
+ *deltap = get_cycles() - tsc;
+ *tscp = tsc;
+
+ /*
+ * We require _some_ success, but the quality control
+ * will be based on the error terms on the TSC values.
+ */
+ return count > 5;
}

/*
@@ -297,6 +306,10 @@ static inline int pit_expect_msb(unsigned char val)

static unsigned long quick_pit_calibrate(void)
{
+ u64 t1, t2;
+ unsigned long d1, d2;
+ unsigned char expect = 0xff;
+
/* Set the Gate high, disable speaker */
outb((inb(0x61) & ~0x02) | 0x01, 0x61);

@@ -315,22 +328,24 @@ static unsigned long quick_pit_calibrate(void)
outb(0xff, 0x42);
outb(0xff, 0x42);

- if (pit_expect_msb(0xff)) {
+ if (pit_expect_msb(0xff, &t1, &d1)) {
int i;
- u64 t1, t2, delta;
- unsigned char expect = 0xfe;
+ u64 delta;

- t1 = get_cycles();
+ expect--;
for (i = 0; i < QUICK_PIT_ITERATIONS; i++, expect--) {
- if (!pit_expect_msb(expect))
+ if (!pit_expect_msb(expect, &t2, &d2))
goto failed;
}
- t2 = get_cycles();

/*
- * Make sure we can rely on the second TSC timestamp:
+ * We require the max error on the calibration to be
+ * within 500 ppm, since that's the limit of ntpd
+ * drift correction. So the TSC delta must be more
+ * than 2000x the possible error term (d1+d2).
*/
- if (!pit_expect_msb(expect))
+ delta = t2 - t1;
+ if (d1+d2 > delta >> 11)
goto failed;

/*
@@ -347,12 +362,14 @@ static unsigned long quick_pit_calibrate(void)
* kHz = (t2 - t1) / (QPI * 256 / PIT_TICK_RATE) / 1000
* kHz = ((t2 - t1) * PIT_TICK_RATE) / (QPI * 256 * 1000)
*/
- delta = (t2 - t1)*PIT_TICK_RATE;
+ printk("Fast TSC delta=%lld, error=%lu+%lu=%lu\n", delta, d1, d2, d1+d2);
+ delta *= PIT_TICK_RATE;
do_div(delta, QUICK_PIT_ITERATIONS*256*1000);
printk("Fast TSC calibration using PIT\n");
return delta;
}
failed:
+ printk("Fast TSC calibration failed at %u %llu(%lu) %llu(%lu)\n", expect, t1, d1, t2, d2);
return 0;
}

2009-03-15 15:44:52

by Jesper Krogh

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

Linus Torvalds wrote:
>
> Jesper, here's a patch that actually tries to take teh TSC error really
> into account, and which I suspect will result (on your machine) in failing
> the fast PIT calibration.
>
> It also has a few extra printk's for debugging, and to see just what the
> values are on your machine.
>
> The idea behind the patch is to just keep track of how big the difference
> was in TSC values between two successive reads of the PIT timer. We only
> really care about the difference when the MSB turns around, and we only
> really care about the two end points. The maximum error in TSC estimation
> will simply be the sum of the differences at those points (d1 and d2).
>
> We can then compare the maximum error with the actual TSC differences
> between those points, and see if the max error is within 500 ppm. That
> _should_ mean that it all works - assuming that the PIT itself is running
> at the correct frequency, of course!
>
> Regardless of whether is succeeds or not, it will print out some debug
> messages, which will be interesting to see.


[ 0.000000] Fast TSC delta=34227730, error=6223+6219=12442
[ 0.000000] Fast TSC calibration using PIT
[ 0.000000] Detected 2312.045 MHz processor.

Using "ntpq -c peers" .. the offset steadily grows as time goes.

Full dmesg: http://krogh.cc/~jesper/dmesg-linux-2.6.29-rc8-linus1.txt

jk@quad11:~$ ntpdc -c kerninfo
pll offset: 0.085167 s
pll frequency: -18.722 ppm
maximum error: 0.137231 s
estimated error: 0.008823 s
status: 0001 pll
pll time constant: 6
precision: 1e-06 s
frequency tolerance: 500 ppm



--
Jesper

2009-03-15 18:13:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6



On Sun, 15 Mar 2009, Jesper Krogh wrote:
> Linus Torvalds wrote:
> >
> > Regardless of whether is succeeds or not, it will print out some debug
> > messages, which will be interesting to see.
>
>
> [ 0.000000] Fast TSC delta=34227730, error=6223+6219=12442
> [ 0.000000] Fast TSC calibration using PIT
> [ 0.000000] Detected 2312.045 MHz processor.

Ok. This claims that the error really is smaller than 500ppm (it's about
360 ppm). Which is about what we're aiming for (in real life, the actual
error is about half that - we're just adding up the error terms for
maximum theoretical error).

> Using "ntpq -c peers" .. the offset steadily grows as time goes.
>
> Full dmesg: http://krogh.cc/~jesper/dmesg-linux-2.6.29-rc8-linus1.txt
>
> jk@quad11:~$ ntpdc -c kerninfo
> pll offset: 0.085167 s
> pll frequency: -18.722 ppm
> maximum error: 0.137231 s
> estimated error: 0.008823 s
> status: 0001 pll
> pll time constant: 6
> precision: 1e-06 s
> frequency tolerance: 500 ppm

Hmm. But now it all seems to _work_, no? Or do you still get time resets?
Now your "pll frequency" and "estimated error" are real values, not just
"0s" like in your previous failure cases.

Of course, maybe that happens only after the time reset actually kicks in.

But one thing my patch did - apart from the error estimation - was to
synchronize the TSC read with the actual PIT MSB wrap event. Maybe that
mattered.

The other possibility (if the time reset actually happens) is that your
PIT is simply not running at the expected frequency. That would be really
quite odd, since that nominal 1193181.8181 Hz frequency is very standard,
and has been around foreve.

I do not know how to test that. We need a reference timer to sync to, and
the PIT has traditionally been a _lot_ more reliable than the other timers
in the system (the PM timer may be reliable on modern machines, but almost
certainly not on anything a few years old).

Linus

2009-03-15 18:39:03

by Jesper Krogh

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

Linus Torvalds wrote:
>
> On Sun, 15 Mar 2009, Jesper Krogh wrote:
>> Linus Torvalds wrote:
>>> Regardless of whether is succeeds or not, it will print out some debug
>>> messages, which will be interesting to see.
>>
>> [ 0.000000] Fast TSC delta=34227730, error=6223+6219=12442
>> [ 0.000000] Fast TSC calibration using PIT
>> [ 0.000000] Detected 2312.045 MHz processor.
>
> Ok. This claims that the error really is smaller than 500ppm (it's about
> 360 ppm). Which is about what we're aiming for (in real life, the actual
> error is about half that - we're just adding up the error terms for
> maximum theoretical error).
>
>> Using "ntpq -c peers" .. the offset steadily grows as time goes.
>>
>> Full dmesg: http://krogh.cc/~jesper/dmesg-linux-2.6.29-rc8-linus1.txt
>>
>> jk@quad11:~$ ntpdc -c kerninfo
>> pll offset: 0.085167 s
>> pll frequency: -18.722 ppm
>> maximum error: 0.137231 s
>> estimated error: 0.008823 s
>> status: 0001 pll
>> pll time constant: 6
>> precision: 1e-06 s
>> frequency tolerance: 500 ppm
>
> Hmm. But now it all seems to _work_, no? Or do you still get time resets?

My conclusion was that I would get a time reset after some time since
the offset just increased as time went by (being reasonably small at the
beginning).

I had it up for around 30 minutes... Should I have tested longer?

I went on to trying Thomas Gleixners patch (which seems to do excactly
the same .. ), I'll write a reply in to that message in a few minutes.

--
Jesper

2009-03-15 19:05:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6



On Sun, 15 Mar 2009, Jesper Krogh wrote:
> > >
> > > [ 0.000000] Fast TSC delta=34227730, error=6223+6219=12442
> > > [ 0.000000] Fast TSC calibration using PIT
> > > [ 0.000000] Detected 2312.045 MHz processor.
>
> My conclusion was that I would get a time reset after some time since the
> offset just increased as time went by (being reasonably small at the
> beginning).
>
> I had it up for around 30 minutes... Should I have tested longer?

It would be good to test longer. Your previous emails showed:

2.6.26: time.c: Detected 2311.847 MHz processor.
2.6.29: Detected 2310.029 MHz processor.

where that first one was a successful boot, and the second one was a
failing one. So let's assume that 2311.847 is the "correct" frequency.

The difference between the correct one and your failing one is ~790 ppm,
which is above the 500ppm ntpd threshhold. And as we saw earlier, those
differences were pretty consistent, ie in your list of four successive
boots, the old code consistently gave a frequency error that was roughly
.7 permille off (ie exactly that 700 ppm).

HOWEVER! With that patch you just tried, you got

Detected 2312.045 MHz processor.

and the difference between _that_ and the assumed-correct-one is actually
just 85 ppm. Which should be perfectly fine.

[ With the "test against PM timer, you had:

[ 0.000000] ref_freq: 2311825 pit_freq: 2310386
[ 0.000000] ref_freq: 2311803 pit_freq: 2310190
[ 0.000000] ref_freq: 2311824 pit_freq: 2310080
[ 0.000000] ref_freq: 2311831 pit_freq: 2310130

on four boots, so averaging them gives 2311.82 Mhz, and the 2312.045MHz
you got with the improved fast-PIT code is still _way_ below 500ppm from
that - it's ~95 ppm away.

IOW, the new frequency realy looks likely to work. ]

Quite frankly, we don't know how exact the PM-timer is either - we just
know that the detection is "stable" (but so was the old PIT timer
detection: it was stably at 700ppm lower from the PM timer. So there is
nothing that says that 2311.82Mhz is the "correct" frequency, but we
obviously know from your ntpd saga that it is much closer to correct than
the old 2310.029 was.

End result of all this: I'd really like you to try the modified PIT
frequency code for longer. Also, remember that getting one (or a couple)
"time reset" messages from ntpd while it's trying to sync up is not a
problem per se - it can validly take a while to synchronize. The problem
is literally only if it doesn't synchonize over time at all.

Linus

2009-03-15 19:52:51

by Jesper Krogh

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

Linus Torvalds wrote:
> End result of all this: I'd really like you to try the modified PIT
> frequency code for longer. Also, remember that getting one (or a couple)
> "time reset" messages from ntpd while it's trying to sync up is not a
> problem per se - it can validly take a while to synchronize. The problem
> is literally only if it doesn't synchonize over time at all.

Ok. I'll get it on and report back in 24 hours or so..


--
Jesper

2009-03-15 19:53:48

by Jesper Krogh

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

Thomas Gleixner wrote:

> slow calibration path on every boot on your machine.
>
> (tscmax - tscmin) / avg = 0.064 (result from third run)
>
> On my test machines I get values below 0.02
>
> While it's statistically not really correct we still can use that info
> to catch cases like we see on your machines.
>
>> While booting up I saw this one on the serial console..
>> root@quad12:~# hwclock --systohc
>> Cannot access the Hardware Clock via any known method.
>> Use the --debug option to see the details of our search for an access method.
>> root@quad12:~# hwclock --systohc --debug
>> hwclock from util-linux-ng 2.13.1
>> hwclock: Open of /dev/rtc failed, errno=2: No such file or directory.
>> No usable clock interface found.
>> Cannot access the Hardware Clock via any known method.
>
> Can you provide your .config file please ?

http://krogh.cc/~jesper/config-2.6.29-rc8.txt

I testet the attached patch.. and after 1.5 hours it seems to work. I'll
remain on this one at least a day to see how it works. I'll keep it on
for now and report back in 24 hours or so.

Its still using tsc as clock-source.

Jesper

--
Jesper

2009-03-15 20:36:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6



On Sun, 15 Mar 2009, Jesper Krogh wrote:
>
> I went on to trying Thomas Gleixners patch (which seems to do excactly the
> same .. ), I'll write a reply in to that message in a few minutes.

Side note: no, Thomas' patch doesn't do at all exactly the same. It does
something similar, in that it looks at the time differences between calls
to the whole "wait for the PIT MSB to change" function, but those
differences _could_ in theory be very small, even if the error is very
big.

That's especially true if the PIT read ends up serializing with the PIT,
so that the "wait for MSB" essentially always takes exactly the same
amount of cycles (giving a zero error estimation in Thomas' version), but
the reads themselves can still be quite slow (giving a non-zero error term
in the end result).

IOW, Thomas' patch is good at finding variability in the reads - which
could be the result of SMM interaction, while my patch literally measures
how long it takes to read the MSB change.

Now in practice I suspect the variability in the MSB reads _probably_
correlate reasonably well with how long a single PIT read will take (ie
rather than finding variability due to SMM interaction, it will find
variability due to the "quanitization" effect of the reads taking a
reasonably long time), so I suspect that in many cases Thomas' patch will
error out for the same cases mine does.

But the two patches are rather fundamentally different.

Linus

2009-03-16 18:41:03

by Jesper Krogh

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

Jesper Krogh wrote:
> Thomas Gleixner wrote:
>
>> slow calibration path on every boot on your machine.
>>
>> (tscmax - tscmin) / avg = 0.064 (result from third run)
>>
>> On my test machines I get values below 0.02
>>
>> While it's statistically not really correct we still can use that info
>> to catch cases like we see on your machines.
>>
>>> While booting up I saw this one on the serial console..
>>> root@quad12:~# hwclock --systohc
>>> Cannot access the Hardware Clock via any known method.
>>> Use the --debug option to see the details of our search for an access
>>> method.
>>> root@quad12:~# hwclock --systohc --debug
>>> hwclock from util-linux-ng 2.13.1
>>> hwclock: Open of /dev/rtc failed, errno=2: No such file or directory.
>>> No usable clock interface found.
>>> Cannot access the Hardware Clock via any known method.
>>
>> Can you provide your .config file please ?
>
> http://krogh.cc/~jesper/config-2.6.29-rc8.txt
>
> I testet the attached patch.. and after 1.5 hours it seems to work. I'll
> remain on this one at least a day to see how it works. I'll keep it on
> for now and report back in 24 hours or so.
>
> Its still using tsc as clock-source.

No resets after 24 hours.. it works.

--
Jesper

2009-03-16 19:00:30

by Jesper Krogh

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

Jesper Krogh wrote:
> Linus Torvalds wrote:
>> End result of all this: I'd really like you to try the modified PIT
>> frequency code for longer. Also, remember that getting one (or a
>> couple) "time reset" messages from ntpd while it's trying to sync up
>> is not a problem per se - it can validly take a while to synchronize.
>> The problem is literally only if it doesn't synchonize over time at all.
>
> Ok. I'll get it on and report back in 24 hours or so..

you were right. It works. No resets so far.

--
Jesper

2009-03-16 19:36:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6



On Mon, 16 Mar 2009, Jesper Krogh wrote:
>
> you were right. It works. No resets so far.

Goodie.

Here's a slightly cleaned-up patch that removes the debug messages, and
also re-organizes the code a bit so that it actually uses the "better than
500 ppm" as the way to decide when to stop calibrating.

Why?

I tested the 500 ppm check on some slower machines, and the old algorithm
of just waiting for 15ms actually failed that 500 ppm test. It was _very_
close - 16ms was enough - but it convinced me that the logic was too damn
fragile.

I also think I know why John reported this:

> Ingo, Thomas: On the hardware I'm testing the fast-pit calibration only
> triggers probably 80-90% of the time. About 10-20% of the time, the
> initial check to pit_expect_msb(0xff) fails (count=0), so we may need to
> look more at this approach.

and the reason is that when we re-program the PIT, it will actually take
until the next timer edge (the incoming 1.1MHz timer) for the new values
to take effect. So before the first call to pit_expect_msb(), we should
make sure to delay for at least one PIT cycle. The simplest way to do that
is to simply read the PIT latch once, it will take about 2us.

So this patch fixes that too.

John, does that make the PIT calibration work reliably on your machine?

The patch looks bigger than it is: most of the noise is just
re-indentation and some trivial re-organizing.

Linus

---
arch/x86/kernel/tsc.c | 110 +++++++++++++++++++++++++++++--------------------
1 files changed, 65 insertions(+), 45 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 599e581..d5cebb5 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -273,30 +273,43 @@ static unsigned long pit_calibrate_tsc(u32 latch, unsigned long ms, int loopmin)
* use the TSC value at the transitions to calculate a pretty
* good value for the TSC frequencty.
*/
-static inline int pit_expect_msb(unsigned char val)
+static inline int pit_expect_msb(unsigned char val, u64 *tscp, unsigned long *deltap)
{
- int count = 0;
+ int count;
+ u64 tsc = 0;

for (count = 0; count < 50000; count++) {
/* Ignore LSB */
inb(0x42);
if (inb(0x42) != val)
break;
+ tsc = get_cycles();
}
- return count > 50;
+ *deltap = get_cycles() - tsc;
+ *tscp = tsc;
+
+ /*
+ * We require _some_ success, but the quality control
+ * will be based on the error terms on the TSC values.
+ */
+ return count > 5;
}

/*
- * How many MSB values do we want to see? We aim for a
- * 15ms calibration, which assuming a 2us counter read
- * error should give us roughly 150 ppm precision for
- * the calibration.
+ * How many MSB values do we want to see? We aim for
+ * a maximum error rate of 500ppm (in practice the
+ * real error is much smaller), but refuse to spend
+ * more than 25ms on it.
*/
-#define QUICK_PIT_MS 15
-#define QUICK_PIT_ITERATIONS (QUICK_PIT_MS * PIT_TICK_RATE / 1000 / 256)
+#define MAX_QUICK_PIT_MS 25
+#define MAX_QUICK_PIT_ITERATIONS (MAX_QUICK_PIT_MS * PIT_TICK_RATE / 1000 / 256)

static unsigned long quick_pit_calibrate(void)
{
+ int i;
+ u64 tsc, delta;
+ unsigned long d1, d2;
+
/* Set the Gate high, disable speaker */
outb((inb(0x61) & ~0x02) | 0x01, 0x61);

@@ -315,45 +328,52 @@ static unsigned long quick_pit_calibrate(void)
outb(0xff, 0x42);
outb(0xff, 0x42);

- if (pit_expect_msb(0xff)) {
- int i;
- u64 t1, t2, delta;
- unsigned char expect = 0xfe;
-
- t1 = get_cycles();
- for (i = 0; i < QUICK_PIT_ITERATIONS; i++, expect--) {
- if (!pit_expect_msb(expect))
- goto failed;
+ /*
+ * The PIT starts counting at the next edge, so we
+ * need to delay for a microsecond. The easiest way
+ * to do that is to just read back the 16-bit counter
+ * once from the PIT.
+ */
+ inb(0x42);
+ inb(0x42);
+
+ if (pit_expect_msb(0xff, &tsc, &d1)) {
+ for (i = 1; i <= MAX_QUICK_PIT_ITERATIONS; i++) {
+ if (!pit_expect_msb(0xff-i, &delta, &d2))
+ break;
+
+ /*
+ * Iterate until the error is less than 500 ppm
+ */
+ delta -= tsc;
+ if (d1+d2 < delta >> 11)
+ goto success;
}
- t2 = get_cycles();
-
- /*
- * Make sure we can rely on the second TSC timestamp:
- */
- if (!pit_expect_msb(expect))
- goto failed;
-
- /*
- * Ok, if we get here, then we've seen the
- * MSB of the PIT decrement QUICK_PIT_ITERATIONS
- * times, and each MSB had many hits, so we never
- * had any sudden jumps.
- *
- * As a result, we can depend on there not being
- * any odd delays anywhere, and the TSC reads are
- * reliable.
- *
- * kHz = ticks / time-in-seconds / 1000;
- * kHz = (t2 - t1) / (QPI * 256 / PIT_TICK_RATE) / 1000
- * kHz = ((t2 - t1) * PIT_TICK_RATE) / (QPI * 256 * 1000)
- */
- delta = (t2 - t1)*PIT_TICK_RATE;
- do_div(delta, QUICK_PIT_ITERATIONS*256*1000);
- printk("Fast TSC calibration using PIT\n");
- return delta;
}
-failed:
+ printk("Fast TSC calibration failed\n");
return 0;
+
+success:
+ /*
+ * Ok, if we get here, then we've seen the
+ * MSB of the PIT decrement 'i' times, and the
+ * error has shrunk to less than 500 ppm.
+ *
+ * As a result, we can depend on there not being
+ * any odd delays anywhere, and the TSC reads are
+ * reliable (within the error). We also adjust the
+ * delta to the middle of the error bars, just
+ * because it looks nicer.
+ *
+ * kHz = ticks / time-in-seconds / 1000;
+ * kHz = (t2 - t1) / (I * 256 / PIT_TICK_RATE) / 1000
+ * kHz = ((t2 - t1) * PIT_TICK_RATE) / (I * 256 * 1000)
+ */
+ delta += (long)(d2 - d1)/2;
+ delta *= PIT_TICK_RATE;
+ do_div(delta, i*256*1000);
+ printk("Fast TSC calibration using PIT\n");
+ return delta;
}

/**

2009-03-17 01:43:45

by john stultz

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

On Mon, 2009-03-16 at 12:32 -0700, Linus Torvalds wrote:
> I also think I know why John reported this:
>
> > Ingo, Thomas: On the hardware I'm testing the fast-pit calibration only
> > triggers probably 80-90% of the time. About 10-20% of the time, the
> > initial check to pit_expect_msb(0xff) fails (count=0), so we may need to
> > look more at this approach.
>
> and the reason is that when we re-program the PIT, it will actually take
> until the next timer edge (the incoming 1.1MHz timer) for the new values
> to take effect. So before the first call to pit_expect_msb(), we should
> make sure to delay for at least one PIT cycle. The simplest way to do that
> is to simply read the PIT latch once, it will take about 2us.
>
> So this patch fixes that too.
>
> John, does that make the PIT calibration work reliably on your machine?

Yep, I haven't seen a failure with it so far. And it's the same net
effect change my earlier patch was doing (one extra read cycle) just
without all the conditionals, so it should be fine.

thanks
-john

2009-03-17 08:14:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6


* Linus Torvalds <[email protected]> wrote:

> On Mon, 16 Mar 2009, Jesper Krogh wrote:
> >
> > you were right. It works. No resets so far.
>
> Goodie.
>
> Here's a slightly cleaned-up patch that removes the debug
> messages, and also re-organizes the code a bit so that it
> actually uses the "better than 500 ppm" as the way to decide
> when to stop calibrating.
>
> Why?
>
> I tested the 500 ppm check on some slower machines, and the
> old algorithm of just waiting for 15ms actually failed that
> 500 ppm test. It was _very_ close - 16ms was enough - but it
> convinced me that the logic was too damn fragile.
>
> I also think I know why John reported this:
>
> > Ingo, Thomas: On the hardware I'm testing the fast-pit calibration only
> > triggers probably 80-90% of the time. About 10-20% of the time, the
> > initial check to pit_expect_msb(0xff) fails (count=0), so we may need to
> > look more at this approach.
>
> and the reason is that when we re-program the PIT, it will
> actually take until the next timer edge (the incoming 1.1MHz
> timer) for the new values to take effect. So before the first
> call to pit_expect_msb(), we should make sure to delay for at
> least one PIT cycle. The simplest way to do that is to simply
> read the PIT latch once, it will take about 2us.
>
> So this patch fixes that too.
>
> John, does that make the PIT calibration work reliably on your
> machine?
>
> The patch looks bigger than it is: most of the noise is just
> re-indentation and some trivial re-organizing.

Cool. Will you apply it yourself (in the merge window) or should
we pick it up?

Incidentally, yesterday i wrote a PIT auto-calibration routine
(see WIP patch below).

The core idea is to use _all_ thousands of measurement points
(not just two) to calculate the frequency ratio, with a built-in
noise detector which drops out of the loop if the observed noise
goes below ~10 ppm.

It is free-running: i.e. it observes noise and if the result
stabilizes quickly it can exit quickly. (with an upper bound for
unreliable PITs or virtualized systems, etc.)

It's WIP because it's not working yet (or at all?): i couldnt
get the statistical model right - it's too noisy at 1000-2000
ppm and the frequency result is off by 5000 ppm. Totally against
expectations. I traced it on a box with a good PIT and in the
trace the calculations look sane and the noise levels go down
nicely - except that the result sucks.

I also like yours more because it's simpler.

Ingo

Index: linux/arch/x86/kernel/tsc.c
===================================================================
--- linux.orig/arch/x86/kernel/tsc.c
+++ linux/arch/x86/kernel/tsc.c
@@ -240,63 +240,201 @@ static unsigned long pit_calibrate_tsc(u
}

/*
- * This reads the current MSB of the PIT counter, and
- * checks if we are running on sufficiently fast and
- * non-virtualized hardware.
+ * Rolling statistical analysis of (PIT,TSC) measurement deltas.
*
- * Our expectations are:
- *
- * - the PIT is running at roughly 1.19MHz
- *
- * - each IO is going to take about 1us on real hardware,
- * but we allow it to be much faster (by a factor of 10) or
- * _slightly_ slower (ie we allow up to a 2us read+counter
- * update - anything else implies a unacceptably slow CPU
- * or PIT for the fast calibration to work.
- *
- * - with 256 PIT ticks to read the value, we have 214us to
- * see the same MSB (and overhead like doing a single TSC
- * read per MSB value etc).
- *
- * - We're doing 2 reads per loop (LSB, MSB), and we expect
- * them each to take about a microsecond on real hardware.
- * So we expect a count value of around 100. But we'll be
- * generous, and accept anything over 50.
- *
- * - if the PIT is stuck, and we see *many* more reads, we
- * return early (and the next caller of pit_expect_msb()
- * then consider it a failure when they don't see the
- * next expected value).
- *
- * These expectations mean that we know that we have seen the
- * transition from one expected value to another with a fairly
- * high accuracy, and we didn't miss any events. We can thus
- * use the TSC value at the transitions to calculate a pretty
- * good value for the TSC frequencty.
+ * We use a decaying average to estimate current noise levels.
+ * If noise falls below the expected threshold we exit the loop
+ * with the result.
+ *
+ * If this never happens - for example because the PIT is unreliable,
+ * then we break out after a limit and fail this type of calibration.
+ *
+ * Note that this method observes the statistical noise as-is without
+ * making any assumptions, so it is fundamentally robust against
+ * occasional PIT blips or SMI related system activities that can
+ * disturb calibration. An SMI in the wrong moment pushes up the
+ * noise level and causes the calibration loop to exit a tiny bit
+ * later - but still with a precise and reliable result.
*/
-static inline int pit_expect_msb(unsigned char val)
+static s64 sum_slope;
+static s64 sum_slope_noise;
+static s64 prev_slope;
+
+static int nr_measurements;
+
+#define MAX_MEASUREMENTS 10000
+
+#define MIN_MEASUREMENTS 100
+
+struct entry {
+ u64 tsc;
+ unsigned int pit;
+};
+
+/*
+ * A single measurement is as simple as possible:
+ */
+static inline void do_one_measurement(struct entry *entry)
{
- int count = 0;
+ unsigned char pit_lsb, pit_msb;
+ u64 tsc;

- for (count = 0; count < 50000; count++) {
- /* Ignore LSB */
- inb(0x42);
- if (inb(0x42) != val)
- break;
- }
- return count > 50;
+ /*
+ * We use the PIO accesses as natural TSC serialization barriers:
+ */
+ pit_lsb = inb(0x42);
+ tsc = get_cycles();
+ pit_msb = inb(0x42);
+
+ entry->tsc = tsc;
+ entry->pit = pit_msb*256 + pit_lsb;
+
+ trace_printk("tsc: %Ld, count: %d, nr: %d\n",
+ entry->tsc, entry->pit, nr_measurements);
}

/*
- * How many MSB values do we want to see? We aim for a
- * 15ms calibration, which assuming a 2us counter read
- * error should give us roughly 150 ppm precision for
- * the calibration.
+ * We scale numbers up by 1024 to reduce quantization effects:
*/
-#define QUICK_PIT_MS 15
-#define QUICK_PIT_ITERATIONS (QUICK_PIT_MS * PIT_TICK_RATE / 1000 / 256)
+static unsigned long do_delta_analysis(struct entry *e0, struct entry *e1)
+{
+ s64 slope, dslope;
+ s64 noise;
+ int decay;
+ int dc;
+ s64 dt;
+
+ dt = e1->tsc - e0->tsc; /* TSC is going up */
+ dc = e0->pit - e1->pit; /* PIT counter is going down */
+
+ /*
+ * Delta-PIT-count can be positive (or negative in case of
+ * an anomaly), but we made sure in do_measurement() that
+ * it can never be zero:
+ */
+ slope = 1024 * dt / dc;
+
+ dslope = slope - prev_slope;
+ noise = dslope;
+
+ trace_printk(" dt: %20Ld\n", dt);
+ trace_printk(" dc: %20d\n", dc);
+ trace_printk(" slope: %20Ld\n", slope);
+ trace_printk(" dslope: %20Ld\n", dslope);
+
+ /*
+ * Add a gentle decaying average to the slope and noise averages:
+ */
+ trace_printk(" prev sum_slope: %20Ld\n", sum_slope);

-static unsigned long quick_pit_calibrate(void)
+ /*
+ * Dynamic decay - starts with low values then later on
+ * the system cools down:
+ */
+ decay = 1;
+ if (sum_slope_noise)
+ decay = sum_slope / 64 / sum_slope_noise;
+ decay = min(2000, decay);
+ decay = max(nr_measurements/4, decay);
+
+ sum_slope = ((decay - 1)*sum_slope + slope)/decay;
+ trace_printk(" new sum_slope: %20Ld [decay: %d]\n",
+ sum_slope, decay);
+
+ trace_printk(" prev sum_slope_noise: %20Ld\n", sum_slope_noise);
+ sum_slope_noise = (1023*sum_slope_noise + noise)/1024;
+ trace_printk(" new sum_slope_noise: %20Ld\n", sum_slope_noise);
+
+ prev_slope = slope;
+
+ if (nr_measurements >= 64*MIN_MEASUREMENTS && sum_slope_noise < 10 ) {
+ trace_printk(" => low noise early exit!\n");
+ return 1;
+ }
+
+ return 0;
+}
+
+static int do_measurements(void)
+{
+ unsigned int pit_stuck;
+ unsigned long flags;
+ struct entry e0, e1;
+ int err = 0;
+
+ sum_slope_noise = 0;
+ sum_slope = 0;
+ prev_slope = 0;
+
+ nr_measurements = 0;
+
+ local_irq_save(flags);
+
+ trace_printk("PIT begin\n");
+ do_one_measurement(&e0);
+
+ do_one_measurement(&e0);
+
+ for (;;) {
+ pit_stuck = 0;
+repeat_e1:
+ do_one_measurement(&e1);
+ /*
+ * The typical case is that the PIT advanced a bit
+ * since we last read it (the PIOs take time, etc.).
+ * In case it did not advance (some really fast
+ * PIO implementation or virtualization) we will allow
+ * the count to stay 'stuck' up to 100 times:
+ *
+ * (Note that making sure that the count progresses also
+ * simplifies data processing later on.)
+ */
+ if (e0.pit != e1.pit) {
+ nr_measurements++;
+ if (nr_measurements >= MAX_MEASUREMENTS) {
+ printk("PIT: final count: %d\n", e1.pit);
+ break;
+ }
+ if (do_delta_analysis(&e0, &e1)) {
+ printk("PIT: low-noise count: %d\n", e1.pit);
+ break;
+ }
+ /*
+ * Reuse the second measurement point for the
+ * next delta measurement:
+ */
+ e0 = e1;
+ trace_printk("\n");
+ continue;
+ }
+ if (pit_stuck++ < 100)
+ goto repeat_e1;
+
+ printk(KERN_INFO "PIT auto-calibration: counter stuck at %d!\n",
+ e1.pit);
+ err = -EINVAL;
+ }
+
+ trace_printk("PIT end\n");
+ local_irq_restore(flags);
+
+ return err;
+}
+
+static unsigned long auto_pit_calibrate(void)
+{
+ if (do_measurements() < 0)
+ return 0;
+
+ printk("PIT: sum_slope: %Ld\n", sum_slope);
+ printk("PIT: Hz: %Ld\n", sum_slope * PIT_TICK_RATE);
+ printk("PIT: sum_slope_noise: %Ld\n", sum_slope_noise);
+ printk("PIT: nr_measurements: %d\n", nr_measurements);
+
+ return sum_slope * PIT_TICK_RATE / 1024 / 1000;
+}
+
+unsigned long quick_pit_calibrate(void)
{
/* Set the Gate high, disable speaker */
outb((inb(0x61) & ~0x02) | 0x01, 0x61);
@@ -316,45 +454,7 @@ static unsigned long quick_pit_calibrate
outb(0xff, 0x42);
outb(0xff, 0x42);

- if (pit_expect_msb(0xff)) {
- int i;
- u64 t1, t2, delta;
- unsigned char expect = 0xfe;
-
- t1 = get_cycles();
- for (i = 0; i < QUICK_PIT_ITERATIONS; i++, expect--) {
- if (!pit_expect_msb(expect))
- goto failed;
- }
- t2 = get_cycles();
-
- /*
- * Make sure we can rely on the second TSC timestamp:
- */
- if (!pit_expect_msb(expect))
- goto failed;
-
- /*
- * Ok, if we get here, then we've seen the
- * MSB of the PIT decrement QUICK_PIT_ITERATIONS
- * times, and each MSB had many hits, so we never
- * had any sudden jumps.
- *
- * As a result, we can depend on there not being
- * any odd delays anywhere, and the TSC reads are
- * reliable.
- *
- * kHz = ticks / time-in-seconds / 1000;
- * kHz = (t2 - t1) / (QPI * 256 / PIT_TICK_RATE) / 1000
- * kHz = ((t2 - t1) * PIT_TICK_RATE) / (QPI * 256 * 1000)
- */
- delta = (t2 - t1)*PIT_TICK_RATE;
- do_div(delta, QUICK_PIT_ITERATIONS*256*1000);
- printk("Fast TSC calibration using PIT\n");
- return delta;
- }
-failed:
- return 0;
+ return auto_pit_calibrate();
}

/**

2009-03-17 16:00:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6



On Tue, 17 Mar 2009, Ingo Molnar wrote:
>
> Cool. Will you apply it yourself (in the merge window) or should
> we pick it up?

I'll commit it. I already split it into two commits - one for the trivial
startup problem that John had, one for the "estimate error and exit when
smaller than 500ppm" part.

> Incidentally, yesterday i wrote a PIT auto-calibration routine
> (see WIP patch below).
>
> The core idea is to use _all_ thousands of measurement points
> (not just two) to calculate the frequency ratio, with a built-in
> noise detector which drops out of the loop if the observed noise
> goes below ~10 ppm.

I suspect that reaching 10 ppm is going to take too long in general.
Considering that I found a machine where reaching 500ppm took 16ms,
getting to 10ppm would take almost a second. That's a long time at bootup,
considering that people want the whole boot to take about that time ;)

I also do think it's a bit unnecessarily complicated. We really only care
about the end points - obviously we can end up being unlucky and get a
very noisy end-point due to something like SMI or virtualization, but if
that happens, we're really just better off failing quickly instead, and
we'll go on to the slower calibration routines.

On real hardware without SMI or virtualization overhead, the delays
_should_ be very stable. On my main machine, for example, the PIT read
really seems very stable at about 2.5us (which matches the expectation
that one 'inb' should take roughly one microsecond pretty closely). So
that should be the default case, and the case that the fast calibration is
designed for.

For the other cases, we really can just exit and do something else.

> It's WIP because it's not working yet (or at all?): i couldnt
> get the statistical model right - it's too noisy at 1000-2000
> ppm and the frequency result is off by 5000 ppm.

I suspect your measurement overhead is getting noticeable. You do all
those divides, but even more so, you do all those traces. Also, it looks
like you do purely local pairwise analysis at subsequent PIT modelling
points, which can't work - you need to average over a long time to
stabilize it.

So you _can_ do something like what you do, but you'd need to find a
low-noise start and end point, and do analysis over that longer range
instead of trying to do it over individual cases.

> I also like yours more because it's simpler.

In fact, it's much simpler than what we used to do. No real assumptions
about how quickly we can read the PIT, no need for magic values ("we can
distinguish a slow virtual environment from real hardware by the fact that
we can do at least 50 PIT reads in one cycle"), no nothing. Just a simple
"is it below 500ppm yet?".

(Well, technically, it compares to 1 in 2048 rather than 500 in a million,
since that is much cheaper, so it's really looking for "better than
488ppm")

Linus

2009-03-17 16:13:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6


* Linus Torvalds <[email protected]> wrote:

> On Tue, 17 Mar 2009, Ingo Molnar wrote:
> >
> > Cool. Will you apply it yourself (in the merge window) or should
> > we pick it up?
>
> I'll commit it. I already split it into two commits - one for the
> trivial startup problem that John had, one for the "estimate error
> and exit when smaller than 500ppm" part.

ok.

> > Incidentally, yesterday i wrote a PIT auto-calibration routine
> > (see WIP patch below).
> >
> > The core idea is to use _all_ thousands of measurement points
> > (not just two) to calculate the frequency ratio, with a built-in
> > noise detector which drops out of the loop if the observed noise
> > goes below ~10 ppm.
>
> I suspect that reaching 10 ppm is going to take too long in
> general. Considering that I found a machine where reaching 500ppm
> took 16ms, getting to 10ppm would take almost a second. That's a
> long time at bootup, considering that people want the whole boot
> to take about that time ;)
>
> I also do think it's a bit unnecessarily complicated. We really
> only care about the end points - obviously we can end up being
> unlucky and get a very noisy end-point due to something like SMI
> or virtualization, but if that happens, we're really just better
> off failing quickly instead, and we'll go on to the slower
> calibration routines.

That's the idea of my patch: to use not two endpoints but thousands
of measurement points. That way we dont have to worry about the
precision of the endpoints - any 'bad' measurement will be
counter-acted by thousands of 'good' measurements.

That's the theory at least - practice got in my way ;-)

By measuring more we can get a more precise result, and we also do
not assume anything about how much time passes between two
measurement points. A single measurement is:

+ /*
+ * We use the PIO accesses as natural TSC serialization barriers:
+ */
+ pit_lsb = inb(0x42);
+ tsc = get_cycles();
+ pit_msb = inb(0x42);

Just like we can prove that there's an exoplanet around a star, just
by doing a _ton_ of measurements of a very noisy data source. As
long as there's an underlying physical value to be measured (and we
are not measuring pure noise) that value is recoverable, with enough
measurements.

> On real hardware without SMI or virtualization overhead, the
> delays _should_ be very stable. On my main machine, for example,
> the PIT read really seems very stable at about 2.5us (which
> matches the expectation that one 'inb' should take roughly one
> microsecond pretty closely). So that should be the default case,
> and the case that the fast calibration is designed for.
>
> For the other cases, we really can just exit and do something
> else.
>
> > It's WIP because it's not working yet (or at all?): i couldnt
> > get the statistical model right - it's too noisy at 1000-2000
> > ppm and the frequency result is off by 5000 ppm.
>
> I suspect your measurement overhead is getting noticeable. You do
> all those divides, but even more so, you do all those traces.
> Also, it looks like you do purely local pairwise analysis at
> subsequent PIT modelling points, which can't work - you need to
> average over a long time to stabilize it.

Actually, it's key to my trick that what happens _between_ the
measurement points does not matter _at all_.

My 'delta' algorithm does not assume anything about how much time
passes between two measurement points - it calculates the slope and
keeps a rolling average of that slope.

That's why i could put the delta analysis there. We are capturing
thousands of measurement points, and what matters is the precision
of the 'pair' of (PIT,TSC) timestamp measurements.

I got roughly the same end result noise and the same anomalies with
tracing enabled and disabled. (and the number of data points was cut
in half with tracing enabled)

Ingo

2009-03-17 16:33:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6



On Tue, 17 Mar 2009, Ingo Molnar wrote:
>
> That's the idea of my patch: to use not two endpoints but thousands
> of measurement points.

Umm. Except you don't.

> By measuring more we can get a more precise result, and we also do
> not assume anything about how much time passes between two
> measurement points.

That's fine, but your actual code doesn't _do_ that.

> My 'delta' algorithm does not assume anything about how much time
> passes between two measurement points - it calculates the slope and
> keeps a rolling average of that slope.

No, you keep a very bad measure of "some kind of random average of the
last few points", which - if I read things right:

- lacks precision (you really need to use 'double' floating point to do
it well, otherwise the rounding errors will kill you). You seem to be
aiming for a 10-bit fixed point thing, which may or may not work if
done cleverly, but:

- seems to be based on a rather weak averaging function which certainly
will lose data over time.

The thing is, the only _accurate_ average is the one done over long time
distances. It's very true that your slope thing works very well over such
long times, and you'd get accurate measurement if you did it that way, BUT
THAT IS NOT WHAT YOU DO. You have a very tight loop, so you get very bad
slopes, and then you use a weak averaging function to try to make them
better, but it never does.

Also, there seems to be a fundamental bug in your PIT reading routine. My
fast-TSC calibration only looks at the MSB of the PIT read for a very good
reason: if you don't use the explicit LATCH command, you may be getting
the MSB of one counter value, and then the LSB of another. So your PIT
read can easily be off by ~256 PIT cycles. Only by caring only for the MSB
can you do an unlatched read!

That is why pit_expect_msb() looks for the "edge" where the MSB changes,
and never actually looks at the LSB.

This issue may be an additional reason for your problems, although maybe
your noise correction will be able to avoid those cases.

Linus

2009-03-17 16:41:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6


* Linus Torvalds <[email protected]> wrote:

> On Tue, 17 Mar 2009, Ingo Molnar wrote:
> >
> > That's the idea of my patch: to use not two endpoints but thousands
> > of measurement points.
>
> Umm. Except you don't.
>
> > By measuring more we can get a more precise result, and we also do
> > not assume anything about how much time passes between two
> > measurement points.
>
> That's fine, but your actual code doesn't _do_ that.
>
> > My 'delta' algorithm does not assume anything about how much time
> > passes between two measurement points - it calculates the slope and
> > keeps a rolling average of that slope.
>
> No, you keep a very bad measure of "some kind of random average of the
> last few points", which - if I read things right:
>
> - lacks precision (you really need to use 'double' floating point to do
> it well, otherwise the rounding errors will kill you). You seem to be
> aiming for a 10-bit fixed point thing, which may or may not work if
> done cleverly, but:
>
> - seems to be based on a rather weak averaging function which certainly
> will lose data over time.
>
> The thing is, the only _accurate_ average is the one done over
> long time distances. It's very true that your slope thing works
> very well over such long times, and you'd get accurate measurement
> if you did it that way, BUT THAT IS NOT WHAT YOU DO. You have a
> very tight loop, so you get very bad slopes, and then you use a
> weak averaging function to try to make them better, but it never
> does.

Hm, the intention there was to have a memory of ~1000 entries via a
decaying average of 1:1000.

In parallel to that there's also a noise estimator (which too decays
over time). So basically when observed noise is very low we
essentially use the data from the last ~1000 measurements. (well,
not exactly - as the 'memory' of more recent data will be stronger
than that of older ones.)

Again ... it's a clearly non-working patch so it's not really a
defendable concept :-)

> Also, there seems to be a fundamental bug in your PIT reading
> routine. My fast-TSC calibration only looks at the MSB of the PIT
> read for a very good reason: if you don't use the explicit LATCH
> command, you may be getting the MSB of one counter value, and then
> the LSB of another. So your PIT read can easily be off by ~256 PIT
> cycles. Only by caring only for the MSB can you do an unlatched
> read!
>
> That is why pit_expect_msb() looks for the "edge" where the MSB
> changes, and never actually looks at the LSB.
>
> This issue may be an additional reason for your problems, although
> maybe your noise correction will be able to avoid those cases.

indeed. I did check the trace results though via gnuplot yesterday
(suspectig PIT readout outliers) and there were no outliers.

For any final patch it's still a showstopper issue.

But the source of error and miscalibration is elsewhere.

Ingo

2009-03-17 17:28:26

by Olivier Galibert

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

On Tue, Mar 17, 2009 at 05:13:22PM +0100, Ingo Molnar wrote:
> That's why i could put the delta analysis there. We are capturing
> thousands of measurement points, and what matters is the precision
> of the 'pair' of (PIT,TSC) timestamp measurements.
>
> I got roughly the same end result noise and the same anomalies with
> tracing enabled and disabled. (and the number of data points was cut
> in half with tracing enabled)

Any reason for not doing a bog-standard linear regression?

OG.

2009-03-21 09:12:17

by Jesper Krogh

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6

Linus Torvalds wrote:
>
> On Mon, 16 Mar 2009, Jesper Krogh wrote:
>> you were right. It works. No resets so far.
>
> Goodie.
>
> Here's a slightly cleaned-up patch that removes the debug messages, and
> also re-organizes the code a bit so that it actually uses the "better than
> 500 ppm" as the way to decide when to stop calibrating.

Can we ship:
commit a6a80e1d8cf82b46a69f88e659da02749231eb36
Author: Linus Torvalds <[email protected]>
Date: Tue Mar 17 07:58:26 2009 -0700

Fix potential fast PIT TSC calibration startup glitch

and
commit 9e8912e04e612b43897b4b722205408b92f423e5
Author: Linus Torvalds <[email protected]>
Date: Tue Mar 17 08:13:17 2009 -0700

Fast TSC calibration: calculate proper frequency error bounds


to the 2.6.28-stable series.. The first one needed to apply the second.

Jesper
--
Jesper

2009-03-21 10:07:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: Linux 2.6.29-rc6


* Jesper Krogh <[email protected]> wrote:

> Linus Torvalds wrote:
>>
>> On Mon, 16 Mar 2009, Jesper Krogh wrote:
>>> you were right. It works. No resets so far.
>>
>> Goodie.
>>
>> Here's a slightly cleaned-up patch that removes the debug messages, and
>> also re-organizes the code a bit so that it actually uses the "better
>> than 500 ppm" as the way to decide when to stop calibrating.
>
> Can we ship:
> commit a6a80e1d8cf82b46a69f88e659da02749231eb36
> Author: Linus Torvalds <[email protected]>
> Date: Tue Mar 17 07:58:26 2009 -0700
>
> Fix potential fast PIT TSC calibration startup glitch
>
> and
> commit 9e8912e04e612b43897b4b722205408b92f423e5
> Author: Linus Torvalds <[email protected]>
> Date: Tue Mar 17 08:13:17 2009 -0700
>
> Fast TSC calibration: calculate proper frequency error bounds
>
>
> to the 2.6.28-stable series.. The first one needed to apply the second.

Yes, would be nice to have these fixes in .28.9.

Ingo