2003-03-25 16:21:04

by Fionn Behrens

Subject: System time warping around real time problem - please help


Hello all,


I have got an increasingly annoying problem with our fairly new (fall
'02) Dual Athlon2k+ Gigabyte 7dpxdw linux system running 2.4.20.
The only kernel patch applied is Alan Cox's ptrace patch.

To say it right away: the system is not overclocked or anything and
never was. It has decent cooling and is used as a combined workstation
and server.

I can't say exactly when it started, but the system clock tends to begin
jumping around real time in an erratic manner, usually after about 12-48
hours of uptime. The maximum time jump is about 5 seconds back or forth,
so the time is always "about" right.
To give you an example to visualize, you can watch asclock in X and see
the second clock-hand jumping like 3 seconds backwards, then 5 seconds
forth, 2 back and 1 forth or so within 2 or 3 seconds.
For a demonstration I wrote the following short example in python:

from time import time

t = 0
while 1:
    n = time()
    if t > n:  # the clock went backwards since the last reading
        print("%.2f > %.2f" % (t, n))
    t = n

Running this loop returned the following lines:

1048608745.61 > 1048608745.60
1048608745.63 > 1048608745.62
1048608745.65 > 1048608745.64
1048608748.23 > 1048608745.67
1048608748.27 > 1048608745.71
1048608748.30 > 1048608745.74
1048608748.34 > 1048608745.78
1048608748.42 > 1048608745.86
1048608748.47 > 1048608745.91
1048608748.52 > 1048608745.96
[----cut----]

So you see the time() on this system is constantly overtaking itself and
jumping back. It almost looks like two parallel time()s are there and it
switches back and forth between them.

I recompiled the kernel, upgraded the BIOS to the latest version
available, disabled ntp and tried a few more things I can't recall right
now - no success. Due to the erratic timer, working on the machine is no
fun. Software crashes regularly, naturally: no programmer expects system
timers to go back in time.

I am pretty desperate and I'd appreciate any hints on what to check.
I'll gladly provide any system detail that you might need for a proper
analysis on request, by email or on freenode (Fionn).

Thank you in advance,
F. Behrens (Not a subscriber of this list)
--
I believe we have been called by history to lead the world.
G.W. Bush, 2002-03-01


2003-03-25 16:56:57

by Richard B. Johnson

Subject: Re: System time warping around real time problem - please help

On Tue, 25 Mar 2003, Fionn Behrens wrote:

>
> Hello all,
>
>
> I have got an increasingly annoying problem with our fairly new (fall
> '02) Dual Athlon2k+ Gigabyte 7dpxdw linux system running 2.4.20.
> The only kernel patch applied is Alan Cox's ptrace patch.
>

I am using the exact same kernel (a lot of folks are). There
is no such jumping on my system.
Try this program:

#include <stdio.h>
#include <time.h>

int main(void)
{
    time_t x, y;

    (void)time(&y);
    for (;;) {
        (void)time(&x);
        if (x < y)  /* time went backwards */
            printf("Prev %ld New %ld\n", (long)y, (long)x);
        y = x;
    }
    return 0;
}

If this shows time jumping around you have one of the following:

(1) Bad timer channel 0 chip (PIT).
(2) Some daemon trying to sync time with another system.
(3) You are traveling too close to the speed of light.

Now, your script shows time in fractional seconds.

> 1048608745.61 > 1048608745.60

You can modify the program to do this:


#include <stdio.h>
#include <sys/time.h>

/* Current time in microseconds since the epoch. */
static double now_us(void)
{
    struct timeval tv;

    (void)gettimeofday(&tv, NULL);
    return (double)tv.tv_sec * 1e6 + (double)tv.tv_usec;
}

int main(void)
{
    double x, y = now_us();

    for (;;) {
        x = now_us();
        if (x < y)  /* time went backwards */
            printf("Prev %f New %f\n", y, x);
        y = x;
    }
    return 0;
}

There should be no jumping around -- and there isn't on
any system I've tested this on.

> Software crashes are regularly - naturally. No programmer expects system
> timers going back in time.
>

Hmmm, software should never crash. Even if the timers jump backwards
as you say, they should eventually time-out. If you have crashes, this
may point to other hardware problems as well.
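
One way a timeout can survive a clock that steps backwards is to never
let the observed time decrease. The following is only a minimal sketch
and not code from this thread: mono_clamp, mono_clamp_update and
timed_out are hypothetical names, and timestamps are assumed to be
microseconds, as one would build them from gettimeofday().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: remembers the highest timestamp seen so far. */
struct mono_clamp {
    int64_t last_us;
};

/* Feed in a raw reading; returns a value that never decreases. */
static int64_t mono_clamp_update(struct mono_clamp *c, int64_t now_us)
{
    if (now_us > c->last_us)
        c->last_us = now_us;
    return c->last_us;
}

/* Returns 1 once at least timeout_us have passed since start_us,
 * judged against the clamped time rather than the raw reading. */
static int timed_out(struct mono_clamp *c, int64_t start_us,
                     int64_t now_us, int64_t timeout_us)
{
    assert(timeout_us >= 0);
    return mono_clamp_update(c, now_us) - start_us >= timeout_us;
}
```

A caller would keep one struct mono_clamp (initialised to zero) per
clock and pass every raw reading through it. A backward step then
delays the timeout at worst; it cannot make elapsed time negative.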


Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.

2003-03-25 17:06:23

by Tim Schmielau

Subject: Re: System time warping around real time problem - please help

On Tue, 25 Mar 2003, Richard B. Johnson wrote:

> On Tue, 25 Mar 2003, Fionn Behrens wrote:
>
> > I have got an increasingly annoying problem with our fairly new (fall
> > '02) Dual Athlon2k+ Gigabyte 7dpxdw linux system running 2.4.20.
> > The only kernel patch applied is Alan Cox's ptrace patch.
> >
>
[...]
> If this shows time jumping around you have one of either:
>
> (1) Bad timer channel 0 chip (PIT).
> (2) Some daemon trying to sync time with another system.
> (3) You are traveling too close to the speed of light.

(4) Unsync'ed TSCs?

See help text for CONFIG_X86_TSC_DISABLE. Never had this problem
myself, though.
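
To illustrate why unsynchronised counters would produce exactly the
"two parallel clocks" pattern reported above, here is a toy model. It is
a sketch under stated assumptions, not kernel code: read_counter and
count_backward_steps are hypothetical names, both simulated CPUs tick at
the same rate, CPU 1 lags CPU 0 by a fixed skew, and the reader
alternates between the two CPUs on every sample.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated per-CPU cycle counter: CPU 1 lags CPU 0 by `skew` cycles. */
static uint64_t read_counter(int cpu, uint64_t true_cycles, uint64_t skew)
{
    return cpu == 0 ? true_cycles : true_cycles - skew;
}

/* Take `samples` readings, alternating between the two CPUs, while true
 * time advances by `step` cycles per reading. Returns how many readings
 * were lower than the one before. */
static int count_backward_steps(int samples, uint64_t step, uint64_t skew)
{
    uint64_t true_cycles = skew;  /* start here so the subtraction cannot underflow */
    uint64_t prev = 0;
    int backwards = 0;

    assert(step > 0);
    for (int i = 0; i < samples; i++) {
        uint64_t now = read_counter(i % 2, true_cycles, skew);

        if (i > 0 && now < prev)
            backwards++;
        prev = now;
        true_cycles += step;
    }
    return backwards;
}
```

In this model time only appears to run backwards when the skew between
the counters is larger than the interval between two readings, which is
consistent with a tight polling loop being the quickest way to see it.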

2003-03-25 18:01:31

by Fionn Behrens

Subject: Re: System time warping around real time problem - please help

On Tue, 2003-03-25 at 18:07, Richard B. Johnson wrote:
> On Tue, 25 Mar 2003, Fionn Behrens wrote:

> > I have got an increasingly annoying problem with our fairly new (fall
> > '02) Dual Athlon2k+ Gigabyte 7dpxdw linux system running 2.4.20.

> I am using the exact same kernel (a lot of folks are). There
> is no such jumping on my system.
> Try this program:

[... prg1.c ...]

> If this shows time jumping around you have one of either:
>
> (1) Bad timer channel 0 chip (PIT).
> (2) Some daemon trying to sync time with another system.
> (3) You are traveling too close to the speed of light.

It just exits immediately with exit code 1. (*shrug*)

> Now, your script shows time in fractional seconds.
>
> > 1048608745.61 > 1048608745.60
>
> You can modify the program to do this:

[... prg2.c ...]

> There should be no jumping around -- and there isn't on
> any system I've tested this on.

When I run this code it begins to put out Prev N New M lines.

Prev 1048615862810879.000000 New 1048615862759879.000000
Prev 1048615862870879.000000 New 1048615862819878.000000
Prev 1048615862900879.000000 New 1048615862849902.000000
Prev 1048615862960882.000000 New 1048615862909875.000000
[-------- cut --------]

After a few seconds of run time random processes on my machine begin to
crash, or I get kernel oopses and kernel freezes. Looks very much like
heavy use of gettimeofday() causes random writes in system memory.

> > Software crashes are regularly - naturally. No programmer expects system
> > timers going back in time.

> Hmmm, software should never crash. Even if the timers jump backwards
> as you say, they should eventually time-out. If you have crashes, this
> may point to other hardware problems as well.

E.g. which type of hardware problem?

Thanks a million for your help so far, it is great to see how fast
people are responding!

I'll evaluate that other suggestion about TSC_DISABLE now and will get
back to you as soon as I can tell you more.

Kind regards,
F. Behrens

2003-03-25 18:18:36

by Richard B. Johnson

Subject: Re: System time warping around real time problem - please help

On Tue, 25 Mar 2003, Fionn Behrens wrote:

> On Tue, 2003-03-25 at 18:07, Richard B. Johnson wrote:
> > On Tue, 25 Mar 2003, Fionn Behrens wrote:
>
> > > I have got an increasingly annoying problem with our fairly new (fall
> > > '02) Dual Athlon2k+ Gigabyte 7dpxdw linux system running 2.4.20.
>
> > I am using the exact same kernel (a lot of folks are). There
> > is no such jumping on my system.
> > Try this program:
>
> [... prg1.c ...]
>
> > If this shows time jumping around you have one of either:
> >
> > (1) Bad timer channel 0 chip (PIT).
> > (2) Some daemon trying to sync time with another system.
> > (3) You are traveling too close to the speed of light.
>
> It just exits immediately with exit code 1. (*shrug*)
>

Hmmm. Note that the for(;;) { } provides no exit path.
So, you probably have some bad RAM or your CPU is too
hot (broken fan??), or something like that.


> > Now, your script shows time in fractional seconds.
> >
> > > 1048608745.61 > 1048608745.60
> >
> > You can modify the program to do this:
>
> [... prg2.c ...]
>
> > There should be no jumping around -- and there isn't on
> > any system I've tested this on.
>
> When I run this code it begins to put out Prev N New M lines.
>
> Prev 1048615862810879.000000 New 1048615862759879.000000
> Prev 1048615862870879.000000 New 1048615862819878.000000
> Prev 1048615862900879.000000 New 1048615862849902.000000
> Prev 1048615862960882.000000 New 1048615862909875.000000
> [-------- cut --------]
>
> After a few seconds of run time random processes on my machine begin to
> crash, or I get kernel oopses and kernel freezes. Looks very much like
> heavy use of gettimeofday() causes random writes in system memory.
>

Looks very much like you have a real bad hardware problem.


>
> E.g. which type of hardware problem?
>

Look inside and see if your CPU fan has stopped. Also move your RAM
sticks around after wiping any dirt off the contacts. Since the
machine used to work last fall, it's probably just a fan or RAM
problem.


> Thanks a million for your help so far, it is great to experience how
> fast people are respoding!
>
> I'll evaluate that other suggestion about TSC_DISABLE now and will get
> back to you as soon as I can tell you more.
>

I doubt that this will help you, but it's worth trying.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.

2003-03-25 21:05:50

by Fionn Behrens

Subject: Re: System time warping around real time problem - please help

Richard B. Johnson wrote:

> On Tue, 25 Mar 2003, Fionn Behrens wrote:
> > On Tue, 2003-03-25 at 18:07, Richard B. Johnson wrote:
> > > On Tue, 25 Mar 2003, Fionn Behrens wrote:
> >
> > > > I have got an increasingly annoying problem with our fairly new
> > > > (fall '02) Dual Athlon2k+ Gigabyte 7dpxdw linux system running
> > > > 2.4.20.
> >
> > > I am using the exact same kernel (a lot of folks are). There
> > > is no such jumping on my system.
> > > Try this program:
> >
> > [... prg1.c ...]
> >
> > > If this shows time jumping around you have one of either:
> > >
> > > (1) Bad timer channel 0 chip (PIT).
> > > (2) Some daemon trying to sync time with another system.
> > > (3) You are traveling too close to the speed of light.
> >
> > It just exits immediately with exit code 1. (*shrug*)

> Hmmm. Note that the for(;;) { } provides no exit path.

I did notice that, and investigated the issue using ddd. Funnily enough
the program runs fine in ddd until X crashes. But in the shell it still
behaves as if it were nothing but exit(1);

> So, you probably have some bad RAM or your CPU is too
> hot (broken fan??), or something like that.

None of the above. The system is liquid cooled and subject to continuous
thermal monitoring. The RAM is 1GB Infineon ECC. Before the weekend I
had the machine running overnight with memtest86 - 14 hours, all tests
activated. Not a single error.
I also tried an endless kernel compile loop the other day and the
machine compiled about 100 kernels in approx two hours without a hitch.

> > [... prg2.c ...]
> >
> > When I run this code it begins to put out Prev N New M lines.

> > Prev 1048615862810879.000000 New 1048615862759879.000000

> > After a few seconds of run time random processes on my machine begin
> > to crash, or I get kernel oopses and kernel freezes. Looks very
> > much like heavy use of gettimeofday() causes random writes in system
> > memory.

> Looks very much like you have a real bad hardware problem.

Just what, that is the question. After having activated the notsc
feature the system has not yet shown the warp symptoms, but as I noted
in the beginning it may well take a day or two for that to happen.

Yet still, running the first (in ddd) or second test programs - despite
the current absence of any error message - causes random processes to
crash until the program is being stopped (by a crashed terminal, X or
kernel, that is).

Oddly enough, the system runs pretty stable for at least days of normal
use as long as the clock symptoms don't show up (and you don't run those
test programs). That means it has not crashed a lot recently; I have just
been rebooting it because of the jumping clock annoyance, which -
among other things - results in sluggish UI components and frequent
short freezes in ssh connections.

> > E.g. which type of hardware problem?
> Since the machine used to work last fall, It's probably just a
> FAN or RAM problems.

I'll swap the RAM sticks around for now but I suspect it's something
else. I just still fail to grasp how calls to gettimeofday() are able
to cause random writes to memory...

Summary:
- No apparent hardware issue.
- System runs stable as long as you don't for (;;) gettimeofday();
- notsc being evaluated. I will get back to you later.
Does not resolve the odd test software crash, though.

Kind regards,
Fionn

P.S.: Please keep sending me a Cc:, I grabbed this one from the archive

2003-03-25 22:06:32

by George Anzinger

Subject: Re: System time warping around real time problem - please help

Fionn Behrens wrote:
> Richard B. Johnson wrote:
>
>
>>On Tue, 25 Mar 2003, Fionn Behrens wrote:
>>
>>>On Tue, 2003-03-25 at 18:07, Richard B. Johnson wrote:
>>>
>>>>On Tue, 25 Mar 2003, Fionn Behrens wrote:
>>>
>>>>>I have got an increasingly annoying problem with our fairly new
>>>>>(fall '02) Dual Athlon2k+ Gigabyte 7dpxdw linux system running
>>>>>2.4.20.
>>>
>>>>I am using the exact same kernel (a lot of folks are). There
>>>>is no such jumping on my system.
>>>>Try this program:
>>>
>>>[... prg1.c ...]
>>>
>>>
>>>>If this shows time jumping around you have one of either:
>>>>
>>>>(1) Bad timer channel 0 chip (PIT).
>>>>(2) Some daemon trying to sync time with another system.
>>>>(3) You are traveling too close to the speed of light.
>>>
>>>It just exits immediately with exit code 1. (*shrug*)
>
>
>>Hmmm. Note that the for(;;) { } provides no exit path.
>
>
> I noticed that well and investigated the issue using ddd. Funnily enough
> the program runs well in ddd until X crashes. But in the shell it still
> behaves like it would be nothing but exit(1);
>
>
>>So, you probably have some bad RAM or your CPU is too
>>hot (broken fan??), or something like that.
>
>
> None of the above. The system is liquid cooled and subject to contiuous
> thermal monitoring. The RAM is 1GB Infineon ECC. Before the weekend I
> had the machine running overnight with memtest86 - 14 hours, all tests
> activated. Not a single error.
> I also tried an endless kernel compile loop the other day and the
> machine compiled about 100 kernels in approx two hours without a hitch.
>
>
>>>[... prg2.c ...]
>>>
>>>When I run this code it begins to put out Prev N New M lines.
>
>
>>>Prev 1048615862810879.000000 New 1048615862759879.000000
>
>
>>>After a few seconds of run time random processes on my machine begin
>>>to crash, or I get kernel oopses and kernel freezes. Looks very
>>>much like heavy use of gettimeofday() causes random writes in system
>>>memory.
>
>
>>Looks very much like you have a real bad hardware problem.
>
>
> Just what, that is the question. After having activated the notsc
> feature the system has not yet exposed the warp symptons but as I noted
> in the beginning it may well take a day or two for that to happen.
>
> Yet still, running the first (in ddd) or second test programs - despite
> the current absence of any error message - causes random processes to
> crash until the program is being stopped (by a crashed terminal, X or
> kernel, that is).
>
> Oddly enough, the system runs pretty stable for at least days of normal
> use as long as the clock symptoms dont show up (and you dont run those
> test programs). Which means it has not crashed a lot recently, just
> being rebooted by me because of the jumping clock annoyance which -
> among others - results in sluggishly behaving UI components and frequent
> short connection freezes in ssh connections.
>
>
>>>E.g. which type of hardware problem?
>>
>>Since the machine used to work last fall, It's probably just a
>>FAN or RAM problems.
>
>
> I'll swap the RAM sticks around for now but I suspect its something
> else. I just still fail to grasp how calls to gettimeofday() are able
> to cause random writes to memory...
>
> Summary:
> - No apparent hardware issue.
> - System runs stable as long as you dont for (;;) gettimeofday();
> - notsc being evaluated. I will get back to you later.
> Does not resolve the odd test software crash, though.
>
> Kind regards,
> Fionn
>
> P.S.: Please keep sending me a Cc:, I grabbed this one from the archive
> -
This all sounds very much like the TSCs are drifting WRT each other.
Is it possible that you have some power management code (or hardware)
that is slowing one cpu and not the other?

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2003-03-25 22:44:26

by Fionn Behrens

Subject: Re: System time warping around real time problem - please help

On Tue, 2003-03-25 at 23:14, george anzinger wrote:
> Fionn Behrens wrote:

> > Summary:
> > - No apparent hardware issue.
> > - System runs stable as long as you dont for (;;) gettimeofday();
> > - notsc being evaluated. I will get back to you later.
> > Does not resolve the odd test software crash, though.

> This all sounds very much like the TSCs are drifting WRT each other.
> Is it possible that you have some power management code (or hardware)
> that is slowing one cpu and not the other?

Well, I still don't really know what TSCs actually are (or what TSC
stands for).

The only suspect in that case would be the amd76x_pm.o kernel module,
which I am admittedly using. It saves about 90 watts of power when the
machine is idle...

I'll check what happens when the system boots without amd76x_pm.
Will report back tomorrow.

Thanks to all for keeping the suggestions going!

Regards,
F. Behrens

2003-03-25 22:49:10

by Alan

Subject: Re: System time warping around real time problem - please help

On Tue, 2003-03-25 at 22:55, Fionn Behrens wrote:
> > This all sounds very much like the TSCs are drifting WRT each other.
> > Is it possible that you have some power management code (or hardware)
> > that is slowing one cpu and not the other?
>
> Well, I still don't really know what TSCs actually are (or what TSC
> stands for).
>
> The only suspect in that case would be the amd76x_pm.o kernel module
> which I am admittedly using. It saves about 90Watts of power when the
> machine is idle...

If you are using amd76x_pm, boot with "notsc"; ditto, for that matter,
on dual Athlons with APM or ACPI in some cases. In fact I wish people
would stop using the TSC for clock timing altogether. It simply doesn't
work on a lot of modern systems.


2003-03-26 02:18:03

by George Anzinger

Subject: Re: System time warping around real time problem - please help

Alan Cox wrote:
> On Tue, 2003-03-25 at 22:55, Fionn Behrens wrote:
>
>>>This all sounds very much like the TSCs are drifting WRT each other.
>>>Is it possible that you have some power management code (or hardware)
>>>that is slowing one cpu and not the other?
>>
>>Well, I still don't really know what TSCs actually are (or what TSC
>>stands for).

Stands for Time Stamp Counter. It is a special CPU register that
basically counts CPU cycles. Sometimes (incorrectly, methinks) it is
affected by power management code which slows the CPU by changing the
CPU frequency.
>>
>>The only suspect in that case would be the amd76x_pm.o kernel module
>>which I am admittedly using. It saves about 90Watts of power when the
>>machine is idle...
>
>
> If you are using amd76x_pm boot with "notsc", ditto for that matter
> on dual athlons with APM or ACPI in some cases. In fact I wish people
> would stop using the tsc for clock timing altogether. It simply doesn't
> work on a lot of modern systems
>
I agree, however, what is really needed is not available in x86
machines, i.e. a cpu register that has a fixed and stable count rate.
An I/O register is second best because of the long time it takes to
read it.
>
>

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2003-03-26 03:01:31

by Chris Friesen

Subject: Re: System time warping around real time problem - please help

Alan Cox wrote:

> If you are using amd76x_pm boot with "notsc", ditto for that matter
> on dual athlons with APM or ACPI in some cases. In fact I wish people
> would stop using the tsc for clock timing altogether. It simply doesn't
> work on a lot of modern systems

But it's awfully nice for low-impact high-resolution timestamps.

Maybe someday hardware manufacturers will give us a monotonic GHz+ clock that is
synced across all cpus and is cheap to read...

Chris


--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]

2003-03-26 09:20:37

by Kay Diederichs

Subject: Re: System time warping around real time problem - please help

Fionn,

I had similar problems, and reported them on this list on 12/04/2002.
The cause is the amd76x_pm module, which makes the TSCs of the CPUs
become unsynchronized.

One way around this is to disable TSC altogether; this in my case
required installing a glibc compiled for i386 (instead of i686) which
slows some things down, and to use the 'notsc' boot option.

However, programs that use rdtsc (in my case Intel's Fortran Compiler,
ifc) then fail. As I develop programs using ifc, I therefore have not
been able to use amd76x_pm - I wish there were a better solution, and
wonder why this is not a problem with e.g. dual-processor Xeons.

Kay



Fionn Behrens wrote:
> Hello all,
>
>
> I have got an increasingly annoying problem with our fairly new (fall
> '02) Dual Athlon2k+ Gigabyte 7dpxdw linux system running 2.4.20.
> The only kernel patch applied is Alan Cox's ptrace patch.
>
> To say it right away: the system is not overclocked or anything and
> never was. It has decent cooling and is used as a combined workstation
> and server.
>
> I cant say exactly when it started but the system clock tends to begin
> jumping around real time in an erratic manner, usually after about 12-48
> hours of uptime. The maximum time jump is about 5 seconds back or forth
> so the time is always "about" right.
> To give you an example to visualize, you can watch asclock in X and see
> the second clock-hand jumping like 3 seconds backwards, then 5 seconds
> forth, 2 back and 1 forth or so within 2 or 3 seconds.
> For a demonstration I wrote the following short example in python:
>
> t = 0
> while 1:
> n = time()
> if t > n: print t, ">", n
> t = n
>
> Running this loop returned the following lines:
>
> 1048608745.61 > 1048608745.60
> 1048608745.63 > 1048608745.62
> 1048608745.65 > 1048608745.64
> 1048608748.23 > 1048608745.67
> 1048608748.27 > 1048608745.71
> 1048608748.30 > 1048608745.74
> 1048608748.34 > 1048608745.78
> 1048608748.42 > 1048608745.86
> 1048608748.47 > 1048608745.91
> 1048608748.52 > 1048608745.96
> [----cut----]
>
> So you see the time() on this system is constantly overtaking itself and
> jumping back. It almost looks like two parallel time()s are there and it
> switches back and forth between them.
>
> I recompiled the kernel, I upgraded the BIOS to the latest version
> available, I disabled ntp and tried some more I dont recall yet - no
> success. Due to the erratic timer, working on the machine is no fun.
> Software crashes are regularly - naturally. No programmer expects system
> timers going back in time.
>
> I am pretty desperate and I'd appreciate any hints on what to check.
> I'll glady present any system detail that you might miss for a proper
> analysis on request per email or on freenode (Fionn).
>
> Thank you in advance,
> F. Behrens (Not a subscriber of this list)

--
Kay Diederichs http://strucbio.biologie.uni-konstanz.de/~kay
email: Kay.Diederichs @ uni-konstanz.de Tel +49 7531 88 4049 Fax 3183
When replying to my email, please remove the blanks before and after the
"@" !
Fakultaet fuer Biologie, Universitaet Konstanz, Box M656, D-78457 Konstanz

2003-03-26 10:37:15

by Fionn Behrens

Subject: Re: System time warping around real time problem - please help

On Wed, 2003-03-26 at 01:13, Alan Cox wrote:
> On Tue, 2003-03-25 at 22:55, Fionn Behrens wrote:
> > > This all sounds very much like the TSCs are drifting WRT each other.
> > > Is it possible that you have some power management code (or hardware)
> > > that is slowing one cpu and not the other?
> >
> > The only suspect in that case would be the amd76x_pm.o kernel module
> > which I am admittedly using. It saves about 90Watts of power when the
> > machine is idle...
>
> If you are using amd76x_pm boot with "notsc", ditto for that matter
> on dual athlons with APM or ACPI in some cases.

I booted without amd76x_pm today and the problems are gone. I tried
notsc yesterday and dmesg said TSC had been deactivated on both CPUs. No
libc6 problems - Debian uses the i386 version by default.
Oddly enough, the system still crashed on those two for (;;) time(); test
loops posted earlier in this thread. So the only (unsatisfying) solution
I see for now is to keep the CPUs glowing hot for the sake of stability.

Any idea what else could cause the crashes in the absence of TSC usage?

As a yet unresolved side note I am still unable to execute the first
test program with my default user (immediately exits with retval 1).
Being run as root or as the system test user, the program runs as
expected (including crash with amd76x_pm). ldd shows no difference. Same
shell being used.

With kind regards,
F. Behrens

2003-03-26 13:11:50

by Alan

Subject: Re: System time warping around real time problem - please help

On Wed, 2003-03-26 at 03:11, Chris Friesen wrote:
> But its awfully nice for low-impact high-resolution timestamps.
>
> Maybe someday hardware manufacturers will give us a monotonic GHz+ clock that is
> synced across all cpus and is cheap to read...

x86-64 has HPET

2003-03-26 13:13:21

by Alan

Subject: Re: System time warping around real time problem - please help

On Wed, 2003-03-26 at 02:28, george anzinger wrote:
> Stands for Time Stamp Counter. It is a special cpu register that
> basically counts cpu cycles. Some times (incorrectly me thinks) it is
> affected by power management code which slows the cpu by changing the
> cpu frequency.

Not incorrectly. It counts CPU clocks; it's designed for profiling and
the like. There is no guarantee in any Intel MP standard that the clocks
are synched up.

2003-03-26 16:02:18

by George Anzinger

Subject: Re: System time warping around real time problem - please help

Alan Cox wrote:
> On Wed, 2003-03-26 at 02:28, george anzinger wrote:
>
>>Stands for Time Stamp Counter. It is a special cpu register that
>>basically counts cpu cycles. Some times (incorrectly me thinks) it is
>>affected by power management code which slows the cpu by changing the
>>cpu frequency.
>
>
> Not incorrectly. It counts cpu clocks, its designed for profiling and
> the like. There is no guarantee in any Intel MP standard that the clocks
> are synched up.


>
I seem to recall a different notion of correctness from Andy Grover...
but memory may deceive :(


As for sync, I would think it is a motherboard issue.

But as you say, Intel should put in a usable counter. The HPET seems
like it has the capabilities, however, I suspect that it is a slow
read. Any idea how many cycles it takes to do a memory mapped I/O access?
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2003-03-26 16:53:40

by Richard B. Johnson

Subject: Re: System time warping around real time problem - please help

On Wed, 26 Mar 2003, george anzinger wrote:

> Alan Cox wrote:
> > On Wed, 2003-03-26 at 02:28, george anzinger wrote:
> >
> >>Stands for Time Stamp Counter. It is a special cpu register that
> >>basically counts cpu cycles. Some times (incorrectly me thinks) it is
> >>affected by power management code which slows the cpu by changing the
> >>cpu frequency.
> >
> >
> > Not incorrectly. It counts cpu clocks, its designed for profiling and
> > the like. There is no guarantee in any Intel MP standard that the clocks
> > are synched up.
>
>
> >
> I seem to recall a different notion of correctness from Andy Grover...
> but memory may deceive :(
>
>
> As for sync, I would think it is a mother board issue.
>
> But as you say, Intel should put in a usable counter. The HPET seems
> like it has the capabilities, however, I suspect that it is a slow
> read. Any idea how many cycles it takes to do a memory mapped I/O access?
> --

It depends how the read is made. A direct read of non-cached RAM
like this:

movl (MY_IODEV), %eax

... takes 4 CPU clocks if the device doesn't insert any wait-states.
With fast CPUs, such is impossible. You are I/O bound by the front-side
bus speed. A good guess of the time to read is 4/(bus MHz) microseconds,
because there are 4 bus-cycles for a read or write.
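
The estimate above can be worked through numerically. This is only a
sketch of the arithmetic: read_time_ns is a hypothetical helper, and the
133 MHz bus clock mentioned below is an assumed example value, not a
figure given in this thread.

```c
#include <assert.h>

/* Estimated duration of one uncached read, in nanoseconds.
 * bus_cycles: bus cycles per access (4 in the estimate above).
 * bus_mhz:    front-side bus clock in MHz. */
static double read_time_ns(double bus_cycles, double bus_mhz)
{
    assert(bus_mhz > 0.0);
    /* cycles / MHz gives microseconds; scale to nanoseconds */
    return bus_cycles / bus_mhz * 1000.0;
}
```

With the assumed 133 MHz bus, read_time_ns(4.0, 133.0) comes to roughly
30 ns per read.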

If the 'C' compiler decides to do indexed addressing, where the
address gets calculated, the read times are greater. For instance,

movl (%ebx), %eax

... takes 8 clocks even ignoring the fact that the virtual address
needs to be put into register ebx. If the address is on certain
boundaries (not necessarily the same for different CPUs), the
reads can be slower.

Slow reads don't really hurt. In fact, they make sure that subsequent
reads will always return positive time. It's just a bias in the
time that affects everybody the same way. What hurts is trying to
synchronize to some external clock. In my opinion, this is not
the correct way to get the time. `rdtsc` returns a long long
in two registers. This should be saved as "reference time"
every time the system clock is set. Setting the system clock
means saving (only) the time_t object, in seconds, at the
time one saves the rdtsc time. This time_t object is never
changed otherwise. The PIT only generates interrupts. It is
not used for time. When somebody needs the time, it is calculated
from the present `rdtsc`, the saved long long value, and the
time_t time at which that value was saved. This guarantees
that all time is positive and no CPU cycles are wasted trying
to read anything.

The number of CPU cycles per second is calculated once upon
startup, just as it is now. If you shut down, or slow the
CPU for power-saving, you just recalculate the CPU cycles
and reset the time from CMOS. Any time when the machine is
in 'slow' mode is still correct.

time_t set_time;
long long cpu_cycles_sec;
long long rd_tsc_at_set_time;

To read time:

current_time = ((rdtsc() - rd_tsc_at_set_time)/ cpu_cycles_sec) +
set_time;

To set time:

set_time = get_time_from_CMOS();
rd_tsc_at_set_time = rdtsc();

... That's all you need.
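
The scheme above can be written out as a small, testable sketch. The
assumptions are labelled: struct tsc_clock, tsc_clock_set and
tsc_clock_read are hypothetical names, and the counter value is passed
in as a parameter in place of a real rdtsc() so the code does not depend
on x86. It assumes a single counter running at a constant rate.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Reference point captured whenever the system clock is set. */
struct tsc_clock {
    time_t   set_time;         /* wall-clock seconds at the reference point */
    uint64_t cycles_per_sec;   /* calibrated counter rate */
    uint64_t tsc_at_set_time;  /* counter value at the reference point */
};

/* "To set time": store the wall-clock time together with the counter. */
static void tsc_clock_set(struct tsc_clock *c, time_t now,
                          uint64_t tsc_now, uint64_t cycles_per_sec)
{
    assert(cycles_per_sec > 0);
    c->set_time = now;
    c->cycles_per_sec = cycles_per_sec;
    c->tsc_at_set_time = tsc_now;
}

/* "To read time": elapsed cycles divided by the rate, plus the reference.
 * Integer division means the result has whole-second resolution. */
static time_t tsc_clock_read(const struct tsc_clock *c, uint64_t tsc_now)
{
    uint64_t elapsed = tsc_now - c->tsc_at_set_time;

    return c->set_time + (time_t)(elapsed / c->cycles_per_sec);
}
```

As long as the counter value passed in never decreases, the result never
decreases either. If readings come from CPUs whose counters disagree,
that premise no longer holds.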


Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.

2003-03-26 18:02:05

by George Anzinger

Subject: Re: System time warping around real time problem - please help

Richard B. Johnson wrote:
> On Wed, 26 Mar 2003, george anzinger wrote:
>
>
>>Alan Cox wrote:
>>
>>>On Wed, 2003-03-26 at 02:28, george anzinger wrote:
>>>
>>>
>>>>Stands for Time Stamp Counter. It is a special cpu register that
>>>>basically counts cpu cycles. Sometimes (incorrectly, methinks) it is
>>>>affected by power management code which slows the cpu by changing the
>>>>cpu frequency.
>>>
>>>
>>>Not incorrectly. It counts cpu clocks; it's designed for profiling and
>>>the like. There is no guarantee in any Intel MP standard that the clocks
>>>are synched up.
>>
>>
>>I seem to recall a different notion of correctness from Andy Grover...
>> but memory may deceive :(
>>
>>
>>As for sync, I would think it is a mother board issue.
>>
>>But as you say, Intel should put in a usable counter. The HPET seems
>>like it has the capabilities, however, I suspect that it is a slow
>>read. Any idea how many cycles it takes to do a memory mapped I/O access?
>>--
>
>
> It depends how the read is made. A direct read of non-cached RAM
> like this:
>
> movl (MY_IODEV), %eax
>
> ... takes 4 CPU clocks if the device doesn't insert any wait-states.
> With fast CPUs, that is not possible: you are I/O bound by the front-side
> bus speed. A good guess of the time to read is 4/(bus MHz) microseconds,
> because there are 4 bus-cycles for a read or write.
>
> If the 'C' compiler decides to do indexed addressing, where the
> address gets calculated, the read times are greater. For instance,
>
> movl (%ebx), %eax
>
> ... takes 8 clocks even ignoring the fact that the virtual address
> needs to be put into register ebx. If the address is on certain
> boundaries (not necessarily the same for different CPUs), the
> reads can be slower.
>
> Slow reads don't really hurt.

gettimeofday() is small enough that a couple of extra cycles DO show
up. And, as cpus get faster and faster relative to the front-side bus,
this can easily be the majority of the time it takes to do gettimeofday().

> In fact, they make sure that subsequent
> reads will always return positive time. It's just a bias in the
> time that affects everybody the same way. What hurts is trying to
> synchronize to some external clock. In my opinion, this is not
> the correct way to get the time. `rdtsc` returns a long long
> in two registers. This should be saved as "reference time"
> every time the system clock is set. Setting the system clock
> means saving (only) the time_t object, in seconds, at the
> time one saves the rdtsc time. This time_t object is never
> changed otherwise. The PIT only generates interrupts. It is
> not used for time. When somebody needs the time, it is calculated
> from the present `rdtsc`, the saved long long value, and the
> time_t time at which that value was saved. This guarantees
> that all time is positive and no CPU cycles are wasted trying
> to read anything.
>
> The number of CPU cycles per second is calculated once upon
> startup, just as it is now. If you shut down, or slow the
> CPU for power-saving, you just recalculate the CPU cycles
> and reset the time from CMOS. Any time read while the machine
> is in 'slow' mode is still correct.

I did something close to this in the high-res-timers patch. There are
several problems WRT TSC:

First, and the reason this thread started, there are SMP boxes with
power-management code that causes the TSC to run at different speeds on
the various cpus.

Second, I am led to believe that there are boxes that adjust the cpu
speed without telling software about it.

Third, there are boxes that just don't have TSCs, (yeah, I know they
are old...).

Because of all of this, I have a configure option to use the ACPI pm
counter. The downside of this is a) it is slow (an I/O instruction)
and b) the resolution is much coarser than that of the TSC.
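
To put a number on the resolution point: the ACPI pm timer counts at
3.579545 MHz, so one tick is roughly 279 ns, against well under a
nanosecond for the TSC on a GHz-class cpu. The sketch below is an
illustration and is not taken from the high-res-timers patch. It assumes
the 24-bit variant of the counter (a 32-bit variant also exists), and the
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PMTMR_HZ   3579545u      /* ACPI pm timer frequency in Hz */
#define PMTMR_MASK 0x00FFFFFFu   /* assumption: 24-bit counter */

/* Ticks elapsed between two raw counter reads, allowing for one wrap. */
uint32_t pmtmr_delta(uint32_t earlier, uint32_t later)
{
    return (later - earlier) & PMTMR_MASK;
}

/* Convert a tick count to nanoseconds, rounding down. */
uint64_t pmtmr_ticks_to_ns(uint32_t ticks)
{
    assert(ticks <= PMTMR_MASK);  /* a delta never exceeds the counter width */
    return (uint64_t)ticks * 1000000000ULL / PMTMR_HZ;
}
```

A 24-bit counter at this rate wraps about every 4.7 seconds, so it has to
be sampled more often than that for the delta to be unambiguous.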
>
> time_t set_time;
> long long cpu_cycles_sec;
> long long rd_tsc_at_set_time;
>
> To read time:
>
> current_time = ((rdtsc() - rd_tsc_at_set_time) / cpu_cycles_sec) + set_time;
>
> To set time:
>
> set_time = get_time_from_CMOS();
> rd_tsc_at_set_time = rdtsc();
>
> ... That's all you need.

In my code, since the time needs to be "fresh" each tick, I keep a
counter up to date each tick. This allows me to speed up the code by
only reading the low half of the TSC, and at the same time, avoid some
64-bit math.
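
A rough sketch of that idea, as an illustration and not the actual
high-res-timers code: the tick handler refreshes a base time together with
the low 32 bits of the TSC, and a reader adds only the cycles elapsed since
that tick. All names are hypothetical, and the TSC samples are passed in as
plain arguments so the arithmetic can be followed on its own.

```c
#include <assert.h>
#include <stdint.h>

struct tick_base {
    uint32_t usec_at_tick;     /* microseconds accumulated up to the last tick */
    uint32_t tsc_low_at_tick;  /* low 32 bits of the TSC at the last tick */
    uint32_t cycles_per_usec;  /* calibrated once at startup */
};

/* Called from the timer interrupt: move the base forward by one tick. */
void on_tick(struct tick_base *b, uint32_t tsc_low, uint32_t usec_per_tick)
{
    b->usec_at_tick += usec_per_tick;
    b->tsc_low_at_tick = tsc_low;
}

/* Called by a reader: base time plus the cycles since the last tick. */
uint32_t read_usec(const struct tick_base *b, uint32_t tsc_low)
{
    assert(b->cycles_per_usec != 0);
    /* 32-bit unsigned subtraction is correct across a wrap of the low
     * half, provided fewer than 2^32 cycles pass between ticks. */
    uint32_t delta = tsc_low - b->tsc_low_at_tick;
    return b->usec_at_tick + delta / b->cycles_per_usec;
}
```

At 2 GHz the low half wraps about every 2.1 seconds, far longer than a
tick, which is why reading only 32 bits is enough here.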

The other thing this way of doing things needs is code to discipline
the interrupt source (the PIT in this case) so that the interrupts
come "reasonably" close to the 1/Hz boundary. Without this the timers
have too much jitter.



--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml