2002-11-12 22:07:49

by Paul Larson

Subject: LTP - gettimeofday02 FAIL

I've been getting a somewhat random error in a few of the recent 2.5
kernels with SMP machines. I noticed this on a 2.5.47 bk pull, but I
was also able to reproduce it on 2.5.46. I haven't tried any earlier
kernels yet. The LTP gettimeofday02 test sometimes fails with this
message:
gettimeofday02 0 INFO : checking if gettimeofday is monotonous, takes 30s
gettimeofday02 1 FAIL : Time is going backwards (old 1037138184.846333 vs new 1037138184.843346!

I have not been able to reproduce this on a single processor machine
though.

Basically, all the test does is:
    gettimeofday(&tv1, NULL);
    while (!done) {
        gettimeofday(&tv2, NULL);
        FAIL if tv2 < tv1
        tv1 = tv2;
    }
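
For reference, the loop above can be written out in C roughly as follows. This is only a sketch of the idea, not the actual LTP gettimeofday02 source; the helper names tv_before and check_monotonic are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

/* Return nonzero if a is strictly earlier than b. */
int tv_before(const struct timeval *a, const struct timeval *b)
{
    if (a->tv_sec != b->tv_sec)
        return a->tv_sec < b->tv_sec;
    return a->tv_usec < b->tv_usec;
}

/*
 * Poll gettimeofday() for roughly `seconds` seconds.
 * Returns 0 if time never stepped backwards, 1 on the first backward step.
 */
int check_monotonic(int seconds)
{
    struct timeval tv1, tv2;
    time_t end;

    assert(seconds > 0);
    gettimeofday(&tv1, NULL);
    end = tv1.tv_sec + seconds;

    do {
        gettimeofday(&tv2, NULL);
        if (tv_before(&tv2, &tv1)) {
            printf("FAIL: time went backwards (old %ld.%06ld vs new %ld.%06ld)\n",
                   (long)tv1.tv_sec, (long)tv1.tv_usec,
                   (long)tv2.tv_sec, (long)tv2.tv_usec);
            return 1;
        }
        tv1 = tv2;
    } while (tv2.tv_sec < end);

    return 0;
}
```

Calling check_monotonic(30) from main() would correspond to the 30 second run in the test output above. The comparison has to look at tv_sec first and tv_usec second, since tv_usec alone wraps every second.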

Any ideas on what could be causing this?

Thanks,
Paul Larson



2002-11-14 21:45:16

by Chris Wedgwood

Subject: Re: LTP - gettimeofday02 FAIL

On Tue, Nov 12, 2002 at 04:11:14PM -0600, Paul Larson wrote:

> I have not been able to reproduce this on a single processor machine
> though.
>
> Basically, all the test does is:
>     gettimeofday(&tv1, NULL);
>     while (!done) {
>         gettimeofday(&tv2, NULL);
>         FAIL if tv2 < tv1
>         tv1 = tv2;
>     }
>
> Any ideas on what could be causing this?


The TSCs aren't synchronized between CPUs.

This is becoming more and more of a problem, and is inescapable on some
hardware, so I'm starting to wonder if assuming the TSCs are even
roughly synchronized *anywhere* is a good idea.


--cw

2002-11-18 17:31:47

by Paul Larson

Subject: Re: LTP - gettimeofday02 FAIL

On Thu, 2002-11-14 at 15:52, Chris Wedgwood wrote:
> The TSCs aren't synchronized between CPUs.
>
> This is becoming more and more of a problem, and is inescapable on some
> hardware, so I'm starting to wonder if assuming the TSCs are even
> roughly synchronized *anywhere* is a good idea.

So this is a hardware issue? No way around this?

-Paul Larson



2002-11-18 21:56:04

by Chris Wedgwood

Subject: Re: LTP - gettimeofday02 FAIL

On Mon, Nov 18, 2002 at 11:34:04AM -0600, Paul Larson wrote:

> So this is a hardware issue? No way around this?

For some people it is a hardware issue. The way around it is to not use
the TSC for certain things.


--cw

2002-11-19 01:21:05

by Jim Houston

Subject: Re: LTP - gettimeofday02 FAIL


Hi Everyone,

I just tried gettimeofday02 on an old Pentium Pro dual processor, and yes
the time goes backwards with a 2.5.48 kernel.

I believe that this is the result of lost ticks. It has gotten much
easier to lose a tick since HZ was changed to 1000. When the timer
interrupt is delayed, the other processors will continue to keep reasonable
time (based on the TSC), but when the timer interrupt eventually happens,
it will add one tick's worth of nanoseconds to xtime.tv_nsec and set
last_tsc_low to the current tsc value. The other processors now base
their time on this new last_tsc_low and will see time go backwards.

I accidentally configured in the ACPI power management code and was
disappointed to find that it routinely caused a 9 millisecond interrupt
lock-out (on my 1GHz Athlon). With the old 100 Hz clock, this delay would
be detected by reading the PIT timer. With 1000 Hz, the timer reloads
several times during the lock-out, and all we see is a fraction of a tick.
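
The lost-tick mechanism described above can be played through with a toy model. The sketch below is not kernel code: the struct, its field names and the functions are invented for illustration, the TSC is assumed to be already scaled to nanoseconds, and the PIT correction is reduced to "the delay modulo one tick", on the assumption that the PIT counter reloads once per tick.

```c
#include <assert.h>
#include <stdint.h>

/* Toy clock state; field names are illustrative, not the kernel's. */
struct toy_clock {
    uint64_t tick_ns;       /* length of one timer tick */
    uint64_t xtime_ns;      /* time credited by timer interrupts so far */
    uint64_t last_tsc_ns;   /* TSC (in ns) sampled by the last timer interrupt */
    uint64_t pit_delay_ns;  /* interrupt latency still visible in the PIT counter */
};

/* Timer interrupt: credits exactly one tick, however late it runs. */
void toy_timer_interrupt(struct toy_clock *c, uint64_t tsc_ns, uint64_t late_ns)
{
    c->xtime_ns += c->tick_ns;
    c->last_tsc_ns = tsc_ns;
    /* The PIT reloads every tick, so only the remainder of the delay shows. */
    c->pit_delay_ns = late_ns % c->tick_ns;
}

/* Time as any CPU computes it: credited ticks plus the offset since the last one. */
uint64_t toy_read_time(const struct toy_clock *c, uint64_t tsc_ns)
{
    assert(tsc_ns >= c->last_tsc_ns);
    return c->xtime_ns + c->pit_delay_ns + (tsc_ns - c->last_tsc_ns);
}

/*
 * Delay one timer interrupt by late_ns and return how far (in ns) the clock
 * steps backwards across it; 0 means time stayed monotonic.
 */
uint64_t toy_backward_step(uint64_t tick_ns, uint64_t late_ns)
{
    struct toy_clock c = { tick_ns, 0, 0, 0 };
    uint64_t handled_at = tick_ns + late_ns;  /* the interrupt was due at tick_ns */
    uint64_t before, after;

    assert(tick_ns > 0);
    before = toy_read_time(&c, handled_at);   /* last read before the interrupt */
    toy_timer_interrupt(&c, handled_at, late_ns);
    after = toy_read_time(&c, handled_at);    /* first read after it */

    return before > after ? before - after : 0;
}
```

In this model, toy_backward_step(1000000, 9000000) (HZ=1000, 9 ms lock-out) yields a 9 ms backward step, while toy_backward_step(10000000, 9000000) (HZ=100, same lock-out) yields 0, because the whole delay is still visible within a single PIT period.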

I'm interested in this because I'm working on my "alternative POSIX timers"
patch, which gets confused when time backs up.

Jim Houston - Concurrent Computer Corp.

2002-11-19 11:57:01

by Andi Kleen

Subject: Re: LTP - gettimeofday02 FAIL

Jim Houston writes:

> I believe that this is the result of lost ticks. It has gotten much
> easier to lose a tick since HZ was changed to 1000. When the timer
> interrupt is delayed, the other processors will continue to keep reasonable
> time (based on the TSC), but when the timer interrupt eventually happens,
> it will add one tick's worth of nanoseconds to xtime.tv_nsec and set
> last_tsc_low to the current tsc value. The other processors now base
> their time on this new last_tsc_low and will see time go backwards.

It could be detected by keeping a per-CPU last_tsc.

Best would be to use a global timer like the HPET, but it's not available
everywhere and it is much slower than rdtsc too.
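
One possible reading of the per-CPU last_tsc idea is sketched below. It is an assumption about what such detection might look like, not code from any kernel: each CPU remembers its own TSC reading (assumed already scaled to nanoseconds) together with the global tick count at that moment, and later compares the two deltas. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Per-CPU bookkeeping; the struct and its fields are hypothetical. */
struct toy_cpu_mark {
    uint64_t last_tsc_ns;  /* this CPU's TSC (scaled to ns) at its last check */
    uint64_t ticks_seen;   /* global tick count at that same moment */
};

/*
 * Compare how much time this CPU's own TSC says has passed with how many
 * ticks were credited in the meantime. Returns the number of ticks that
 * appear to have been lost, then moves the mark forward.
 */
uint64_t toy_lost_ticks(struct toy_cpu_mark *m, uint64_t tsc_ns,
                        uint64_t ticks_now, uint64_t tick_ns)
{
    uint64_t elapsed_ticks, credited;

    assert(tick_ns > 0);
    assert(tsc_ns >= m->last_tsc_ns && ticks_now >= m->ticks_seen);

    elapsed_ticks = (tsc_ns - m->last_tsc_ns) / tick_ns;
    credited = ticks_now - m->ticks_seen;

    m->last_tsc_ns = tsc_ns;
    m->ticks_seen = ticks_now;

    return elapsed_ticks > credited ? elapsed_ticks - credited : 0;
}
```

Because each CPU only compares its TSC against its own earlier reading, this kind of check does not depend on the TSCs being synchronized between CPUs.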

-Andi

2002-11-19 13:35:31

by Paul Larson

Subject: Re: [LTP] Re: LTP - gettimeofday02 FAIL

On Mon, 2002-11-18 at 19:27, Jim Houston wrote:
>
> Hi Everyone,
>
> I just tried gettimeofday02 on an old Pentium Pro dual processor, and yes
> the time goes backwards with a 2.5.48 kernel.

This has been noticed; I've posted to lkml about it. The only person
who replied to me seems to be suggesting it is a hardware issue, but I
can't believe it is impossible to work around.

-Paul Larson



2002-11-19 14:00:07

by Dave Jones

Subject: Re: [LTP] Re: LTP - gettimeofday02 FAIL

On Tue, Nov 19, 2002 at 07:37:23AM -0600, Paul Larson wrote:
> > I just tried gettimeofday02 on an old Pentium Pro dual processor, and yes
> > the time goes backwards with a 2.5.48 kernel.
> This has been noticed; I've posted to lkml about it. The only person
> who replied to me seems to be suggesting it is a hardware issue, but I
> can't believe it is impossible to work around.

Especially if earlier kernels got it right..

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-11-19 14:48:55

by Paul Larson

Subject: Re: [LTP] Re: LTP - gettimeofday02 FAIL

On Tue, 2002-11-19 at 08:02, Dave Jones wrote:
> On Tue, Nov 19, 2002 at 07:37:23AM -0600, Paul Larson wrote:
> > > I just tried gettimeofday02 on an old Pentium Pro dual processor, and yes
> > > the time goes backwards with a 2.5.48 kernel.
> > This has been noticed; I've posted to lkml about it. The only person
> > who replied to me seems to be suggesting it is a hardware issue, but I
> > can't believe it is impossible to work around.
>
> Especially if earlier kernels got it right..

This is bug #100 in bugme if anyone wants to track it.
http://bugme.osdl.org/show_bug.cgi?id=100

-Paul Larson

