2002-06-07 20:21:27

by ashieh

[permalink] [raw]
Subject: gettimeofday clock jump bug

Hi,

time() occasionally returns a bogus value (>1 hour jump forward, and a few microseconds later jumps back to the right time) on my box (Thunderbird 750, Asus K7V (KX133) kernel 2.4.17). This behavior sets in after the box is up for some period of time. I don't think this is related to the 686a configuration reset bug.

I suspect that somehow the either do_gettimeoffset() or xtime.tv_usec in do_gettimeofday is returning a ridiculously large value. I would like to get to the bottom of this problem, but am not familiar with this part of the timing infrastructure. Has anyone seen this bug before? Would using a different locking mode (SMP vs none SMP, or wahtever) possibly help this problem? Is there a document online describing how this works in Linux?

In the meantime, I want to hack up a patch to fix this on my box. I'm thinking I could give up a few seconds of clock precision in exchange for monotonic clock behavior, and so I want to comment out the adjustments to usec. What are the possible ramifications of this hack?

Alan

Original do_gettimeofday:

void do_gettimeofday(struct timeval *tv)
{
unsigned long flags;
unsigned long usec, sec;

read_lock_irqsave(&xtime_lock, flags);
usec = do_gettimeoffset();
{
unsigned long lost = jiffies - wall_jiffies;
if (lost)
usec += lost * (1000000 / HZ);
}
sec = xtime.tv_sec;
usec += xtime.tv_usec;
read_unlock_irqrestore(&xtime_lock, flags);

while (usec >= 1000000) {
usec -= 1000000;
sec++;
}

tv->tv_sec = sec;
tv->tv_usec = usec;
}

My proposed hack (for my system):

void do_gettimeofday(struct timeval *tv)
{
unsigned long flags;
unsigned long usec, sec;

read_lock_irqsave(&xtime_lock, flags);
usec = do_gettimeoffset();
/* {
unsigned long lost = jiffies - wall_jiffies;
if (lost)
usec += lost * (1000000 / HZ);
} */
sec = xtime.tv_sec;
usec = xtime.tv_usec;
read_unlock_irqrestore(&xtime_lock, flags);

while (usec >= 1000000) {
usec -= 1000000;
sec++;
}

tv->tv_sec = sec;
tv->tv_usec = usec;
}


2002-06-07 20:38:35

by Karim Yaghmour

[permalink] [raw]
Subject: Re: gettimeofday clock jump bug


Hello Alan,

While developing the Linux Trace Toolkit, I've seen this bug occur on
at least the i386 and the PPC. I had written down a description of
how time-keeping is done in the kernel in order to get to the bottom of
the problem. I'm not sure if this is still acurate, but it's a starting
point nevetheless:
http://www.embeddedlinuxworks.com/articles/kernel_time.html

Cheers,

Karim

[email protected] wrote:
>
> Hi,
>
> time() occasionally returns a bogus value (>1 hour jump forward, and a few microseconds later jumps back to the right time) on my box (Thunderbird 750, Asus K7V (KX133) kernel 2.4.17). This behavior sets in after the box is up for some period of time. I don't think this is related to the 686a configuration reset bug.
>
> I suspect that somehow the either do_gettimeoffset() or xtime.tv_usec in do_gettimeofday is returning a ridiculously large value. I would like to get to the bottom of this problem, but am not familiar with this part of the timing infrastructure. Has anyone seen this bug before? Would using a different locking mode (SMP vs none SMP, or wahtever) possibly help this problem? Is there a document online describing how this works in Linux?
>
> In the meantime, I want to hack up a patch to fix this on my box. I'm thinking I could give up a few seconds of clock precision in exchange for monotonic clock behavior, and so I want to comment out the adjustments to usec. What are the possible ramifications of this hack?
>
> Alan
>
> Original do_gettimeofday:
>
> void do_gettimeofday(struct timeval *tv)
> {
> unsigned long flags;
> unsigned long usec, sec;
>
> read_lock_irqsave(&xtime_lock, flags);
> usec = do_gettimeoffset();
> {
> unsigned long lost = jiffies - wall_jiffies;
> if (lost)
> usec += lost * (1000000 / HZ);
> }
> sec = xtime.tv_sec;
> usec += xtime.tv_usec;
> read_unlock_irqrestore(&xtime_lock, flags);
>
> while (usec >= 1000000) {
> usec -= 1000000;
> sec++;
> }
>
> tv->tv_sec = sec;
> tv->tv_usec = usec;
> }
>
> My proposed hack (for my system):
>
> void do_gettimeofday(struct timeval *tv)
> {
> unsigned long flags;
> unsigned long usec, sec;
>
> read_lock_irqsave(&xtime_lock, flags);
> usec = do_gettimeoffset();
> /* {
> unsigned long lost = jiffies - wall_jiffies;
> if (lost)
> usec += lost * (1000000 / HZ);
> } */
> sec = xtime.tv_sec;
> usec = xtime.tv_usec;
> read_unlock_irqrestore(&xtime_lock, flags);
>
> while (usec >= 1000000) {
> usec -= 1000000;
> sec++;
> }
>
> tv->tv_sec = sec;
> tv->tv_usec = usec;
> }
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-06-07 21:12:03

by George Anzinger

[permalink] [raw]
Subject: Re: gettimeofday clock jump bug

[email protected] wrote:
>
> Hi,
>
> time() occasionally returns a bogus value (>1 hour jump forward, and a few microseconds later jumps back to the right time) on my box (Thunderbird 750, Asus K7V (KX133) kernel 2.4.17). This behavior sets in after the box is up for some period of time. I don't think this is related to the 686a configuration reset bug.

I suspect that do_gettimeoffset() may be, on occasion,
returning a negative number. The normalizing code then
works with this (unsigned) value until it is < 1,000,000.
If it came back as -1, this would generate an error of about
1.19 hours. I suspect the best fix would be to test the
result from do_gettimeoffset() for something greater than
say 20ms and if so set it to 0.

-g
>
> I suspect that somehow the either do_gettimeoffset() or xtime.tv_usec in do_gettimeofday is returning a ridiculously large value. I would like to get to the bottom of this problem, but am not familiar with this part of the timing infrastructure. Has anyone seen this bug before? Would using a different locking mode (SMP vs none SMP, or wahtever) possibly help this problem? Is there a document online describing how this works in Linux?
>
> In the meantime, I want to hack up a patch to fix this on my box. I'm thinking I could give up a few seconds of clock precision in exchange for monotonic clock behavior, and so I want to comment out the adjustments to usec. What are the possible ramifications of this hack?
>
> Alan
>
> Original do_gettimeofday:
>
> void do_gettimeofday(struct timeval *tv)
> {
> unsigned long flags;
> unsigned long usec, sec;
>
> read_lock_irqsave(&xtime_lock, flags);
> usec = do_gettimeoffset();
> {
> unsigned long lost = jiffies - wall_jiffies;
> if (lost)
> usec += lost * (1000000 / HZ);
> }
> sec = xtime.tv_sec;
> usec += xtime.tv_usec;
> read_unlock_irqrestore(&xtime_lock, flags);
>
> while (usec >= 1000000) {
> usec -= 1000000;
> sec++;
> }
>
> tv->tv_sec = sec;
> tv->tv_usec = usec;
> }
>
> My proposed hack (for my system):
>
> void do_gettimeofday(struct timeval *tv)
> {
> unsigned long flags;
> unsigned long usec, sec;
>
> read_lock_irqsave(&xtime_lock, flags);
> usec = do_gettimeoffset();
> /* {
> unsigned long lost = jiffies - wall_jiffies;
> if (lost)
> usec += lost * (1000000 / HZ);
> } */
> sec = xtime.tv_sec;
> usec = xtime.tv_usec;
> read_unlock_irqrestore(&xtime_lock, flags);
>
> while (usec >= 1000000) {
> usec -= 1000000;
> sec++;
> }
>
> tv->tv_sec = sec;
> tv->tv_usec = usec;
> }
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-06-07 23:52:13

by Ruth Ivimey-Cook

[permalink] [raw]
Subject: Re: gettimeofday clock jump bug


[email protected] wrote:
> time() occasionally returns a
>bogus value (>1 hour jump forward, and a few microseconds later jumps back to
>the right time) on my box (Thunderbird 750, Asus K7V (KX133) kernel 2.4.17).
>This behavior sets in after the box is up for some period of time. I don't
>think this is related to the 686a configuration reset bug.
>
On Fri, 7 Jun 2002, george anzinger wrote:
>I suspect that do_gettimeoffset() may be, on occasion,
>returning a negative number. The normalizing code then
>works with this (unsigned) value until it is < 1,000,000.
>If it came back as -1, this would generate an error of about
>1.19 hours. I suspect the best fix would be to test the
>result from do_gettimeoffset() for something greater than
>say 20ms and if so set it to 0.

I've just looked at the i386 time.c source and can see no obvious way for -1
to be returned by do_gettimeoffset(). I note that this error is fixed in the
next time() call, so I would instead expect the error to be one involving the
conversion of tv_secs/tv_usecs into the seconds return from time().

One possible way to check this out would be to change the test program from
using the time() call to using gettimeofday(), and to ignore tv_usecs.

Hope this helps,

Ruth

--
Ruth Ivimey-Cook
Software engineer and technical writer.

2002-06-08 00:33:24

by Tomas Vanderka

[permalink] [raw]
Subject: Re: gettimeofday clock jump bug

Hi,
I've got Abit KT7 and AMD Athlon system, and i have seen this happen while
running 2.4.18 (with 2.4.16 and later 2.4.19-pre8/9 it didn't happen)
There has been some discussion about it here
http://marc.theaimsgroup.com/?t=92577204900001&r=1&w=2

VanTo

2002-06-08 03:37:48

by George Anzinger

[permalink] [raw]
Subject: Re: gettimeofday clock jump bug

Ruth Ivimey-Cook wrote:
>
> [email protected] wrote:
> > time() occasionally returns a
> >bogus value (>1 hour jump forward, and a few microseconds later jumps back to
> >the right time) on my box (Thunderbird 750, Asus K7V (KX133) kernel 2.4.17).
> >This behavior sets in after the box is up for some period of time. I don't
> >think this is related to the 686a configuration reset bug.
> >
> On Fri, 7 Jun 2002, george anzinger wrote:
> >I suspect that do_gettimeoffset() may be, on occasion,
> >returning a negative number. The normalizing code then
> >works with this (unsigned) value until it is < 1,000,000.
> >If it came back as -1, this would generate an error of about
> >1.19 hours. I suspect the best fix would be to test the
> >result from do_gettimeoffset() for something greater than
> >say 20ms and if so set it to 0.
>
> I've just looked at the i386 time.c source and can see no obvious way for -1
> to be returned by do_gettimeoffset().

It can happen two ways (and I am not saying it returns -1,
but some large number ~ 1hr worth of usecs). The first
possibility is an overflow in the conversion of tsc ticks to
usec. This is more likely a problem if the processor is
running at high speeds, AND interrupts have been held off
for a while. The second possibility is that either the
latch read failed to latch or the PIT is actually not
resetting at count = 0. This is, I think, the VIA chip
bug. If there is no "correction" code for this bug, the net
result will be a slow and erratic clock. If there is
correction software in place, it could result in the
observed time jump by providing an invalid count which is
then used by do_gettimeoffset(). The reason the fault
clears is a.) on the next tick a valid count will be
obtained, and b) the value from do_gettimeoffset() is never
rolled into the wall clock.

> I note that this error is fixed in the
> next time() call, so I would instead expect the error to be one involving the
> conversion of tv_secs/tv_usecs into the seconds return from time().
>
> One possible way to check this out would be to change the test program from
> using the time() call to using gettimeofday(), and to ignore tv_usecs.

And just where does time() get the time of day if not from
gettimeofday()?
>
> Hope this helps,
>
> Ruth
>
> --
> Ruth Ivimey-Cook
> Software engineer and technical writer.

--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-08-02 07:20:42

by Hartwig. Thomas

[permalink] [raw]
Subject: Re: gettimeofday clock jump bug

I was referenced to a gettimeofday problem spoken of in the kernel
discussion summary:
http://kt.zork.net/kernel-traffic/kt20020708_174.html#1

There is spoken of .01% reproducibility and less of this problem.

I got to this problem here running a version of GNU wget (1.8.2) and
following kernel: Linux version 2.4.18 (gcc version 2.96 20000731 (Red
Hat Linux 7.3 2.96-110)) #4 Sun Jul 28 09:01:06 CEST 2002

In this configuration I get the error even more times. In 2300 calls I
get 319 failures, this is something about 0.07% not too much, but
significant.

However it is not my skill to analyze nor get to deep in wget and the
kernel. It's just a note and a offer of some further tests if you need.

Greetings
Thomas

PS: I'm sorry this message is out of the thread index. I got to late on
the list.