2003-08-04 16:35:14

by Patrick Moor

Subject: time jumps (again)

Hi

Some days ago I started noticing strange time jumps on my Athlon system.
(Asus board, VIA chipset, AMD Athlon 650MHz processor). I haven't
noticed them before and I am pretty sure there weren't any for the last
few years! Uptime of the machine is now 218 days, and problems began
appearing after 215 days approximately.

What happens: when doing a
$ while true; do date; done
I'm noticing time jumps _exactly_ at the beginning of a "new" second (or
at the end of an "old" one). the jump is exactly 4294 (4295) seconds
into the future. Example:
...
Mon Aug 4 18:11:06 CEST 2003
Mon Aug 4 18:11:06 CEST 2003
Mon Aug 4 19:22:41 CEST 2003
Mon Aug 4 19:22:41 CEST 2003
Mon Aug 4 19:22:41 CEST 2003
Mon Aug 4 18:11:07 CEST 2003
Mon Aug 4 18:11:07 CEST 2003
...
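
The same check can be sketched in C, which compares consecutive
gettimeofday() readings at microsecond resolution instead of eyeballing
date output. This is only a minimal sketch, not code from this thread:
the helper names tv_delta_usec, is_time_jump and count_time_jumps are
made up for illustration, and gettimeofday() is the only real call used.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Difference cur - prev in microseconds (negative if time went back). */
static int64_t tv_delta_usec(const struct timeval *prev,
			     const struct timeval *cur)
{
	return ((int64_t)cur->tv_sec - (int64_t)prev->tv_sec) * 1000000
		+ ((int64_t)cur->tv_usec - (int64_t)prev->tv_usec);
}

/* Nonzero if time ran backwards or jumped ahead by more than limit_usec. */
static int is_time_jump(const struct timeval *prev,
			const struct timeval *cur, int64_t limit_usec)
{
	int64_t delta = tv_delta_usec(prev, cur);

	return delta < 0 || delta > limit_usec;
}

/* Poll gettimeofday() n times and count suspicious steps between reads. */
static long count_time_jumps(long n, int64_t limit_usec)
{
	struct timeval prev, cur;
	long jumps = 0;

	gettimeofday(&prev, NULL);
	while (n-- > 0) {
		gettimeofday(&cur, NULL);
		if (is_time_jump(&prev, &cur, limit_usec))
			jumps++;
		prev = cur;
	}
	return jumps;
}
```

Calling count_time_jumps(10000000, 1000000) would flag any step larger
than one second in either direction, which covers both the forward jump
and the return to the correct time.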

I've found some previous discussions about this about a year ago:

http://www.ussg.iu.edu/hypermail/linux/kernel/0203.3/0557.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0206.0/1505.html

What seems strange to me is that these jumps have never occurred before.
The machine is running a plain 2.4.20 kernel.

So my question is: will disabling the CONFIG_X86_TSC option and passing
"notsc" as boot parameter fix the problem? Or did I get something wrong
there?

thanks
patrick



2003-08-04 16:41:20

by Alan

Subject: Re: time jumps (again)

On Mon, 2003-08-04 at 17:35, Patrick Moor wrote:
> few years! Uptime of the machine is now 218 days, and problems began
> appearing after 215 days approximately.

Not sure why 215 days should be significant

> What happens: when doing a
> $ while true; do date; done
> I'm noticing time jumps _exactly_ at the beginning of a "new" second (or
> at the end of an "old" one). the jump is exactly 4294 (4295) seconds
> into the future. Example:

4294.. top of -1

Smells of some kind of sign propagation bug
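
For reference, the arithmetic behind the 4294: on i386 an unsigned long
is 32 bits, so a microsecond value of -1 stored in one reads back as
2^32 - 1 = 4294967295 usec, which is 4294 whole seconds plus 967295
usec. The sketch below only illustrates that arithmetic. It uses
uint32_t to stand in for the 32-bit unsigned long, and the helper names
are hypothetical.

```c
#include <stdint.h>

/* Whole seconds and leftover microseconds of a 32-bit usec count. */
struct split_usec {
	uint32_t sec;
	uint32_t usec;
};

/* uint32_t stands in for i386's 32-bit unsigned long. */
static struct split_usec split_usec32(uint32_t usec)
{
	struct split_usec r;

	r.sec = usec / 1000000u;
	r.usec = usec % 1000000u;
	return r;
}

/* A signed -1 usec as it looks once stored in an unsigned variable. */
static uint32_t minus_one_usec(void)
{
	return (uint32_t)-1;	/* 4294967295 */
}
```

Any small negative microsecond count lands in the same range, so the
seconds part is 4294 regardless of how far below zero the value went.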

2003-08-04 21:49:41

by Tim Schmielau

Subject: Re: time jumps (again)

> Some days ago I started noticing strange time jumps on my Athlon system.
> (Asus board, VIA chipset, AMD Athlon 650MHz processor). I haven't
> noticed them before and I am pretty sure there weren't any for the last
> few years! Uptime of the machine is now 218 days, and problems began
> appearing after 215 days approximately.
>
> What happens: when doing a
> $ while true; do date; done
> I'm noticing time jumps _exactly_ at the beginning of a "new" second (or
> at the end of an "old" one). the jump is exactly 4294 (4295) seconds
> into the future. Example:
> ...
> Mon Aug 4 18:11:06 CEST 2003
> Mon Aug 4 18:11:06 CEST 2003
> Mon Aug 4 19:22:41 CEST 2003
> Mon Aug 4 19:22:41 CEST 2003
> Mon Aug 4 19:22:41 CEST 2003
> Mon Aug 4 18:11:07 CEST 2003
> Mon Aug 4 18:11:07 CEST 2003
> ...
>

Wild guess - does the following patch fix it?

Tim


--- linux-2.4.20/arch/i386/kernel/time.c.orig Mon Aug 4 23:38:47 2003
+++ linux-2.4.20/arch/i386/kernel/time.c Mon Aug 4 23:40:53 2003
@@ -274,8 +274,8 @@
 	read_lock_irqsave(&xtime_lock, flags);
 	usec = do_gettimeoffset();
 	{
-		unsigned long lost = jiffies - wall_jiffies;
-		if (lost)
+		long lost = jiffies - wall_jiffies;
+		if (lost>0)
 			usec += lost * (1000000 / HZ);
 	}
 	sec = xtime.tv_sec;
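
What the patch changes can be tried out in user space. The following is
a sketch under assumptions, not kernel code: uint32_t and int32_t stand
in for the 32-bit unsigned long and long on i386, HZ is taken as 100,
and the function names are hypothetical.

```c
#include <stdint.h>

#define HZ 100	/* assumed tick rate (i386, 2.4 kernels) */

/* Old logic: the difference is unsigned, so wall_jiffies > jiffies
 * wraps around to a huge "lost" tick count. */
static uint32_t lost_usec_unsigned(uint32_t jiffies, uint32_t wall_jiffies)
{
	uint32_t lost = jiffies - wall_jiffies;

	return lost ? lost * (1000000 / HZ) : 0;
}

/* Patched logic: a signed difference, ignored unless it is positive. */
static uint32_t lost_usec_signed(uint32_t jiffies, uint32_t wall_jiffies)
{
	int32_t lost = (int32_t)(jiffies - wall_jiffies);

	return lost > 0 ? (uint32_t)lost * (1000000 / HZ) : 0;
}
```

With jiffies one tick behind wall_jiffies, the unsigned version yields
4294957296 usec (about 4294.96 seconds) and the signed version yields 0.
When jiffies is ahead, both return the same value.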

2003-08-04 22:38:55

by George Anzinger

Subject: Re: time jumps (again)

Tim Schmielau wrote:
>>Some days ago I started noticing strange time jumps on my Athlon system.
>>(Asus board, VIA chipset, AMD Athlon 650MHz processor). I haven't
>>noticed them before and I am pretty sure there weren't any for the last
>>few years! Uptime of the machine is now 218 days, and problems began
>>appearing after 215 days approximately.
>>
>>What happens: when doing a
>> $ while true; do date; done
>>I'm noticing time jumps _exactly_ at the beginning of a "new" second (or
>>at the end of an "old" one). the jump is exactly 4294 (4295) seconds
>>into the future. Example:
>>...
>>Mon Aug 4 18:11:06 CEST 2003
>>Mon Aug 4 18:11:06 CEST 2003
>>Mon Aug 4 19:22:41 CEST 2003
>>Mon Aug 4 19:22:41 CEST 2003
>>Mon Aug 4 19:22:41 CEST 2003
>>Mon Aug 4 18:11:07 CEST 2003
>>Mon Aug 4 18:11:07 CEST 2003
>>...
>>
>
>
> Wild guess - does the following patch fix it?

And your theory is that wall_jiffies > jiffies. How does this happen?
Both of these are only changed under the write_irq lock....

I would feel better with a patch that made jiffies volatile, but it
already is.

I agree that the jump implies overflow here, but just HOW is it happening?

Time for some diagnostic code...

> Tim
>
>
> --- linux-2.4.20/arch/i386/kernel/time.c.orig Mon Aug 4 23:38:47 2003
> +++ linux-2.4.20/arch/i386/kernel/time.c Mon Aug 4 23:40:53 2003
> @@ -274,8 +274,8 @@
>  	read_lock_irqsave(&xtime_lock, flags);
>  	usec = do_gettimeoffset();
>  	{
> -		unsigned long lost = jiffies - wall_jiffies;
> -		if (lost)
> +		long lost = jiffies - wall_jiffies;
> +		if (lost>0)
>  			usec += lost * (1000000 / HZ);
>  	}
>  	sec = xtime.tv_sec;

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2003-08-05 01:08:55

by Andries Brouwer

Subject: Re: time jumps (again)

> Tim Schmielau wrote:

> >>What happens: when doing a
> >> $ while true; do date; done
> >>I'm noticing time jumps _exactly_ at the beginning of a "new" second (or
> >>at the end of an "old" one). the jump is exactly 4294 (4295) seconds
> >>into the future. Example:
> >>...
> >>Mon Aug 4 18:11:06 CEST 2003
> >>Mon Aug 4 19:22:41 CEST 2003
> >>Mon Aug 4 18:11:07 CEST 2003
> >>...

> >--- linux-2.4.20/arch/i386/kernel/time.c.orig Mon Aug 4 23:38:47 2003
> >+++ linux-2.4.20/arch/i386/kernel/time.c Mon Aug 4 23:40:53 2003
> >@@ -274,8 +274,8 @@
> > 	read_lock_irqsave(&xtime_lock, flags);
> > 	usec = do_gettimeoffset();
> > 	{
> >-		unsigned long lost = jiffies - wall_jiffies;
> >-		if (lost)
> >+		long lost = jiffies - wall_jiffies;
> >+		if (lost>0)
> > 			usec += lost * (1000000 / HZ);
> > 	}
> > 	sec = xtime.tv_sec;

At first sight jiffies and wall_jiffies increase monotonically, and
wall_jiffies always has a value jiffies had a moment earlier, so the
difference jiffies - wall_jiffies ought to be nonnegative.

On the other hand, do_gettimeoffset() is a much more obscure function,
and the jumps are also explained if that can return a negative value.

Depending on CONFIG_X86_TSC it does do_slow_gettimeoffset or
do_fast_gettimeoffset. Both offer plenty of opportunities to
return a negative value. Things depend on hardware details.

So, instead of adding a test inside { } I would propose to catch
problems after the {}, e.g. by
	if ((long) usec < 0)
		usec = 0;

There should be a clue in the fact that the jump happens at the
start of a new second. I don't know what it is.

Andries
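
The effect of a negative offset, and of clamping it, can be modelled in
user space. This is a rough sketch under stated assumptions, not the
kernel function: it assumes usec is a 32-bit unsigned value that is
added to xtime.tv_usec and then normalized by a subtract-and-carry
loop, it leaves out the lost-ticks term, and model_gettimeofday and
struct tv32 are hypothetical names.

```c
#include <stdint.h>

struct tv32 {
	uint32_t sec;
	uint32_t usec;
};

/* offset models the value returned by do_gettimeoffset(); clamp selects
 * whether the "negative means zero" stopgap is applied. */
static struct tv32 model_gettimeofday(uint32_t xtime_sec, uint32_t xtime_usec,
				      int32_t offset, int clamp)
{
	struct tv32 tv;
	uint32_t usec = (uint32_t)offset;	/* negative wraps near 2^32 */
	uint32_t sec = xtime_sec;

	if (clamp && (int32_t)usec < 0)
		usec = 0;

	usec += xtime_usec;		/* 32-bit add, may wrap back again */
	while (usec >= 1000000u) {	/* carry whole seconds */
		usec -= 1000000u;
		sec++;
	}
	tv.sec = sec;
	tv.usec = usec;
	return tv;
}
```

In this model an offset of -20 with xtime_usec = 500000 wraps back to a
sane 499980 usec, while the same offset with xtime_usec = 5 stays near
2^32 and adds 4294 seconds. That would fit jumps showing up only just
after a second boundary, though it is a property of the model and not a
confirmed diagnosis.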

2003-08-05 10:32:44

by Jan Niehusmann

Subject: Re: time jumps (again)

On Mon, Aug 04, 2003 at 06:35:07PM +0200, Patrick Moor wrote:
> I'm noticing time jumps _exactly_ at the beginning of a "new" second (or
> at the end of an "old" one). the jump is exactly 4294 (4295) seconds
> into the future. Example:

We had the same problem with a similar setup (ASUS board, VIA chipset,
AMD CPU).

The solution is in the following thread, and AFAIK the patch went into
2.4.21:
http://www.ussg.iu.edu/hypermail/linux/kernel/0211.0/0330.html

Jan

2003-08-06 18:03:51

by Timothy Miller

Subject: Re: time jumps (again)

Is there any way the kernel could detect clock problems like drift and
jumps by comparing the effects of different timers? And when a problem
is detected, it can correct the situation automatically.

How many interrupt timers are there in various systems? How much can we
rely on the accuracy of each one?

2003-08-06 18:55:42

by George Anzinger

Subject: Re: time jumps (again)

Timothy Miller wrote:
> Is there any way the kernel could detect clock problems like drift and
> jumps by comparing the effects of different timers? And when a problem
> is detected, it can correct the situation automatically.
>
> How many interrupt timers are there in various systems? How much can we
> rely on the accuracy of each one?
>
In my high-res-timers model I don't rely on interrupts to "clock"
time, but rather pick some stable time source such as the ACPI
pm_timer. The interrupts are just used to remind the system to read
the clock.

In this model, the gettimeofday() request just reads that clock.
There is also code to keep the interrupts occurring on the proper
"boundaries" as defined by that clock.

The problem is finding a stable fast (as in time to read) clock
source. The TSC is not stable in a fair number of machines. The
pm_timer is an I/O access which is sloooow and will only get slower
WRT cpu cycle time as the boxes get faster.

Archs other than the x86 seem to do much better in this regard.

As for fixing what is in the x86 now, I would suggest that, if we are
using the TSC, we trust it with a bit of a longer time than the tick
time. It is relatively easy to detect drift WRT the PIT and correct
the TSC base line, but this should be done over a second or so and not
each tick as is done now. This would eliminate the PIT as well as the
TSC reference read at each interrupt and result in a more stable result.

To work correctly with NTP we would also need to adjust the TSC to
useconds multiplier to match what NTP thinks the TSC rate should be at
the moment.
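
One way to picture a TSC-to-useconds multiplier that is corrected over
a longer window is a fixed-point scale factor recomputed from a
reference measurement. The sketch below is an illustration under
assumptions, not the existing kernel code: the 32.32 fixed-point format
and the names calc_usec_mult and tsc_to_usec are made up here, and the
reference interval is assumed to come from some trusted source (PIT,
pm_timer, or the rate NTP asks for).

```c
#include <stdint.h>

/* Microseconds per TSC cycle as a 32.32 fixed-point number, derived
 * from tsc_cycles counted during ref_usec of reference time.
 * Returns 0 if no cycles were counted. */
static uint64_t calc_usec_mult(uint64_t tsc_cycles, uint64_t ref_usec)
{
	if (tsc_cycles == 0)
		return 0;
	return (ref_usec << 32) / tsc_cycles;
}

/* Convert a TSC delta to microseconds using the current multiplier.
 * Valid while tsc_delta * mult fits in 64 bits, which holds for
 * windows of about a second at GHz-range clock rates. */
static uint64_t tsc_to_usec(uint64_t tsc_delta, uint64_t mult)
{
	return (tsc_delta * mult) >> 32;
}
```

Recomputing the multiplier once per second from (cycles seen, reference
useconds elapsed) would absorb slow drift without touching the
reference clock on every tick. For a 650 MHz TSC, 650000000 cycles
measured over 1000000 usec gives a multiplier that maps 650000000
cycles back to one second, less a microsecond of rounding.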

I don't know if this work should be attempted at this point in the
development cycle, however. Possibly waiting for 2.7 is better.

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2003-08-07 00:30:08

by Andries Brouwer

Subject: Re: time jumps (again)

On Wed, Aug 06, 2003 at 02:16:35PM -0400, Timothy Miller wrote:

> Is there any way the kernel could detect clock problems like drift and
> jumps by comparing the effects of different timers? And when a problem
> is detected, it can correct the situation automatically.

In this particular case, I think my stopgap
	if ((long) usec < 0)
		usec = 0;
would suffice to eliminate the jumps.
Of course it would be better to understand the hardware details,
but perhaps we are insufficiently documented.