2006-01-08 19:31:21

by Chaitanya Hazarey

Subject: Back to the Future ? or some thing sinister ?

I think this is a problem that does not come along very often.

We have a machine, let's call it X; the make is IBM and the CPU is an
Intel Pentium 4 2.60 GHz. It's running a 2.6.13.1 kernel and, previously,
a 2.6.27-4 kernel; the distribution is Debian Sarge.

processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 2.60GHz
stepping : 9
cpu MHz : 2591.888
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
xtpr
bogomips : 5188.79




The problem is that after some time (fuzzy, but I think about 2 hours)
of inactivity, or because of some esoteric factor, the machine enters
a state in which its time starts going around in a loop. If I do
cat /proc/uptime, it goes 4 ticks ahead and then rewinds back to the
starting count (not zero, but the moment in time when the event was
triggered).
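
One way to capture the symptom described above is to sample /proc/uptime
at intervals and flag any reading lower than the one before it. The sketch
below is only an illustration: the function names, the polling interval
and the sample count are hypothetical choices, and on an affected machine
time.sleep() may itself behave oddly once the clock starts looping.

```python
import time


def read_uptime(path="/proc/uptime"):
    """Return the first field of /proc/uptime (seconds since boot) as a float."""
    with open(path) as f:
        return float(f.read().split()[0])


def find_backward_steps(samples):
    """Return (index, previous, current) for each sample lower than its predecessor."""
    steps = []
    for i in range(1, len(samples)):
        if samples[i] < samples[i - 1]:
            steps.append((i, samples[i - 1], samples[i]))
    return steps


def watch(interval=1.0, count=10):
    """Poll the uptime `count` times, `interval` seconds apart (both are assumptions)."""
    samples = []
    for _ in range(count):
        samples.append(read_uptime())
        time.sleep(interval)
    return find_backward_steps(samples)
```

Calling watch() on a healthy machine should return an empty list; a
non-empty list gives the sample index and the two readings around each
backward jump.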

The problem seems to be specific to the 2.6 series of kernel, not the
2.4 series.

I would like to know how to go about debugging the problem, and which
specific part of the kernel interacts directly with the RTC / system
clock.

Thanks,

Chaitanya


2006-01-09 04:03:30

by Nathan Lynch

Subject: Re: Back to the Future ? or some thing sinister ?

Chaitanya Hazarey wrote:
>
> We have got a machine, lets say X , make is IBM and the CPU is Intel
> Pentium 4 2.60 GHz. Its running a 2.6.13.1 Kernel and previously,
> 2.6.27-4 Kernel the distribution is Debian Sagre.
>
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 15
> model : 2
> model name : Intel(R) Pentium(R) 4 CPU 2.60GHz
> stepping : 9
> cpu MHz : 2591.888
> cache size : 512 KB
> fdiv_bug : no
> hlt_bug : no
> f00f_bug : no
> coma_bug : no
> fpu : yes
> fpu_exception : yes
> cpuid level : 2
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
> xtpr
> bogomips : 5188.79
>
>
>
>
> The problem is that, after a some time ( fuzzy , but I think like 2
> hours ) of inactivity or because of some esoteric factor which triggers
> a state in which the time on the machine starts going around in a loop.
> if I do cat /proc/uptime, it goes 4 ticks ahead and again rewinds back
> to the starting count ( not zero, but the moment in time when the event
> was triggred. )
>
> The problem seems to be specific to the 2.6 series of kernel, not the
> 2.4 series.
>
> I would like to know how to go about the debugging of the problem, and
> that which specific part of the kernel will be directly interacting with
> the rtc / system clock.

Look into upgrading the BIOS on that machine; I've had similar
problems on an IBM P4 workstation that were fixed in this way.

2006-01-09 15:26:48

by Ram Gupta

Subject: Re: Back to the Future ? or some thing sinister ?

On 1/8/06, Nathan Lynch <[email protected]> wrote:
> Chaitanya Hazarey wrote:
> >
> > We have got a machine, lets say X , make is IBM and the CPU is Intel
> > Pentium 4 2.60 GHz. Its running a 2.6.13.1 Kernel and previously,


Is this machine's time synchronized with some server using NTP? I
have seen a very similar issue when the clock deviation was more
than a second. If the clock is adjusted and the time difference becomes
more than 2 sec, the difference becomes negative because timeval has its
members as signed int. I think that issue might be playing a role
here.

Ram

2006-01-11 18:32:38

by Chaitanya Hazarey

Subject: Re: Back to the Future ? or some thing sinister ?

Ram Gupta wrote:

>On 1/8/06, Nathan Lynch <[email protected]> wrote:
>
>
>>Chaitanya Hazarey wrote:
>>
>>
>>>We have got a machine, lets say X , make is IBM and the CPU is Intel
>>>Pentium 4 2.60 GHz. Its running a 2.6.13.1 Kernel and previously,
>>>
>>>
>
>
>Is this machine's time is synchronized with some server using ntp. I
>had seen some very similar issue when the clock deviation was more
>than a second .If clock is adjusted and time difference becomes more
>than 2 sec the diffence becomes negative because timeval has its
>members as signed int.It think that issue might be playing a role
>here.
>
>
>
Nope, I tried everything: shutting down the NTP server, changing the NTP
server. Whatever I do, it still hangs intermittently. And if the
problem is because of NTP, why should it hang only on 2.6 and not on 2.4
kernels?

And the point is that when it reaches that stage, all the commands seem
to execute ultra slow.

Any help for diagnosing the problem is most welcome.

Thanks,

Chaitanya


2006-01-11 22:03:10

by john stultz

Subject: Re: Back to the Future ? or some thing sinister ?

On Sun, 2006-01-08 at 22:03 -0600, Nathan Lynch wrote:
> Chaitanya Hazarey wrote:
> >
> > We have got a machine, lets say X , make is IBM and the CPU is Intel
> > Pentium 4 2.60 GHz. Its running a 2.6.13.1 Kernel and previously,
> > 2.6.27-4 Kernel the distribution is Debian Sagre.
> >
[snip]
> >
> > The problem is that, after a some time ( fuzzy , but I think like 2
> > hours ) of inactivity or because of some esoteric factor which triggers
> > a state in which the time on the machine starts going around in a loop.
> > if I do cat /proc/uptime, it goes 4 ticks ahead and again rewinds back
> > to the starting count ( not zero, but the moment in time when the event
> > was triggred. )
> >
> > The problem seems to be specific to the 2.6 series of kernel, not the
> > 2.4 series.
> >
> > I would like to know how to go about the debugging of the problem, and
> > that which specific part of the kernel will be directly interacting with
> > the rtc / system clock.
>
> Look into upgrading the BIOS on that machine; I've had similar
> problems on a IBM P4 workstation that were fixed in this way.

Yes, there was a problematic BIOS on some IBM P4 systems that after a
few hours messed up the apic's timer interrupt frequency. I believe
booting w/ noapic will work around the issue, but the correct fix is to
update your BIOS.
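
A rough way to check whether the timer interrupt rate has drifted is to
read the timer line of /proc/interrupts twice and divide the difference
by the elapsed time. The sketch below is an illustration under stated
assumptions: it expects a line whose last field is "timer" with the
per-CPU counts right after the IRQ number, the sample text is invented,
and the function names are hypothetical. The layout of /proc/interrupts
varies between kernels, and the elapsed time is best measured with an
external clock, since the machine's own clock is the suspect.

```python
def timer_count(interrupts_text, label="timer"):
    """Sum the per-CPU counts on the line whose last field equals `label`."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[-1] == label:
            total = 0
            # Counts are the run of purely numeric fields after the "N:" column.
            for field in fields[1:]:
                if not field.isdigit():
                    break
                total += int(field)
            return total
    raise ValueError("no line ending in %r found" % label)


def interrupts_per_second(count_before, count_after, elapsed_seconds):
    """Average interrupt rate between two readings."""
    return (count_after - count_before) / elapsed_seconds


# Invented sample in the assumed layout, not output from the machine in question.
SAMPLE = (
    "           CPU0\n"
    "  0:     250000    IO-APIC-edge  timer\n"
    "  1:         10    IO-APIC-edge  i8042\n"
)
```

The resulting rate can then be compared with the HZ value the kernel was
built with; a rate far below it would be consistent with the BIOS problem
described above.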

Please file a bugzilla bug if upgrading your BIOS does resolve the
issue.

thanks
-john

2006-01-12 14:33:54

by Ram Gupta

Subject: Re: Back to the Future ? or some thing sinister ?

On 1/11/06, john stultz <[email protected]> wrote:
> On Sun, 2006-01-08 at 22:03 -0600, Nathan Lynch wrote:
> > Chaitanya Hazarey wrote:
> > >
> > > We have got a machine, lets say X , make is IBM and the CPU is Intel
> > > Pentium 4 2.60 GHz. Its running a 2.6.13.1 Kernel and previously,
> > > 2.6.27-4 Kernel the distribution is Debian Sagre.

It may be BIOS related. But I feel it might be an overflow-related
issue. If the variable is a signed int, then there will be a transition
from 0x7fffffff ns to 0x80000000 ns, which is basically from +2 sec to
-2 sec, which will result in a 4 sec loss.
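
The arithmetic behind that transition can be checked with a small
sketch. It emulates a 32-bit two's-complement value holding nanoseconds;
the helper to_int32 and the variable names are hypothetical, and nothing
here is taken from kernel code.

```python
NSEC_PER_SEC = 1_000_000_000


def to_int32(value):
    """Wrap an integer into the signed 32-bit two's-complement range."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


before = to_int32(0x7FFFFFFF)      # largest positive value, about +2.147 s in ns
after = to_int32(0x7FFFFFFF + 1)   # one step later the value wraps negative

# Size of the backward jump, in seconds.
jump_seconds = (before - after) / NSEC_PER_SEC
```

Here before is 2147483647, after is -2147483648, and the jump between
them is about 4.29 seconds, which matches the roughly 4 second figure
above.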

Ram

2006-01-12 18:08:44

by john stultz

Subject: Re: Back to the Future ? or some thing sinister ?

On Thu, 2006-01-12 at 08:33 -0600, Ram Gupta wrote:
> On 1/11/06, john stultz <[email protected]> wrote:
> > On Sun, 2006-01-08 at 22:03 -0600, Nathan Lynch wrote:
> > > Chaitanya Hazarey wrote:
> > > >
> > > > We have got a machine, lets say X , make is IBM and the CPU is Intel
> > > > Pentium 4 2.60 GHz. Its running a 2.6.13.1 Kernel and previously,
> > > > 2.6.27-4 Kernel the distribution is Debian Sagre.
>
> It may be BIOS related. But I feel it might be an overflow related
> issue. If the variable is signed int then there will be a transition
> from 0x7fffffff ns to 0x80000000 ns which is basically from +2 sec to
> -2 sec which will result in 4 sec loss.

I'm pretty sure this is the BIOS issue. If you're hesitant about updating
the BIOS, try booting w/ noapic, and see if that works around the issue.

The 4 second loss is the tv_nsec portion of the xtime timespec wrapping.
Since time is not accumulated (timer_interrupt isn't being called at the
normal HZ frequency), the TSC offset grows and grows (and finally will
wrap, repeating the process), causing the xtime.tv_nsec to wrap.
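
A toy model may help picture why the reported time loops rather than
simply stalling. It is not the kernel's code: it assumes, purely for
illustration, that the clock is a base value that stops advancing when
ticks stop, plus an offset held in a 32-bit unsigned nanosecond counter.
All names and the counter width are hypothetical.

```python
WRAP = 1 << 32                 # assumed width of the offset counter
NSEC_PER_SEC = 1_000_000_000


def reported_time(base_ns, elapsed_since_tick_ns):
    """Frozen base plus an offset that wraps at 2**32 nanoseconds."""
    return base_ns + (elapsed_since_tick_ns % WRAP)


def sample_stalled_clock(base_ns, seconds):
    """Read the toy clock once per real second while the base is not advancing."""
    return [
        reported_time(base_ns, s * NSEC_PER_SEC) / NSEC_PER_SEC
        for s in range(seconds)
    ]


# Base frozen at 7200 s (about two hours of uptime), sampled for 7 real seconds.
readings = sample_stalled_clock(7200 * NSEC_PER_SEC, 7)
```

In this model the readings climb from 7200 to 7204 and then drop back to
just above 7200, similar to the "4 ticks ahead, then rewind" pattern
reported at the start of the thread.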

Thus you are correct that the symptom is overflow related, but the cause
is most likely the BIOS.

thanks
-john