From: "Petr Vandrovec" <VANDROVE@vc.cvut.cz>
Organization: CC CTU Prague
To: john stultz <johnstul@us.ibm.com>
Date: Thu, 20 Nov 2003 00:10:50 +0200
MIME-Version: 1.0
Content-type: text/plain; charset=US-ASCII
Content-transfer-encoding: 7BIT
Subject: Re: Losing too many ticks! TSC cannot be used as time sourc
Cc: lkml <linux-kernel@vger.kernel.org>
Message-ID: <3A9BE1FAD@vcnet.vc.cvut.cz>
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3135
Lines: 73

On 19 Nov 03 at 14:37, john stultz wrote:
> On Wed, 2003-11-19 at 03:21, Petr Vandrovec wrote:
> > Hi,
> >   today morning my kernel (2.6.0-test9-something) said that
> > TSC became unusable. There are no other messages around this
> > time in the log. It could be thermal throttling, but it should
> > print some message when X86_MCE_P4THERMAL is enabled, yes?
> > It happened after 17hrs 56min of uptime. System never produced
> > this message before. 
> 
> Hmmm. Interesting. You haven't seen it before and you've been running
> 2.6.0-testX for awhile?

I do not know when this monitoring entered kernel, but system is running
2.6.0-testX kernels since July 22, when it booted 2.6.0-test1-c1534...
 
> Any heavy cron jobs running at that time? 

atsar says:
         cpu %usr %sys %nice %idle pswtch/s runq nrproc lavg1 ..5 ..15
08:30:01 all 3    1    0     95    134      3    125    0.03 0.03 0.00
08:40:01 all 3    1    0     95    108      4    125    0.07 0.05 0.01
08:50:01 all 3    1    0     95     83      4    123    0.03 0.03 0.00

High process switch rate is caused by running teletext grabber. There is
peak at 6:30-6:50 caused by cron.daily.
 
> The message is caused after 100 consecutive timer ticks where it appears
> that the system has lost ticks. The assumption is that something has
> gone wrong and we can no longer trust the TSC as a time source
> (speedstep, for instance). Thermal throttling is a possibility, but I've
> not actually seen it occur. I'd have to defer to the cpufreq folks on
> that one. 

bash# grep FREQ .config 
# CONFIG_CPU_FREQ is not set
bash#

> If we're getting false positives, we may have to bump that
> 100-consecutive-ticks number up. 
> 
> Anything else quirky about the system?

I'm not aware of any problem. Three or four months ago I reported memory
corruption in one of bitkeeper files (32bits replaced with something
else), but it did not occured again since then.
 
> >   Are there some more information I could supply, or should I
> > simple live with fact that TSC stopped working on my P4 today
> > morning?
> 
> Well, the TSC didn't stop working, we just stopped using it as a time
> source. Things should work fine using just the PIT, although I'd be
> interested to hear if ntpd finally settled down after the change. If it
> cannot stay synced it may be an issue w/ your system's PIT. 

It did not produce any message since 9:54 (>14 hrs ago). Only difference
is that in the past /var/lib/ntp/ntp.drift listed value 4.496 to 4.967
(looking at all values I have logged in /var/log/syslog), but now
it says 45.662. But what I can expect from PIT...

> You may also want to check the ACPI PM timesource code in -mm4, and see
> how that works for you. 

Sometime ;-) Maybe if it will hit Linus kernel ;-)
                                                    Petr Vandrovec
                                                    

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/