On Fri, Aug 20, 2010 at 8:55 AM, jean-philippe francois
<[email protected]> wrote:
> Hi,
>
> Please correct me if I am sending this to the wrong list.
LKML is better for this issue. Added.
Lin Ming
>
> I would like to debug the following issue, which I suspect is related
> to cpu losing track of time when going idle.
>
> When I run the following script :
>
> #!/bin/bash
> # set hwclock to system clock to start
> # the log file with the same time
> hwclock -w
> #start with a fresh logfile
> touch clocktest.log
> echo START > clocktest.log
> while [[ true ]]
> do
> ? ?systime=$(date)
> ? ?hwtime=$(hwclock -r)
> ? ?echo "$systime | $hwtime" >> clocktest.log
> ? ?sleep 20
> done
>
> Expected output :
> On each line, the system time (left) and hardware time (right)
> should be roughly the same. Time delta between each line should be
> around 20 seconds
>
> Observed output :
> This is an edited version of the real logfile ?so that it fits on one
> line. the date was removed.
> START
> 15:09:43 (UTC+0200) | ?15:09:44 CEST -0.500555 secondes
> 15:10:04 (UTC+0200) | ?15:10:05 CEST -1.001562 secondes
> 15:10:25 (UTC+0200) | ?15:10:26 CEST -0.985944 secondes
> 15:10:46 (UTC+0200) | ?15:15:46 CEST -0.032792 secondes <-- 5 minute lag !
> 15:11:06 (UTC+0200) | ?15:16:07 CEST -0.985926 secondes
> 15:11:27 (UTC+0200) | ?15:16:28 CEST -1.001539 secondes
> 15:11:48 (UTC+0200) | ?15:16:49 CEST -0.985969 secondes
> 15:12:09 (UTC+0200) | ?15:17:10 CEST -0.985977 secondes
> 15:12:30 (UTC+0200) | ?15:17:31 CEST -1.001544 secondes
> 15:12:51 (UTC+0200) | ?15:17:52 CEST -0.985920 secondes
> 15:13:12 (UTC+0200) | ?15:18:13 CEST -0.985939 secondes
> 15:13:33 (UTC+0200) | ?15:18:34 CEST -1.001581 secondes
> 15:13:54 (UTC+0200) | ?15:18:55 CEST -0.985917 secondes
> 15:14:15 (UTC+0200) | ?15:29:15 CEST -0.079685 secondes <-- add
> another 10 minute
> 15:14:35 (UTC+0200) | ?15:29:36 CEST -1.001568 secondes
> 15:14:56 (UTC+0200) | ?15:29:57 CEST -0.985952 secondes
>
> As you can see, after 20 minutes without touching the computer, the
> system time is late by 15 minutes.
> This was tested on ubuntu 10.04 with a 2.6.32 ?"vanilla" kernel where
> "vanilla" means ubuntu packaged
> upstream kernel ?closest to the distribution shipped kernel.
>
> How can I debug and possibly fix this issue ?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>
2010/8/21 Lin Ming <[email protected]>:
> On Fri, Aug 20, 2010 at 8:55 AM, jean-philippe francois
> <[email protected]> wrote:
>> Hi,
>>
>> Please correct me if I am sending this to the wrong list.
>
> LKML is better for this issue. Added.
>
Thank you. I have been looking at the ML archives, and it seems there
has been issue related to timekeeping lately. So I tested my script
with a decently recent kernel : 2.6.36-rc1.
The timekeeping bug described below still exists.
>>
>> I would like to debug the following issue, which I suspect is related
>> to cpu losing track of time when going idle.
>>
>> When I run the following script :
>>
>> #!/bin/bash
>> # set hwclock to system clock to start
>> # the log file with the same time
>> hwclock -w
>> #start with a fresh logfile
>> touch clocktest.log
>> echo START > clocktest.log
>> while [[ true ]]
>> do
>> ? ?systime=$(date)
>> ? ?hwtime=$(hwclock -r)
>> ? ?echo "$systime | $hwtime" >> clocktest.log
>> ? ?sleep 20
>> done
>>
>> Expected output :
>> On each line, the system time (left) and hardware time (right)
>> should be roughly the same. Time delta between each line should be
>> around 20 seconds
>>
>> Observed output :
>> This is an edited version of the real logfile ?so that it fits on one
>> line. the date was removed.
>> START
>> 15:09:43 (UTC+0200) | ?15:09:44 CEST -0.500555 secondes
>> 15:10:04 (UTC+0200) | ?15:10:05 CEST -1.001562 secondes
>> 15:10:25 (UTC+0200) | ?15:10:26 CEST -0.985944 secondes
>> 15:10:46 (UTC+0200) | ?15:15:46 CEST -0.032792 secondes <-- 5 minute lag !
>> 15:11:06 (UTC+0200) | ?15:16:07 CEST -0.985926 secondes
>> 15:11:27 (UTC+0200) | ?15:16:28 CEST -1.001539 secondes
>> 15:11:48 (UTC+0200) | ?15:16:49 CEST -0.985969 secondes
>> 15:12:09 (UTC+0200) | ?15:17:10 CEST -0.985977 secondes
>> 15:12:30 (UTC+0200) | ?15:17:31 CEST -1.001544 secondes
>> 15:12:51 (UTC+0200) | ?15:17:52 CEST -0.985920 secondes
>> 15:13:12 (UTC+0200) | ?15:18:13 CEST -0.985939 secondes
>> 15:13:33 (UTC+0200) | ?15:18:34 CEST -1.001581 secondes
>> 15:13:54 (UTC+0200) | ?15:18:55 CEST -0.985917 secondes
>> 15:14:15 (UTC+0200) | ?15:29:15 CEST -0.079685 secondes <-- add
>> another 10 minute
>> 15:14:35 (UTC+0200) | ?15:29:36 CEST -1.001568 secondes
>> 15:14:56 (UTC+0200) | ?15:29:57 CEST -0.985952 secondes
>>
>> As you can see, after 20 minutes without touching the computer, the
>> system time is late by 15 minutes.
>> This was tested on ubuntu 10.04 with a 2.6.32 ?"vanilla" kernel where
>> "vanilla" means ubuntu packaged
>> upstream kernel ?closest to the distribution shipped kernel.
>>
On Mon, Aug 23, 2010 at 1:46 AM, jean-philippe francois
<[email protected]> wrote:
> 2010/8/21 Lin Ming <[email protected]>:
>> On Fri, Aug 20, 2010 at 8:55 AM, jean-philippe francois
>> <[email protected]> wrote:
>>> Hi,
>>>
>>> Please correct me if I am sending this to the wrong list.
>>
>> LKML is better for this issue. Added.
>>
>
> Thank you. I have been looking at the ML archives, and it seems there
> has been issue related to timekeeping lately. So I tested my script
> with a decently recent kernel : 2.6.36-rc1.
>
> The timekeeping bug described below still exists.
>>>
>>> I would like to debug the following issue, which I suspect is related
>>> to cpu losing track of time when going idle.
>>>
>>> When I run the following script :
>>>
>>> #!/bin/bash
>>> # set hwclock to system clock to start
>>> # the log file with the same time
>>> hwclock -w
>>> #start with a fresh logfile
>>> touch clocktest.log
>>> echo START > clocktest.log
>>> while [[ true ]]
>>> do
>>> ? ?systime=$(date)
>>> ? ?hwtime=$(hwclock -r)
>>> ? ?echo "$systime | $hwtime" >> clocktest.log
>>> ? ?sleep 20
>>> done
>>>
>>> Expected output :
>>> On each line, the system time (left) and hardware time (right)
>>> should be roughly the same. Time delta between each line should be
>>> around 20 seconds
>>>
>>> Observed output :
>>> This is an edited version of the real logfile ?so that it fits on one
>>> line. the date was removed.
>>> START
>>> 15:09:43 (UTC+0200) | ?15:09:44 CEST -0.500555 secondes
>>> 15:10:04 (UTC+0200) | ?15:10:05 CEST -1.001562 secondes
>>> 15:10:25 (UTC+0200) | ?15:10:26 CEST -0.985944 secondes
>>> 15:10:46 (UTC+0200) | ?15:15:46 CEST -0.032792 secondes <-- 5 minute lag !
>>> 15:11:06 (UTC+0200) | ?15:16:07 CEST -0.985926 secondes
>>> 15:11:27 (UTC+0200) | ?15:16:28 CEST -1.001539 secondes
>>> 15:11:48 (UTC+0200) | ?15:16:49 CEST -0.985969 secondes
>>> 15:12:09 (UTC+0200) | ?15:17:10 CEST -0.985977 secondes
>>> 15:12:30 (UTC+0200) | ?15:17:31 CEST -1.001544 secondes
>>> 15:12:51 (UTC+0200) | ?15:17:52 CEST -0.985920 secondes
>>> 15:13:12 (UTC+0200) | ?15:18:13 CEST -0.985939 secondes
>>> 15:13:33 (UTC+0200) | ?15:18:34 CEST -1.001581 secondes
>>> 15:13:54 (UTC+0200) | ?15:18:55 CEST -0.985917 secondes
>>> 15:14:15 (UTC+0200) | ?15:29:15 CEST -0.079685 secondes <-- add
>>> another 10 minute
>>> 15:14:35 (UTC+0200) | ?15:29:36 CEST -1.001568 secondes
>>> 15:14:56 (UTC+0200) | ?15:29:57 CEST -0.985952 secondes
>>>
>>> As you can see, after 20 minutes without touching the computer, the
>>> system time is late by 15 minutes.
>>> This was tested on ubuntu 10.04 with a 2.6.32 ?"vanilla" kernel where
>>> "vanilla" means ubuntu packaged
>>> upstream kernel ?closest to the distribution shipped kernel.
Could you provide the output of:
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
I assume you're using the TSC, and hitting the halt-in-c3 issue, but
I'm curious why the clocksource watchdog isn't catching this and
dropping you back to a more stable clocksource.
thanks
-john
>
> Could you provide the output of:
> cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm
> cat /sys/devices/system/clocksource/clocksource0/current_clocksource
>
hpet
> I assume you're using the TSC, and hitting the halt-in-c3 issue, but
> I'm curious why the clocksource watchdog isn't catching this and
> dropping you back to a more stable clocksource.
>
I also attached the dmesg output.
Thank you for your help.
Jean-Philippe Fran?ois
Hi,
My Timekeeping bug is still present, here is an updated script and log.
I am willing to make test, but I don't know what kind of debugging
info is needed.
I am all eyes to any advice that could help solve this BUG.
Thank you,
Jean-Philippe Fran?ois
On Fri, 2010-08-27 at 16:12 +0200, jean-philippe francois wrote:
> My Timekeeping bug is still present, here is an updated script and log.
> I am willing to make test, but I don't know what kind of debugging
> info is needed.
>
> cat /sys/devices/system/clocksource/clocksource0/available_clocksource
> hpet acpi_pm
> cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> hpet
Huh. hpet was not what I would have expected.
So first, two experiments:
1) Does booting with "clock=acpi_pm" cause the issue to disappear?
2) Does booting with "nohz=off" cause the issue to disappear?
Venkatesh: You have any experience with HPETs that halt in idle?
thanks
-john
On Fri, Aug 27, 2010 at 11:11 AM, john stultz <[email protected]> wrote:
> On Fri, 2010-08-27 at 16:12 +0200, jean-philippe francois wrote:
>> My Timekeeping bug is still present, here is an updated script and log.
>> I am willing to make test, but I don't know what kind of debugging
>> info is needed.
>>
>> cat /sys/devices/system/clocksource/clocksource0/available_clocksource
>> hpet acpi_pm
>> cat /sys/devices/system/clocksource/clocksource0/current_clocksource
>> hpet
>
> Huh. hpet was not what I would have expected.
>
>
> So first, two experiments:
>
> 1) Does booting with "clock=acpi_pm" cause the issue to disappear?
>
> 2) Does booting with "nohz=off" cause the issue to disappear?
>
>
> Venkatesh: You have any experience with HPETs that halt in idle?
>
No. Haven't seen anything like this before with HPET.
The jump was
system | hardware(RTC)
15:48:43 | ven. 27 ao?t 2010 15:48:44 CEST -0.985908 secondes
15:49:04 | ven. 27 ao?t 2010 15:54:04 CEST -0.032800 secondes
We lost ~300 seconds and from dmesg
hpet0: 3 comparators, 64-bit 14.318180 MHz counter
Which is close to 2^32 HPET ticks. So, looks like we have some 32 bit
wraparound somewhere..
Thanks,
Venki
On Fri, 2010-08-27 at 17:10 -0700, Venkatesh Pallipadi wrote:
> On Fri, Aug 27, 2010 at 11:11 AM, john stultz <[email protected]> wrote:
> > On Fri, 2010-08-27 at 16:12 +0200, jean-philippe francois wrote:
> >> My Timekeeping bug is still present, here is an updated script and log.
> >> I am willing to make test, but I don't know what kind of debugging
> >> info is needed.
> >>
> >> cat /sys/devices/system/clocksource/clocksource0/available_clocksource
> >> hpet acpi_pm
> >> cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> >> hpet
> >
> > Huh. hpet was not what I would have expected.
> >
> >
> > So first, two experiments:
> >
> > 1) Does booting with "clock=acpi_pm" cause the issue to disappear?
> >
> > 2) Does booting with "nohz=off" cause the issue to disappear?
> >
> >
> > Venkatesh: You have any experience with HPETs that halt in idle?
> >
>
> No. Haven't seen anything like this before with HPET.
>
> The jump was
>
> system | hardware(RTC)
> 15:48:43 | ven. 27 août 2010 15:48:44 CEST -0.985908 secondes
> 15:49:04 | ven. 27 août 2010 15:54:04 CEST -0.032800 secondes
>
> We lost ~300 seconds and from dmesg
> hpet0: 3 comparators, 64-bit 14.318180 MHz counter
>
> Which is close to 2^32 HPET ticks. So, looks like we have some 32 bit
> wraparound somewhere..
Huh. So you're thinking the timer tick scheduler is pushing way past the
HPET wrap length, and then we're missing an accumulation point and
things wrap under us?
We have some code to try to limit the nohz length to avoid hardware
wrapping, but honestly I'd be surprised if the system is actually that
idle for that long (no timers firing for 5 minutes? that'd be really
impressive!).
jean-philippe: Is the drift always in 5 minute increments? Can you leave
it idle for 3 minutes and see a similar 3 minute delay, or is it always
in units of 5 ?
thanks
-john
2010/8/28 john stultz <[email protected]>:
>> > Huh. hpet was not what I would have expected.
>> >
>> >
>> > So first, two experiments:
>> >
>> > 1) Does booting with "clock=acpi_pm" cause the issue to disappear?
>> >
Yes, the problem usually show up quite quickly, and I think I have let
my script run
for a long enough time to say it does not occur with clock=acpi_pm
>> > 2) Does booting with "nohz=off" cause the issue to disappear?
>> >
Not tested yet.
>
> jean-philippe: Is the drift always in 5 minute increments? Can you leave
> it idle for 3 minutes and see a similar 3 minute delay, or is it always
> in units of 5 ?
>
That is interesting !
I think I have always seen 5 minutes or multiples of this delay. I
will try to leave it running for a longer
period of time and come back to you with more results. If you look at
the first message of this thread
(on linux-acpi) the delay is first 5, then 10 minutes.
Thank you for your interest.
Jean-Philippe
On Sat, Aug 28, 2010 at 5:45 AM, jean-philippe francois
<[email protected]> wrote:
> 2010/8/28 john stultz <[email protected]>:
>
>>> > Huh. hpet was not what I would have expected.
>>> >
>>> >
>>> > So first, two experiments:
>>> >
>>> > 1) Does booting with "clock=acpi_pm" cause the issue to disappear?
>>> >
>
> Yes, the problem usually show up quite quickly, and I think I have let
> my script run
> for a long enough time to say it does not occur with clock=acpi_pm
>
>>> > 2) Does booting with "nohz=off" cause the issue to disappear?
>>> >
> Not tested yet.
>
>>
>> jean-philippe: Is the drift always in 5 minute increments? Can you leave
>> it idle for 3 minutes and see a similar 3 minute delay, or is it always
>> in units of 5 ?
>>
> That is interesting !
> I think I have always seen 5 minutes or multiples of this delay. I
> will try to leave it running for a longer
> period of time and come back to you with more results. If you look at
> the first message of this thread
> (on linux-acpi) the delay is first 5, then 10 minutes.
>
OK. So, the problem seems to be limited to HPET and 5 min multiples.
> john stultz <[email protected]>:
> Huh. So you're thinking the timer tick scheduler is pushing way past the
> HPET wrap length, and then we're missing an accumulation point and
> things wrap under us?
John: I don't think it is that we are going tickless beyond HPET wrap
length. HPET max delta should be preventing that.
We probably need to read out HPET hardware counter to see whether
things are OK there. Even if we miss the interrupt, time update should
notice that. One potential problem may be that HPET hardware can only
do 32 bit mode, but advertising 64 bit mode and we are getting
confused when we update time?
jean-philippe: What does hpet section of "cat /proc/timer_list" look
like? Also, can you boot with "hpet=verbose" option and send the
dmesg.
Thanks,
Venki
>
> jean-philippe: Is the drift always in 5 minute increments? Can you leave
> it idle for 3 minutes and see a similar 3 minute delay, or is it always
> in units of 5 ?
>
It looks like it is always a 5 minute delay, or sometimes 10.
Attached is an edited version with the interesting bits of a test running
for 8 hours
Jean-Philippe Fran?ois