LinuxLists.cc - TSC not updating after resume: Bug or Feature?

2008-12-17 17:34:47

by Deepak Saxena

[permalink] [raw]

Subject: TSC not updating after resume: Bug or Feature?

Hi,

I am using ftrace to get a detailed timing analysis of the resume process
on the OLPC XO laptop and am seeing that when we start running again, the
system timestamep is not being updated for several (hundreds of) thousands
of cycles (~2000 ftrace entries). From following the ftrace, what is
happening is that the clocksource is not updated until we run the cpu_idle()
thread due to an explicit scheduling operation in the resume path that occurs
via call to msleep from pci_set_power_state(). As I'm still fully groking the
timekeeping code, the question(s) I have is whether this is expected behaviour
and I should not assume valid timestamp data in the initial bits of the suspend path,
whether this is an OLPC-specific bug, or whether I've uncovered a generic bug in
the timekeeping implementation. This is on 2.6.27.7 as I've not gotten 28-rc
up and running on the XO but can move that up in priority if this behaviour
is different in newer kernels.

Thanks,
~Deepak

# dsaxena: Return from firmware
do_suspend_trac-1267 [00] 154.925216: restore_processor_state <-ret_point
do_suspend_trac-1267 [00] 154.925216: enable_sep_cpu <-restore_processor_state
do_suspend_trac-1267 [00] 154.925216: olpc_fixup_wakeup <-olpc_pm_enter
do_suspend_trac-1267 [00] 154.925216: geode_get_dev_base <-olpc_fixup_wakeup
do_suspend_trac-1267 [00] 154.925216: olpc_ec_cmd <-olpc_fixup_wakeup
do_suspend_trac-1267 [00] 154.925216: _spin_lock_irqsave <-olpc_ec_cmd
do_suspend_trac-1267 [00] 154.925216: __wait_on_ibf <-olpc_ec_cmd
do_suspend_trac-1267 [00] 154.925216: printk <-olpc_ec_cmd
do_suspend_trac-1267 [00] 154.925216: vprintk <-printk
do_suspend_trac-1267 [00] 154.925216: _spin_lock <-vprintk
do_suspend_trac-1267 [00] 154.925216: emit_log_char <-vprintk
do_suspend_trac-1267 [00] 154.925216: emit_log_char <-vprintk
....
# pci_set_power_state() calls msleep(10) as per PCI spec
do_suspend_trac-1267 [00] 154.925216: msleep <-pci_set_power_state
do_suspend_trac-1267 [00] 154.925216: msecs_to_jiffies <-msleep
do_suspend_trac-1267 [00] 154.925216: schedule_timeout_uninterruptible <-msleep
do_suspend_trac-1267 [00] 154.925216: schedule_timeout <-schedule_timeout_uninterruptible
do_suspend_trac-1267 [00] 154.925216: init_timer_on_stack <-schedule_timeout
do_suspend_trac-1267 [00] 154.925216: __init_timer <-init_timer_on_stack
do_suspend_trac-1267 [00] 154.925216: __mod_timer <-schedule_timeout
do_suspend_trac-1267 [00] 154.925216: __timer_stats_timer_set_start_info <-__mod_timer
do_suspend_trac-1267 [00] 154.925216: lock_timer_base <-__mod_timer
do_suspend_trac-1267 [00] 154.925216: _spin_lock_irqsave <-lock_timer_base
do_suspend_trac-1267 [00] 154.925216: internal_add_timer <-__mod_timer
do_suspend_trac-1267 [00] 154.925216: _spin_unlock_irqrestore <-__mod_timer
do_suspend_trac-1267 [00] 154.925216: _spin_lock <-schedule
do_suspend_trac-1267 [00] 154.925216: marker_probe_cb <-schedule
do_suspend_trac-1267 [00] 154.925216: _spin_unlock <-tracing_record_cmdline
do_suspend_trac-1267 [00] 154.925216: __switch_to <-schedule
<idle>-0 [00] 154.925216: _spin_unlock_irq <-finish_task_switch
<idle>-0 [00] 154.925216: tick_nohz_stop_sched_tick <-cpu_idle
<idle>-0 [00] 154.925216: ktime_get <-tick_nohz_stop_sched_tick
<idle>-0 [00] 154.925216: ktime_get_ts <-ktime_get
<idle>-0 [00] 154.925216: getnstimeofday <-ktime_get_ts
<idle>-0 [00] 154.925216: set_normalized_timespec <-ktime_get_ts
<idle>-0 [00] 154.925216: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
<idle>-0 [00] 154.925216: _spin_lock <-get_next_timer_interrupt
<idle>-0 [00] 154.925216: _spin_unlock <-get_next_timer_interrupt
<idle>-0 [00] 154.925216: hrtimer_get_next_event <-get_next_timer_interrupt
<idle>-0 [00] 154.925216: _spin_lock_irqsave <-hrtimer_get_next_event
<idle>-0 [00] 154.925216: _spin_unlock_irqrestore <-hrtimer_get_next_event
<idle>-0 [00] 154.925216: rcu_needs_cpu <-tick_nohz_stop_sched_tick
<idle>-0 [00] 154.925216: rcu_pending <-rcu_needs_cpu
<idle>-0 [00] 154.925216: __rcu_pending <-rcu_pending
<idle>-0 [00] 154.925216: __rcu_pending <-rcu_pending
<idle>-0 [00] 154.925216: rcu_pending <-cpu_idle
<idle>-0 [00] 154.925216: __rcu_pending <-rcu_pending
<idle>-0 [00] 154.925216: __rcu_pending <-rcu_pending
<idle>-0 [00] 154.925216: default_idle <-cpu_idle
<idle>-0 [00] 154.925216: do_IRQ <-common_interrupt
<idle>-0 [00] 154.925216: irq_enter <-do_IRQ
<idle>-0 [00] 154.925216: tick_nohz_stop_idle <-irq_enter
<idle>-0 [00] 154.925216: ktime_get <-tick_nohz_stop_idle
<idle>-0 [00] 154.925216: ktime_get_ts <-ktime_get
<idle>-0 [00] 154.925216: getnstimeofday <-ktime_get_ts
<idle>-0 [00] 154.925216: set_normalized_timespec <-ktime_get_ts
<idle>-0 [00] 154.925216: ktime_get <-sched_clock_tick
<idle>-0 [00] 154.925216: ktime_get_ts <-ktime_get
<idle>-0 [00] 154.925216: getnstimeofday <-ktime_get_ts
<idle>-0 [00] 154.925216: set_normalized_timespec <-ktime_get_ts
# ktime_get_ts() calls getnstimeofday() which ends up calling clocksource_read()
# thus updating the kernel timestamp. note that we call getnstimeofday()
# twice above but for some reason they do not impact the time stamp...
# There is an IRQ in here, possibly an HRT-driven IRQ and I'm wondering
# if that is kicking the TSC into ticking?
<idle>-0 [00] 154.929837: touch_softlockup_watchdog <-sched_clock_idle_wakeup_event
<idle>-0 [00] 154.929839: tick_nohz_update_jiffies <-irq_enter

~Deepak

--
Deepak Saxena http://www.greenbasement.info
_____ __o Kernel Hacker, One Laptop Per Child (o>
------ -\<, Give One Laptop, Get One Laptop //\
----- ( )/ ( ) http://www.amazon.com/xo V_/_
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2008-12-18 22:20:40

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

On Thursday, 18 of December 2008, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > Rafael, would something like this explain why we had to revert Shaggy's
> > patch?

Well, I have yet to understand what the suspend-resume of the timekeeping code
actually does.

The original description sounds worrisome to me, it looks like we've overlooked
something at least.

> > His patch fixes the backward motion filter and I'm at an utter
> > loss why that would break suspend.
>
> yes, i'd love to have this commit reinstated:
>
> 5b7dba4: sched_clock: prevent scd->clock from moving backwards
>
> and the bug triggered by hibernation fixed instead.

Shaggy said he had an idea of what was wrong, so I expect an updated version
of the patch to appear.

Thanks,
Rafael

2008-12-18 22:26:04

by Dave Kleikamp

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

On Thu, 2008-12-18 at 23:19 +0100, Rafael J. Wysocki wrote:
> On Thursday, 18 of December 2008, Ingo Molnar wrote:
> >
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > Rafael, would something like this explain why we had to revert Shaggy's
> > > patch?
>
> Well, I have yet to understand what the suspend-resume of the timekeeping code
> actually does.
>
> The original description sounds worrisome to me, it looks like we've overlooked
> something at least.
>
> > > His patch fixes the backward motion filter and I'm at an utter
> > > loss why that would break suspend.
> >
> > yes, i'd love to have this commit reinstated:
> >
> > 5b7dba4: sched_clock: prevent scd->clock from moving backwards
> >
> > and the bug triggered by hibernation fixed instead.
>
> Shaggy said he had an idea of what was wrong, so I expect an updated version
> of the patch to appear.

Not really. I said I'd look at it, but all I know is that it looks like
something clock-related isn't initialized correctly after resume. I
don't know enough to have any idea how to fix it.

> Thanks,
> Rafael
--
David Kleikamp
IBM Linux Technology Center

2008-12-18 23:40:00

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

On Thursday, 18 of December 2008, Dave Kleikamp wrote:
> On Thu, 2008-12-18 at 23:19 +0100, Rafael J. Wysocki wrote:
> > On Thursday, 18 of December 2008, Ingo Molnar wrote:
> > >
> > > * Peter Zijlstra <[email protected]> wrote:
> > >
> > > > Rafael, would something like this explain why we had to revert Shaggy's
> > > > patch?
> >
> > Well, I have yet to understand what the suspend-resume of the timekeeping code
> > actually does.
> >
> > The original description sounds worrisome to me, it looks like we've overlooked
> > something at least.
> >
> > > > His patch fixes the backward motion filter and I'm at an utter
> > > > loss why that would break suspend.
> > >
> > > yes, i'd love to have this commit reinstated:
> > >
> > > 5b7dba4: sched_clock: prevent scd->clock from moving backwards
> > >
> > > and the bug triggered by hibernation fixed instead.
> >
> > Shaggy said he had an idea of what was wrong, so I expect an updated version
> > of the patch to appear.
>
> Not really. I said I'd look at it, but all I know is that it looks like
> something clock-related isn't initialized correctly after resume. I
> don't know enough to have any idea how to fix it.

Well, thanks for the update.

I'm going to have a look at the thing tomorrow, but I'm not sure I'll be able
to fix it.

Thanks,
Rafael

2008-12-21 16:28:23

by Thomas Gleixner

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

On Thu, 18 Dec 2008, Rafael J. Wysocki wrote:

> On Thursday, 18 of December 2008, Ingo Molnar wrote:
> >
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > Rafael, would something like this explain why we had to revert Shaggy's
> > > patch?
>
> Well, I have yet to understand what the suspend-resume of the timekeeping code
> actually does.

Thats rather simple:

suspend() saves the current time of the persistent clock (if
available), forwards the timekeeping variables so they can be reused
on resume, disables timekeeping activities and shuts down the clock
events layer.

resume() estimates the suspend time via persistent clock (if
available) and update xtime with the sleep length. After that it
reactivates timekeeping and resumes clock events and high resolution
timers.

So the sole purpose is:
- dis/enable timekeeping and clock event devices.
- keep track of the suspend time (if a persistent clock is available)

We reactivate clock event devices and hrtimers from timekeeping_resume
because clock events depend on functional timekeeping.

> The original description sounds worrisome to me, it looks like we've overlooked
> something at least.

Care to explain ?

> > > His patch fixes the backward motion filter and I'm at an utter
> > > loss why that would break suspend.
> >
> > yes, i'd love to have this commit reinstated:
> >
> > 5b7dba4: sched_clock: prevent scd->clock from moving backwards
> >
> > and the bug triggered by hibernation fixed instead.

Hmm. Depending on the hibernation sleep time we can end up with a
pretty long delta between the pre suspend and the post resume call to
__update_sched_clock().

I have the feeling that sched_clock looks into stale values after
resume and the first call to __update_sched_clock() trips over the
stale scd->clock value. Shaggy's patch brings scd->clock into the mix
and that might cause the whole machinery to blow up on resume.

Also we need to investigate whether sched_clock is referencing gtod
values _before_ timekeeping is resume.

Thanks,

tglx

2008-12-21 19:46:34

by Thomas Gleixner

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

On Sun, 21 Dec 2008, Thomas Gleixner wrote:
> On Thu, 18 Dec 2008, Rafael J. Wysocki wrote:
> > > 5b7dba4: sched_clock: prevent scd->clock from moving backwards
> > >
> > > and the bug triggered by hibernation fixed instead.
>
> Hmm. Depending on the hibernation sleep time we can end up with a
> pretty long delta between the pre suspend and the post resume call to
> __update_sched_clock().
>
> I have the feeling that sched_clock looks into stale values after
> resume and the first call to __update_sched_clock() trips over the
> stale scd->clock value. Shaggy's patch brings scd->clock into the mix
> and that might cause the whole machinery to blow up on resume.
>
> Also we need to investigate whether sched_clock is referencing gtod
> values _before_ timekeeping is resume.

I checked the two bugzillas (12149 & 12155) and both reporters have
hpet=force on the command line.

One of the reporters said: "If I de-select CONFIG_HPET_TIMER then the
issue went away ..."

It looks like this was not further investigated. Is this problem
reproducible on other systems as well ?

Thanks,

tglx

2008-12-21 20:55:29

by Thomas Gleixner

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

On Sun, 21 Dec 2008, Thomas Gleixner wrote:
> On Thu, 18 Dec 2008, Rafael J. Wysocki wrote:
> Also we need to investigate whether sched_clock is referencing gtod
> values _before_ timekeeping is resume.

It does. I added a check into __update_sched_clock() and it triggers
on every resume:

WARNING: at /home/tglx/work/kernel/git/linux-2.6/kernel/sched_clock.c:117 __update_sched_clock+0x31/0xad()
Modules linked in: fuse i915 drm bridge stp bnep sco l2cap bluetooth coretemp sunrpc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 cpufreq_ondemand acpi_cpufreq freq_table dm_mirror dm_region_hash dm_log dm_multipath uinput joydev thinkpad_acpi hwmon snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device pcspkr i2c_i801 snd_pcm_oss firewire_ohci i2c_core firewire_core yenta_socket iwlagn iTCO_wdt snd_mixer_oss rsrc_nonstatic crc_itu_t iTCO_vendor_support snd_pcm iwlcore snd_timer snd_page_alloc snd_hwdep rfkill snd soundcore e1000e wmi video output battery ac [last unloaded: microcode]
Pid: 3212, comm: pm-hibernate Not tainted 2.6.28-rc9 #80
Call Trace:
[<ffffffff8103ef04>] warn_on_slowpath+0x5d/0x82
[<ffffffff8117becf>] ? vsnprintf+0x669/0x6ba
[<ffffffff81099c38>] ? __mod_zone_page_state+0x9/0x5a
[<ffffffff810900b7>] ? __rmqueue+0x24/0x1fa
[<ffffffff81057689>] __update_sched_clock+0x31/0xad
[<ffffffff81090004>] ? __rmqueue_smallest+0xb7/0x146
[<ffffffff81057809>] sched_clock_cpu+0x104/0x120
[<ffffffff811b5030>] ? acpi_ns_search_one_scope+0xc/0x4b
[<ffffffff81367d14>] ? _spin_lock+0xe/0x11
[<ffffffff81038e17>] try_to_wake_up+0xff/0x1b1
[<ffffffff8128e4ca>] ? cpufreq_debug_printk+0x17/0x124
[<ffffffff81038ed6>] default_wake_function+0xd/0xf
[<ffffffff81052d94>] autoremove_wake_function+0x16/0x3d
[<ffffffff8103262f>] __wake_up_common+0x46/0x75
[<ffffffff81032c7c>] __wake_up+0x38/0x50
[<ffffffff8104f378>] insert_work+0x68/0x70
[<ffffffff8104fb94>] __queue_work+0x2f/0x42
[<ffffffff8104fc2f>] queue_work_on+0x47/0x50
[<ffffffff8104fdc1>] queue_work+0x1f/0x21
[<ffffffff8104fdde>] schedule_work+0x1b/0x1d
[<ffffffff8128f0c8>] cpufreq_resume+0x157/0x16c
[<ffffffff811fe436>] __sysdev_resume+0x3d/0x55
[<ffffffff811fe484>] sysdev_resume+0x36/0x77
[<ffffffff81203e28>] device_power_up+0x15/0x20
[<ffffffff810670c9>] hibernation_snapshot+0x181/0x210
[<ffffffff81067220>] hibernate+0xc8/0x1a6
[<ffffffff81065f75>] state_store+0x5e/0xdd
[<ffffffff81176f83>] kobj_attr_store+0x17/0x19
[<ffffffff811106db>] sysfs_write_file+0xe4/0x119
[<ffffffff810c1856>] vfs_write+0xb0/0x10a
[<ffffffff810c197e>] sys_write+0x4c/0x74
[<ffffffff8100c1fa>] system_call_fastpath+0x16/0x1b

2008-12-21 21:19:49

by Thomas Gleixner

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

Deepak,

On Wed, 17 Dec 2008, Deepak Saxena wrote:
>
> Hi,
>
> I am using ftrace to get a detailed timing analysis of the resume process
> on the OLPC XO laptop and am seeing that when we start running again, the
> system timestamep is not being updated for several (hundreds of) thousands
> of cycles (~2000 ftrace entries). From following the ftrace, what is
> happening is that the clocksource is not updated until we run the cpu_idle()
> thread due to an explicit scheduling operation in the resume path that occurs
> via call to msleep from pci_set_power_state(). As I'm still fully groking the
> timekeeping code, the question(s) I have is whether this is expected behaviour
> and I should not assume valid timestamp data in the initial bits of the suspend path,
> whether this is an OLPC-specific bug, or whether I've uncovered a generic bug in
> the timekeeping implementation. This is on 2.6.27.7 as I've not gotten 28-rc
> up and running on the XO but can move that up in priority if this behaviour
> is different in newer kernels.

Can you please upload the full trace somewhere ?

Thanks,

tglx

2008-12-21 22:43:48

by Fabio Comolli

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

Hi Thomas.

On Sun, Dec 21, 2008 at 8:43 PM, Thomas Gleixner <[email protected]> wrote:
> On Sun, 21 Dec 2008, Thomas Gleixner wrote:
>> On Thu, 18 Dec 2008, Rafael J. Wysocki wrote:
>> > > 5b7dba4: sched_clock: prevent scd->clock from moving backwards
>> > >
>> > > and the bug triggered by hibernation fixed instead.
>>
>> Hmm. Depending on the hibernation sleep time we can end up with a
>> pretty long delta between the pre suspend and the post resume call to
>> __update_sched_clock().
>>
>> I have the feeling that sched_clock looks into stale values after
>> resume and the first call to __update_sched_clock() trips over the
>> stale scd->clock value. Shaggy's patch brings scd->clock into the mix
>> and that might cause the whole machinery to blow up on resume.
>>
>> Also we need to investigate whether sched_clock is referencing gtod
>> values _before_ timekeeping is resume.
>
> I checked the two bugzillas (12149 & 12155) and both reporters have
> hpet=force on the command line.

Well, I'm the submitter of bug 12155 and I never had hpet=force on my
command line. I don't know where you found it.
Actually my command line is:

-------------------------------------------------------------------------------------------------------------------------
fcomolli@hawking:~> cat /proc/cmdline
root=/dev/sda2 resume=/dev/sda3 i8042.nomux=1 pciehp.pciehp_force=1
quiet splash=silent vga=791
-------------------------------------------------------------------------------------------------------------------------

By the way, I don't know if it matters but the problema happened with
in-kernel hibernation and also in out-of-tree TuxOnIce hibernation.
Maybe this can help debugging the issue, I don't know.

>
> One of the reporters said: "If I de-select CONFIG_HPET_TIMER then the
> issue went away ..."
>
> It looks like this was not further investigated. Is this problem
> reproducible on other systems as well ?
>
> Thanks,
>
> tglx
> --

Regards,
Fabio

> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2008-12-22 14:50:46

by Thomas Gleixner

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

Fabio,

On Sun, 21 Dec 2008, Fabio Comolli wrote:
> Well, I'm the submitter of bug 12155 and I never had hpet=force on my
> command line. I don't know where you found it.
> Actually my command line is:

My bad. I confused it with the HPET force enable. (Your BIOS does not
advertise HPET).

> By the way, I don't know if it matters but the problema happened with
> in-kernel hibernation and also in out-of-tree TuxOnIce hibernation.
> Maybe this can help debugging the issue, I don't know.

Hmm, does not ring a bell here. Can you please apply the patch below
to mainline and retest ?

Thanks,

tglx
-----
diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index e8ab096..c18a6dd 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -86,6 +86,14 @@ void sched_clock_init(void)
sched_clock_running = 1;
}

+void sched_clock_debug(void)
+{
+ struct sched_clock_data *scd = this_scd();
+
+ printk(KERN_INFO "SCD raw %llu gtod %llu clock %llu now %llu\n",
+ scd->tick_raw, scd->tick_gtod, scd->clock, sched_clock());
+}
+
/*
* min,max except they take wrapping into account
*/
@@ -118,13 +126,13 @@ static u64 __update_sched_clock(struct sched_clock_data *scd, u64 now)

/*
* scd->clock = clamp(scd->tick_gtod + delta,
- * max(scd->tick_gtod, scd->clock),
- * scd->tick_gtod + TICK_NSEC);
+ * max(scd->tick_gtod, scd->clock),
+ * max(scd->clock, scd->tick_gtod + TICK_NSEC));
*/

clock = scd->tick_gtod + delta;
min_clock = wrap_max(scd->tick_gtod, scd->clock);
- max_clock = scd->tick_gtod + TICK_NSEC;
+ max_clock = wrap_max(scd->clock, scd->tick_gtod + TICK_NSEC);

clock = wrap_max(clock, min_clock);
clock = wrap_min(clock, max_clock);
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index fa05e88..bafbd2d 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -304,6 +304,8 @@ static int timekeeping_suspended;
/* time in seconds when suspend began */
static unsigned long timekeeping_suspend_time;

+extern void sched_clock_debug(void);
+
/**
* timekeeping_resume - Resumes the generic timekeeping subsystem.
* @dev: unused
@@ -336,6 +338,8 @@ static int timekeeping_resume(struct sys_device *dev)
timekeeping_suspended = 0;
write_sequnlock_irqrestore(&xtime_lock, flags);

+ sched_clock_debug();
+
touch_softlockup_watchdog();

clockevents_notify(CLOCK_EVT_NOTIFY_RESUME, NULL);
@@ -357,6 +361,8 @@ static int timekeeping_suspend(struct sys_device *dev, pm_message_t state)
timekeeping_suspended = 1;
write_sequnlock_irqrestore(&xtime_lock, flags);

+ sched_clock_debug();
+
clockevents_notify(CLOCK_EVT_NOTIFY_SUSPEND, NULL);

return 0;

2008-12-22 15:01:00

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

* Thomas Gleixner <[email protected]> wrote:

> > By the way, I don't know if it matters but the problema happened with
> > in-kernel hibernation and also in out-of-tree TuxOnIce hibernation.
> > Maybe this can help debugging the issue, I don't know.
>
> Hmm, does not ring a bell here. Can you please apply the patch below to
> mainline and retest ?

... and he should send a dmesg after a suspend cycle, right?

Ingo

2008-12-22 18:48:30

by Thomas Gleixner

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

On Mon, 22 Dec 2008, Ingo Molnar wrote:
> * Thomas Gleixner <[email protected]> wrote:
>
> > > By the way, I don't know if it matters but the problema happened with
> > > in-kernel hibernation and also in out-of-tree TuxOnIce hibernation.
> > > Maybe this can help debugging the issue, I don't know.
> >
> > Hmm, does not ring a bell here. Can you please apply the patch below to
> > mainline and retest ?
>
> ... and he should send a dmesg after a suspend cycle, right?

Yes :)

I digged more in the bugzillas. Toralf added some debug to
__update_sched_clock():

- max_clock = wrap_max(scd->clock, scd->tick_gtod + TICK_NSEC);
+ max_clock = scd->tick_gtod + TICK_NSEC;
+ if (scd->clock > max_clock)
+ printk(KERN_INFO "%d %d\n", scd->clock, max_clock);

The interesting output is:

Dec 14 21:55:55 n22 Back to C!
Dec 14 21:55:55 n22 Extended CMOS year: 2000
Dec 14 21:55:55 n22 Force enabled HPET at resume
Dec 14 21:55:55 n22 212611283 77

The 77 is totaly bogus and it's likely not just a truncation of the
64bit value because (scd->clock > max_clock) evaluates to true. This
output is _AFTER_ timekeeping resume because the HPET force enable
message comes from timekeeping resume.

Toralf, can you please try the patch below on top of mainline ? Make
sure that CONFIG_PRINTK_TIME is off, otherwise the kernel
deadlocks. Please send the dmesg output from a suspend/resume cycle.

Thanks,

tglx
----
diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index e8ab096..9c307a2 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -86,6 +86,14 @@ void sched_clock_init(void)
sched_clock_running = 1;
}

+void sched_clock_debug(void)
+{
+ struct sched_clock_data *scd = this_scd();
+
+ printk(KERN_INFO "SCD raw %llu gtod %llu clock %llu now %llu\n",
+ scd->tick_raw, scd->tick_gtod, scd->clock, sched_clock());
+}
+
/*
* min,max except they take wrapping into account
*/
@@ -125,6 +133,11 @@ static u64 __update_sched_clock(struct sched_clock_data *scd, u64 now)
clock = scd->tick_gtod + delta;
min_clock = wrap_max(scd->tick_gtod, scd->clock);
max_clock = scd->tick_gtod + TICK_NSEC;
+ if (scd->clock > max_clock) {
+ printk(KERN_INFO "SC: %llu %llu %llu %llu %llu %llu\n",
+ scd->clock, scd->tick_raw, scd->tick_gtod, now,
+ max_clock, clock);
+ }

clock = wrap_max(clock, min_clock);
clock = wrap_min(clock, max_clock);
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index fa05e88..bafbd2d 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -304,6 +304,8 @@ static int timekeeping_suspended;
/* time in seconds when suspend began */
static unsigned long timekeeping_suspend_time;

+extern void sched_clock_debug(void);
+
/**
* timekeeping_resume - Resumes the generic timekeeping subsystem.
* @dev: unused
@@ -336,6 +338,8 @@ static int timekeeping_resume(struct sys_device *dev)
timekeeping_suspended = 0;
write_sequnlock_irqrestore(&xtime_lock, flags);

+ sched_clock_debug();
+
touch_softlockup_watchdog();

clockevents_notify(CLOCK_EVT_NOTIFY_RESUME, NULL);
@@ -357,6 +361,8 @@ static int timekeeping_suspend(struct sys_device *dev, pm_message_t state)
timekeeping_suspended = 1;
write_sequnlock_irqrestore(&xtime_lock, flags);

+ sched_clock_debug();
+
clockevents_notify(CLOCK_EVT_NOTIFY_SUSPEND, NULL);

return 0;

2008-12-22 20:20:12

by Fabio Comolli

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

Hi Thomas.
Please find the log attached (only one hibernate/resume cycle).

Regads,
Fabio

On Mon, Dec 22, 2008 at 3:48 PM, Thomas Gleixner <[email protected]> wrote:
> Fabio,
>
> On Sun, 21 Dec 2008, Fabio Comolli wrote:
>> Well, I'm the submitter of bug 12155 and I never had hpet=force on my
>> command line. I don't know where you found it.
>> Actually my command line is:
>
> My bad. I confused it with the HPET force enable. (Your BIOS does not
> advertise HPET).
>
>> By the way, I don't know if it matters but the problema happened with
>> in-kernel hibernation and also in out-of-tree TuxOnIce hibernation.
>> Maybe this can help debugging the issue, I don't know.
>
> Hmm, does not ring a bell here. Can you please apply the patch below
> to mainline and retest ?
>
> Thanks,
>
> tglx
> -----
> diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
> index e8ab096..c18a6dd 100644
> --- a/kernel/sched_clock.c
> +++ b/kernel/sched_clock.c
> @@ -86,6 +86,14 @@ void sched_clock_init(void)
> sched_clock_running = 1;
> }
>
> +void sched_clock_debug(void)
> +{
> + struct sched_clock_data *scd = this_scd();
> +
> + printk(KERN_INFO "SCD raw %llu gtod %llu clock %llu now %llu\n",
> + scd->tick_raw, scd->tick_gtod, scd->clock, sched_clock());
> +}
> +
> /*
> * min,max except they take wrapping into account
> */
> @@ -118,13 +126,13 @@ static u64 __update_sched_clock(struct sched_clock_data *scd, u64 now)
>
> /*
> * scd->clock = clamp(scd->tick_gtod + delta,
> - * max(scd->tick_gtod, scd->clock),
> - * scd->tick_gtod + TICK_NSEC);
> + * max(scd->tick_gtod, scd->clock),
> + * max(scd->clock, scd->tick_gtod + TICK_NSEC));
> */
>
> clock = scd->tick_gtod + delta;
> min_clock = wrap_max(scd->tick_gtod, scd->clock);
> - max_clock = scd->tick_gtod + TICK_NSEC;
> + max_clock = wrap_max(scd->clock, scd->tick_gtod + TICK_NSEC);
>
> clock = wrap_max(clock, min_clock);
> clock = wrap_min(clock, max_clock);
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index fa05e88..bafbd2d 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -304,6 +304,8 @@ static int timekeeping_suspended;
> /* time in seconds when suspend began */
> static unsigned long timekeeping_suspend_time;
>
> +extern void sched_clock_debug(void);
> +
> /**
> * timekeeping_resume - Resumes the generic timekeeping subsystem.
> * @dev: unused
> @@ -336,6 +338,8 @@ static int timekeeping_resume(struct sys_device *dev)
> timekeeping_suspended = 0;
> write_sequnlock_irqrestore(&xtime_lock, flags);
>
> + sched_clock_debug();
> +
> touch_softlockup_watchdog();
>
> clockevents_notify(CLOCK_EVT_NOTIFY_RESUME, NULL);
> @@ -357,6 +361,8 @@ static int timekeeping_suspend(struct sys_device *dev, pm_message_t state)
> timekeeping_suspended = 1;
> write_sequnlock_irqrestore(&xtime_lock, flags);
>
> + sched_clock_debug();
> +
> clockevents_notify(CLOCK_EVT_NOTIFY_SUSPEND, NULL);
>
> return 0;
>
>

Attachments:

(No filename) (3.14 kB)
dmesg.log (52.64 kB)
Download all attachments

2008-12-22 20:37:27

by Thomas Gleixner

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

On Mon, 22 Dec 2008, Thomas Gleixner wrote:
> On Mon, 22 Dec 2008, Ingo Molnar wrote:
> > * Thomas Gleixner <[email protected]> wrote:
> >
> > > > By the way, I don't know if it matters but the problema happened with
> > > > in-kernel hibernation and also in out-of-tree TuxOnIce hibernation.
> > > > Maybe this can help debugging the issue, I don't know.
> > >
> > > Hmm, does not ring a bell here. Can you please apply the patch below to
> > > mainline and retest ?
> >
> > ... and he should send a dmesg after a suspend cycle, right?
>
> Yes :)
>
> I digged more in the bugzillas. Toralf added some debug to
> __update_sched_clock():
>
> - max_clock = wrap_max(scd->clock, scd->tick_gtod + TICK_NSEC);
> + max_clock = scd->tick_gtod + TICK_NSEC;
> + if (scd->clock > max_clock)
> + printk(KERN_INFO "%d %d\n", scd->clock, max_clock);
>
> The interesting output is:
>
> Dec 14 21:55:55 n22 Back to C!
> Dec 14 21:55:55 n22 Extended CMOS year: 2000
> Dec 14 21:55:55 n22 Force enabled HPET at resume
> Dec 14 21:55:55 n22 212611283 77
>
> The 77 is totaly bogus and it's likely not just a truncation of the
> 64bit value because (scd->clock > max_clock) evaluates to true. This
> output is _AFTER_ timekeeping resume because the HPET force enable
> message comes from timekeeping resume.

Seems I'm talking to myself, but I think I finally decoded the
mystery:

resume()
cpufreq_resume()
tsc:time_cpufreq_notifier()
set_cyc2ns_scale()
sched_clock_idle_sleep_event()
sched_clock_tick()
ktime_get()
hpet_read()

This happens _BEFORE_ timekeeping has resumed, so hpet_read() returns
nonsense and the timekeeping code uses the stale pre suspend
xtime/clocksource reference values to calculate the time. So the gtod
reference in sched_clock can result in total crap depending on the
time when the suspend happened.

Shaggys patch clamps sched_clock to the stale scd->clock value which
might explain the further wreckage.

The above sequence happens only for CPUs with a CPU frequency coupled
TSC, so on newer machines with CPU frequency invariant TSC this does
not happen.

/me stomps off to find a box to confirm that.

Thanks,

tglx

2008-12-22 20:39:43

by Fabio Comolli

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

Hi Thomas.

On Mon, Dec 22, 2008 at 9:36 PM, Thomas Gleixner <[email protected]> wrote:
> On Mon, 22 Dec 2008, Thomas Gleixner wrote:
>> On Mon, 22 Dec 2008, Ingo Molnar wrote:
>> > * Thomas Gleixner <[email protected]> wrote:
>> >
>> > > > By the way, I don't know if it matters but the problema happened with
>> > > > in-kernel hibernation and also in out-of-tree TuxOnIce hibernation.
>> > > > Maybe this can help debugging the issue, I don't know.
>> > >
>> > > Hmm, does not ring a bell here. Can you please apply the patch below to
>> > > mainline and retest ?
>> >
>> > ... and he should send a dmesg after a suspend cycle, right?
>>
>> Yes :)
>>
>> I digged more in the bugzillas. Toralf added some debug to
>> __update_sched_clock():
>>
>> - max_clock = wrap_max(scd->clock, scd->tick_gtod + TICK_NSEC);
>> + max_clock = scd->tick_gtod + TICK_NSEC;
>> + if (scd->clock > max_clock)
>> + printk(KERN_INFO "%d %d\n", scd->clock, max_clock);
>>
>> The interesting output is:
>>
>> Dec 14 21:55:55 n22 Back to C!
>> Dec 14 21:55:55 n22 Extended CMOS year: 2000
>> Dec 14 21:55:55 n22 Force enabled HPET at resume
>> Dec 14 21:55:55 n22 212611283 77
>>
>> The 77 is totaly bogus and it's likely not just a truncation of the
>> 64bit value because (scd->clock > max_clock) evaluates to true. This
>> output is _AFTER_ timekeeping resume because the HPET force enable
>> message comes from timekeeping resume.
>
> Seems I'm talking to myself, but I think I finally decoded the
> mystery:
>
> resume()
> cpufreq_resume()
> tsc:time_cpufreq_notifier()
> set_cyc2ns_scale()
> sched_clock_idle_sleep_event()
> sched_clock_tick()
> ktime_get()
> hpet_read()
>
> This happens _BEFORE_ timekeeping has resumed, so hpet_read() returns
> nonsense and the timekeeping code uses the stale pre suspend
> xtime/clocksource reference values to calculate the time. So the gtod
> reference in sched_clock can result in total crap depending on the
> time when the suspend happened.
>
> Shaggys patch clamps sched_clock to the stale scd->clock value which
> might explain the further wreckage.
>
> The above sequence happens only for CPUs with a CPU frequency coupled
> TSC, so on newer machines with CPU frequency invariant TSC this does
> not happen.
>
> /me stomps off to find a box to confirm that.
>

Ok, just for reference here is my

fcomolli@hawking:~> cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 13
model name : Intel(R) Pentium(R) M processor 1.73GHz
stepping : 8
cpu MHz : 800.000
cache size : 2048 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe nx up bts est tm2
bogomips : 1596.95
clflush size : 64
power management:

Regards,
Fabio
> Thanks,
>
> tglx
>

2008-12-22 21:14:47

by Thomas Gleixner

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

Fabio,

On Mon, 22 Dec 2008, Fabio Comolli wrote:
> Hi Thomas.
>
> On Mon, Dec 22, 2008 at 9:36 PM, Thomas Gleixner <[email protected]> wrote:
> > On Mon, 22 Dec 2008, Thomas Gleixner wrote:
> >> On Mon, 22 Dec 2008, Ingo Molnar wrote:
> >> > * Thomas Gleixner <[email protected]> wrote:
> >> >
> >> > > > By the way, I don't know if it matters but the problema happened with
> >> > > > in-kernel hibernation and also in out-of-tree TuxOnIce hibernation.
> >> > > > Maybe this can help debugging the issue, I don't know.
> >> > >
> >> > > Hmm, does not ring a bell here. Can you please apply the patch below to
> >> > > mainline and retest ?
> >> >
> >> > ... and he should send a dmesg after a suspend cycle, right?
> >>
> >> Yes :)
> >>
> >> I digged more in the bugzillas. Toralf added some debug to
> >> __update_sched_clock():
> >>
> >> - max_clock = wrap_max(scd->clock, scd->tick_gtod + TICK_NSEC);
> >> + max_clock = scd->tick_gtod + TICK_NSEC;
> >> + if (scd->clock > max_clock)
> >> + printk(KERN_INFO "%d %d\n", scd->clock, max_clock);
> >>
> >> The interesting output is:
> >>
> >> Dec 14 21:55:55 n22 Back to C!
> >> Dec 14 21:55:55 n22 Extended CMOS year: 2000
> >> Dec 14 21:55:55 n22 Force enabled HPET at resume
> >> Dec 14 21:55:55 n22 212611283 77
> >>
> >> The 77 is totaly bogus and it's likely not just a truncation of the
> >> 64bit value because (scd->clock > max_clock) evaluates to true. This
> >> output is _AFTER_ timekeeping resume because the HPET force enable
> >> message comes from timekeeping resume.
> >
> > Seems I'm talking to myself, but I think I finally decoded the
> > mystery:
> >
> > resume()
> > cpufreq_resume()
> > tsc:time_cpufreq_notifier()
> > set_cyc2ns_scale()
> > sched_clock_idle_sleep_event()
> > sched_clock_tick()
> > ktime_get()
> > hpet_read()
> >
> > This happens _BEFORE_ timekeeping has resumed, so hpet_read() returns
> > nonsense and the timekeeping code uses the stale pre suspend
> > xtime/clocksource reference values to calculate the time. So the gtod
> > reference in sched_clock can result in total crap depending on the
> > time when the suspend happened.
> >
> > Shaggys patch clamps sched_clock to the stale scd->clock value which
> > might explain the further wreckage.
> >
> > The above sequence happens only for CPUs with a CPU frequency coupled
> > TSC, so on newer machines with CPU frequency invariant TSC this does
> > not happen.
> >
> > /me stomps off to find a box to confirm that.
> >
>
> Ok, just for reference here is my
>
> fcomolli@hawking:~> cat /proc/cpuinfo
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 13
> model name : Intel(R) Pentium(R) M processor 1.73GHz
> stepping : 8
> cpu MHz : 800.000
> cache size : 2048 KB
> fdiv_bug : no
> hlt_bug : no
> f00f_bug : no
> coma_bug : no
> fpu : yes
> fpu_exception : yes
> cpuid level : 2
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe nx up bts est tm2
> bogomips : 1596.95
> clflush size : 64
> power management:

Can please you revert the last patch and apply the following ? Does
the WARN_ON trigger ?

Thanks,

tglx
---
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index fa05e88..0d1526d 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -46,6 +46,9 @@ struct timespec xtime __attribute__ ((aligned (16)));
struct timespec wall_to_monotonic __attribute__ ((aligned (16)));
static unsigned long total_sleep_time; /* seconds */

+/* flag for if timekeeping is suspended */
+static int timekeeping_suspended;
+
static struct timespec xtime_cache __attribute__ ((aligned (16)));
void update_xtime_cache(u64 nsec)
{
@@ -92,6 +95,8 @@ void getnstimeofday(struct timespec *ts)
unsigned long seq;
s64 nsecs;

+ WARN_ON(timekeeping_suspended);
+
do {
seq = read_seqbegin(&xtime_lock);

@@ -299,8 +304,6 @@ void __init timekeeping_init(void)
write_sequnlock_irqrestore(&xtime_lock, flags);
}

-/* flag for if timekeeping is suspended */
-static int timekeeping_suspended;
/* time in seconds when suspend began */
static unsigned long timekeeping_suspend_time;

2008-12-22 21:35:29

by Fabio Comolli

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

Hi Thomas.

On Mon, Dec 22, 2008 at 10:13 PM, Thomas Gleixner <[email protected]> wrote:
> Fabio,
>
> On Mon, 22 Dec 2008, Fabio Comolli wrote:
>> Hi Thomas.
>>
>> On Mon, Dec 22, 2008 at 9:36 PM, Thomas Gleixner <[email protected]> wrote:
>> > On Mon, 22 Dec 2008, Thomas Gleixner wrote:
>> >> On Mon, 22 Dec 2008, Ingo Molnar wrote:
>> >> > * Thomas Gleixner <[email protected]> wrote:
>> >> >
>> >> > > > By the way, I don't know if it matters but the problema happened with
>> >> > > > in-kernel hibernation and also in out-of-tree TuxOnIce hibernation.
>> >> > > > Maybe this can help debugging the issue, I don't know.
>> >> > >
>> >> > > Hmm, does not ring a bell here. Can you please apply the patch below to
>> >> > > mainline and retest ?
>> >> >
>> >> > ... and he should send a dmesg after a suspend cycle, right?
>> >>
>> >> Yes :)
>> >>
>> >> I digged more in the bugzillas. Toralf added some debug to
>> >> __update_sched_clock():
>> >>
>> >> - max_clock = wrap_max(scd->clock, scd->tick_gtod + TICK_NSEC);
>> >> + max_clock = scd->tick_gtod + TICK_NSEC;
>> >> + if (scd->clock > max_clock)
>> >> + printk(KERN_INFO "%d %d\n", scd->clock, max_clock);
>> >>
>> >> The interesting output is:
>> >>
>> >> Dec 14 21:55:55 n22 Back to C!
>> >> Dec 14 21:55:55 n22 Extended CMOS year: 2000
>> >> Dec 14 21:55:55 n22 Force enabled HPET at resume
>> >> Dec 14 21:55:55 n22 212611283 77
>> >>
>> >> The 77 is totaly bogus and it's likely not just a truncation of the
>> >> 64bit value because (scd->clock > max_clock) evaluates to true. This
>> >> output is _AFTER_ timekeeping resume because the HPET force enable
>> >> message comes from timekeeping resume.
>> >
>> > Seems I'm talking to myself, but I think I finally decoded the
>> > mystery:
>> >
>> > resume()
>> > cpufreq_resume()
>> > tsc:time_cpufreq_notifier()
>> > set_cyc2ns_scale()
>> > sched_clock_idle_sleep_event()
>> > sched_clock_tick()
>> > ktime_get()
>> > hpet_read()
>> >
>> > This happens _BEFORE_ timekeeping has resumed, so hpet_read() returns
>> > nonsense and the timekeeping code uses the stale pre suspend
>> > xtime/clocksource reference values to calculate the time. So the gtod
>> > reference in sched_clock can result in total crap depending on the
>> > time when the suspend happened.
>> >
>> > Shaggys patch clamps sched_clock to the stale scd->clock value which
>> > might explain the further wreckage.
>> >
>> > The above sequence happens only for CPUs with a CPU frequency coupled
>> > TSC, so on newer machines with CPU frequency invariant TSC this does
>> > not happen.
>> >
>> > /me stomps off to find a box to confirm that.
>> >
>>
>> Ok, just for reference here is my
>>
>> fcomolli@hawking:~> cat /proc/cpuinfo
>> processor : 0
>> vendor_id : GenuineIntel
>> cpu family : 6
>> model : 13
>> model name : Intel(R) Pentium(R) M processor 1.73GHz
>> stepping : 8
>> cpu MHz : 800.000
>> cache size : 2048 KB
>> fdiv_bug : no
>> hlt_bug : no
>> f00f_bug : no
>> coma_bug : no
>> fpu : yes
>> fpu_exception : yes
>> cpuid level : 2
>> wp : yes
>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
>> pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe nx up bts est tm2
>> bogomips : 1596.95
>> clflush size : 64
>> power management:
>
> Can please you revert the last patch and apply the following ? Does
> the WARN_ON trigger ?
>
> Thanks,
>
> tglx
> ---
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index fa05e88..0d1526d 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -46,6 +46,9 @@ struct timespec xtime __attribute__ ((aligned (16)));
> struct timespec wall_to_monotonic __attribute__ ((aligned (16)));
> static unsigned long total_sleep_time; /* seconds */
>
> +/* flag for if timekeeping is suspended */
> +static int timekeeping_suspended;
> +
> static struct timespec xtime_cache __attribute__ ((aligned (16)));
> void update_xtime_cache(u64 nsec)
> {
> @@ -92,6 +95,8 @@ void getnstimeofday(struct timespec *ts)
> unsigned long seq;
> s64 nsecs;
>
> + WARN_ON(timekeeping_suspended);
> +
> do {
> seq = read_seqbegin(&xtime_lock);
>
> @@ -299,8 +304,6 @@ void __init timekeeping_init(void)
> write_sequnlock_irqrestore(&xtime_lock, flags);
> }
>
> -/* flag for if timekeeping is suspended */
> -static int timekeeping_suspended;
> /* time in seconds when suspend began */
> static unsigned long timekeeping_suspend_time;
>
>

Yes, it does:

[ 159.487588] Syncing filesystems ... done.
[ 159.488523] Freezing user space processes ... (elapsed 0.00 seconds) done.
[ 159.489763] Freezing remaining freezable tasks ... (elapsed 0.00
seconds) done.
[ 159.491144] PM: Shrinking memory... done (0 pages freed)
[ 159.556730] PM: Freed 0 kbytes in 0.06 seconds (0.00 MB/s)
[ 159.556752] Suspending console(s) (use no_console_suspend to debug)
[ 159.557328] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 159.621764] pciehp 0000:00:1c.0:pcie02: pciehp_suspend ENTRY
[ 159.621989] ACPI handle has no context!
[ 159.621999] tifm_7xx1 0000:06:06.3: PCI INT A disabled
[ 159.622008] ACPI handle has no context!
[ 159.636240] ACPI handle has no context!
[ 159.652272] eth1: Going into suspend...
[ 159.655644] ipw2200 0000:06:05.0: PCI INT A disabled
[ 159.655652] ACPI handle has no context!
[ 159.668043] [fglrx] Power down the ASIC .
[ 159.764052] fglrx_pci 0000:01:00.0: PCI INT A disabled
[ 159.764175] ata2: port disabled. ignoring.
[ 159.764232] ata_piix 0000:00:1f.1: PCI INT A disabled
[ 159.764507] Intel ICH 0000:00:1e.2: PCI INT A disabled
[ 159.764646] ehci_hcd 0000:00:1d.7: PCI INT A disabled
[ 159.764686] uhci_hcd 0000:00:1d.3: PCI INT D disabled
[ 159.764725] uhci_hcd 0000:00:1d.2: PCI INT C disabled
[ 159.764764] uhci_hcd 0000:00:1d.1: PCI INT B disabled
[ 159.764803] uhci_hcd 0000:00:1d.0: PCI INT A disabled
[ 159.764808] pciehp 0000:00:1c.0:pcie02: pciehp_suspend ENTRY
[ 159.765572] ACPI: Preparing to enter system sleep state S4
[ 159.766924] Disabling non-boot CPUs ...
[ 159.767100] PM: Creating hibernation image:
[ 159.768005] PM: Need to copy 178481 pages
[ 159.768005] Intel machine check architecture supported.
[ 159.768005] Intel machine check reporting enabled on CPU#0.
[ 159.768005] ------------[ cut here ]------------
[ 159.768005] WARNING: at kernel/time/timekeeping.c:98
getnstimeofday+0x21/0xcd()
[ 159.768005] Modules linked in: cisco_ipsec(P) fglrx(P)
[ 159.768005] Pid: 5883, comm: s2disk Tainted: P
2.6.28-rc9-00050-gf8fd8a7 #9
[ 159.768005] Call Trace:
[ 159.768005] [<c0122ef8>] warn_on_slowpath+0x40/0x61
[ 159.768005] [<c024ddec>] acpi_ex_insert_into_field+0x2cf/0x2ef
[ 159.768005] [<c0154c53>] __rmqueue_smallest+0xbe/0x133
[ 159.768005] [<c01389a2>] getnstimeofday+0x21/0xcd
[ 159.768005] [<c0135e26>] ktime_get_ts+0x1d/0x3f
[ 159.768005] [<c0135e57>] ktime_get+0xf/0x2b
[ 159.768005] [<c013736b>] sched_clock_tick+0x46/0x83
[ 159.768005] [<c01373ad>] sched_clock_idle_wakeup_event+0x5/0xa
[ 159.768005] [<c0108670>] set_cyc2ns_scale+0x3f/0x5e
[ 159.768005] [<c01088af>] time_cpufreq_notifier+0xf9/0x103
[ 159.768005] [<c0136782>] notifier_call_chain+0x2a/0x52
[ 159.768005] [<c0136866>] __srcu_notifier_call_chain+0x35/0x4a
[ 159.768005] [<c0136884>] srcu_notifier_call_chain+0x9/0xc
[ 159.768005] [<c031cbca>] cpufreq_resume+0xf3/0x112
[ 159.768005] [<c028c3f3>] __sysdev_resume+0x24/0x34
[ 159.768005] [<c028c421>] sysdev_resume+0x1e/0x50
[ 159.768005] [<c0290637>] device_power_up+0x8/0x10
[ 159.768005] [<c0149478>] create_image+0x8f/0x9f
[ 159.768005] [<c01494ed>] hibernation_snapshot+0x65/0xce
[ 159.768005] [<c014bddc>] snapshot_ioctl+0x208/0x47d
[ 159.768005] [<c014bbd4>] snapshot_ioctl+0x0/0x47d
[ 159.768005] [<c017a279>] vfs_ioctl+0x1c/0x5f
[ 159.768005] [<c017a785>] do_vfs_ioctl+0x167/0x172
[ 159.768005] [<c017a7bc>] sys_ioctl+0x2c/0x42
[ 159.768005] [<c01038cd>] sysenter_do_call+0x12/0x35
[ 159.768005] ---[ end trace 828aa214905e8df0 ]---
[ 313.187879] Force enabled HPET at resume
[ 159.771124] ACPI: Waking up from system sleep state S4
[ 159.792183] ACPI: EC: non-query interrupt received, switching to
interrupt mode
[ 159.855414] pcieport-driver 0000:00:01.0: setting latency timer to 64
[ 159.855497] pcieport-driver 0000:00:1c.0: setting latency timer to 64
[ 159.855502] pciehp 0000:00:1c.0:pcie02: pciehp_resume ENTRY
[ 160.856182] uhci_hcd 0000:00:1d.0: PCI INT A -> GSI 23 (level, low) -> IRQ 23
[ 160.856190] uhci_hcd 0000:00:1d.0: setting latency timer to 64
[ 160.856237] uhci_hcd 0000:00:1d.1: PCI INT B -> GSI 19 (level, low) -> IRQ 19
[ 160.856244] uhci_hcd 0000:00:1d.1: setting latency timer to 64
[ 160.856291] uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
[ 160.856297] uhci_hcd 0000:00:1d.2: setting latency timer to 64
[ 160.856343] uhci_hcd 0000:00:1d.3: PCI INT D -> GSI 16 (level, low) -> IRQ 16
[ 160.856350] uhci_hcd 0000:00:1d.3: setting latency timer to 64
[ 160.856396] ehci_hcd 0000:00:1d.7: PCI INT A -> GSI 23 (level, low) -> IRQ 23
[ 160.856404] ehci_hcd 0000:00:1d.7: setting latency timer to 64
[ 160.856521] pci 0000:00:1e.0: setting latency timer to 64
[ 160.856569] Intel ICH 0000:00:1e.2: PCI INT A -> GSI 17 (level,
low) -> IRQ 17
[ 160.856576] Intel ICH 0000:00:1e.2: setting latency timer to 64
[ 161.870198] ata_piix 0000:00:1f.1: restoring config space at offset
0x1 (was 0x2880001, writing 0x2880005)
[ 161.870211] ata_piix 0000:00:1f.1: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[ 161.870216] ata_piix 0000:00:1f.1: setting latency timer to 64
[ 161.870669] fglrx_pci 0000:01:00.0: power state changed by ACPI to D0
[ 161.870679] fglrx_pci 0000:01:00.0: PCI INT A -> GSI 16 (level,
low) -> IRQ 16
[ 161.870684] fglrx_pci 0000:01:00.0: setting latency timer to 64
[ 161.883293] ata2: port disabled. ignoring.
[ 161.884515] [fglrx] Power up the ASIC
[ 161.884554] eth1: Coming out of suspend...
[ 161.900173] ipw2200 0000:06:05.0: enabling device (0000 -> 0002)
[ 161.900181] ipw2200 0000:06:05.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
[ 161.900196] ipw2200 0000:06:05.0: restoring config space at offset
0xf (was 0x18030100, writing 0x1803010b)
[ 161.900226] ipw2200 0000:06:05.0: restoring config space at offset
0x4 (was 0x0, writing 0xc8206000)
[ 161.900234] ipw2200 0000:06:05.0: restoring config space at offset
0x3 (was 0x0, writing 0x8008)
[ 161.900243] ipw2200 0000:06:05.0: restoring config space at offset
0x1 (was 0x2900002, writing 0x2900116)
[ 162.072723] ata1.01: ACPI cmd ef/03:0c:00:00:00:b0 filtered out
[ 162.072726] ata1.01: ACPI cmd ef/03:22:00:00:00:b0 filtered out
[ 162.080461] ata1.00: ACPI cmd ef/03:0c:00:00:00:a0 filtered out
[ 162.080463] ata1.00: ACPI cmd ef/03:45:00:00:00:a0 filtered out
[ 162.096951] ata1.00: configured for UDMA/100
[ 162.128666] ata1.01: configured for MWDMA2
[ 162.129488] sd 0:0:0:0: [sda] 156301488 512-byte hardware sectors:
(80.0 GB/74.5 GiB)
[ 162.129519] sd 0:0:0:0: [sda] Write Protect is off
[ 162.129521] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 162.129569] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[ 162.129621] sd 0:0:0:0: [sda] 156301488 512-byte hardware sectors:
(80.0 GB/74.5 GiB)
[ 162.129648] sd 0:0:0:0: [sda] Write Protect is off
[ 162.129650] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 162.129696] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[ 162.168175] firewire_ohci 0000:06:06.2: restoring config space at
offset 0xf (was 0x4020300, writing 0x402030b)
[ 162.168203] firewire_ohci 0000:06:06.2: restoring config space at
offset 0x5 (was 0x0, writing 0xc8200000)
[ 162.168211] firewire_ohci 0000:06:06.2: restoring config space at
offset 0x4 (was 0x0, writing 0xc8208000)
[ 162.168218] firewire_ohci 0000:06:06.2: restoring config space at
offset 0x3 (was 0x800000, writing 0x808008)
[ 162.168228] firewire_ohci 0000:06:06.2: restoring config space at
offset 0x1 (was 0x2100000, writing 0x2100116)
[ 162.256038] tifm_7xx1 0000:06:06.3: restoring config space at
offset 0xf (was 0x40701ff, writing 0x407010a)
[ 162.256067] tifm_7xx1 0000:06:06.3: restoring config space at
offset 0x4 (was 0x0, writing 0xc8204000)
[ 162.256075] tifm_7xx1 0000:06:06.3: restoring config space at
offset 0x3 (was 0x800000, writing 0x808008)
[ 162.256085] tifm_7xx1 0000:06:06.3: restoring config space at
offset 0x1 (was 0x2100000, writing 0x2100106)
[ 162.256107] tifm_7xx1 0000:06:06.3: PCI INT A -> GSI 22 (level,
low) -> IRQ 22
[ 162.256283] pciehp 0000:00:1c.0:pcie02: pciehp_resume ENTRY
[ 163.260341] sd 0:0:0:0: [sda] Starting disk
[ 163.800031] usb 3-1: reset full speed USB device using uhci_hcd and address 2
[ 164.014985] Restarting tasks ... done.

2008-12-22 22:11:38

by Thomas Gleixner

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

Fabio,

On Mon, 22 Dec 2008, Fabio Comolli wrote:
> On Mon, Dec 22, 2008 at 10:13 PM, Thomas Gleixner <[email protected]> wrote:
> >
> > Can please you revert the last patch and apply the following ? Does
> > the WARN_ON trigger ?
> >

> Yes, it does:

> [ 159.768005] [<c01389a2>] getnstimeofday+0x21/0xcd
> [ 159.768005] [<c0135e26>] ktime_get_ts+0x1d/0x3f
> [ 159.768005] [<c0135e57>] ktime_get+0xf/0x2b
> [ 159.768005] [<c013736b>] sched_clock_tick+0x46/0x83
> [ 159.768005] [<c01373ad>] sched_clock_idle_wakeup_event+0x5/0xa
> [ 159.768005] [<c0108670>] set_cyc2ns_scale+0x3f/0x5e
> [ 159.768005] [<c01088af>] time_cpufreq_notifier+0xf9/0x103
> [ 159.768005] [<c0136782>] notifier_call_chain+0x2a/0x52
> [ 159.768005] [<c0136866>] __srcu_notifier_call_chain+0x35/0x4a
> [ 159.768005] [<c0136884>] srcu_notifier_call_chain+0x9/0xc
> [ 159.768005] [<c031cbca>] cpufreq_resume+0xf3/0x112
> [ 159.768005] [<c028c3f3>] __sysdev_resume+0x24/0x34
> [ 159.768005] [<c028c421>] sysdev_resume+0x1e/0x50

Thanks for testing. It's exaclty the code path I described :)

So my code analysis holds and your test confirms my suspicion that
Shaggy's patch just unearthed some other weirdness in the
suspend/resume code.

Can you please apply the following hack^Wpatch and retest ? It
restores Shaggys patch, but prevents the sched_clock_tick() call when
timekeeping is not resumed. The WARN_ON should not longer trigger
except there is some other code path which fiddles with that as well.

If I'm not completely nuts then this should solve your suspend/resume
problem really instead of papering over the root cause.

The patch should work on top of 2.6.27.10 as well, so whatever is
easier to verify for you is fine.

Thanks,

tglx
---
diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index e8ab096..dd97801 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -124,7 +124,7 @@ static u64 __update_sched_clock(struct sched_clock_data *scd, u64 now)

clock = scd->tick_gtod + delta;
min_clock = wrap_max(scd->tick_gtod, scd->clock);
- max_clock = scd->tick_gtod + TICK_NSEC;
+ max_clock = wrap_max(scd->clock, scd->tick_gtod + TICK_NSEC);

clock = wrap_max(clock, min_clock);
clock = wrap_min(clock, max_clock);
@@ -222,11 +222,16 @@ void sched_clock_idle_sleep_event(void)
}
EXPORT_SYMBOL_GPL(sched_clock_idle_sleep_event);

+extern int timekeeping_suspended;
+
/*
* We just idled delta nanoseconds (called with irqs disabled):
*/
void sched_clock_idle_wakeup_event(u64 delta_ns)
{
+ if (timekeeping_suspended)
+ return;
+
sched_clock_tick();
touch_softlockup_watchdog();
}
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index fa05e88..50ba3d0 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -46,6 +46,9 @@ struct timespec xtime __attribute__ ((aligned (16)));
struct timespec wall_to_monotonic __attribute__ ((aligned (16)));
static unsigned long total_sleep_time; /* seconds */

+/* flag for if timekeeping is suspended */
+int timekeeping_suspended;
+
static struct timespec xtime_cache __attribute__ ((aligned (16)));
void update_xtime_cache(u64 nsec)
{
@@ -92,6 +95,8 @@ void getnstimeofday(struct timespec *ts)
unsigned long seq;
s64 nsecs;

+ WARN_ON(timekeeping_suspended);
+
do {
seq = read_seqbegin(&xtime_lock);

@@ -299,8 +304,6 @@ void __init timekeeping_init(void)
write_sequnlock_irqrestore(&xtime_lock, flags);
}

-/* flag for if timekeeping is suspended */
-static int timekeeping_suspended;
/* time in seconds when suspend began */
static unsigned long timekeeping_suspend_time;

2008-12-22 22:13:00

by Fabio Comolli

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

Won't apply:

fcomolli@hawking:~/software/GIT-TREES/linux-2.6> patch -p1 --dry-run <
/home/fcomolli/0009c-tglx-debug.patch
patching file kernel/sched_clock.c
patching file kernel/time/timekeeping.c
Hunk #1 succeeded at 49 with fuzz 2 (offset 3 lines).
Hunk #2 succeeded at 100 with fuzz 2 (offset 5 lines).
Hunk #3 FAILED at 309.
1 out of 3 hunks FAILED -- saving rejects to file kernel/time/timekeeping.c.rej

Should I revert you previous one?

Regards,
Fabio

On Mon, Dec 22, 2008 at 11:05 PM, Thomas Gleixner <[email protected]> wrote:
> Fabio,
>
> On Mon, 22 Dec 2008, Fabio Comolli wrote:
>> On Mon, Dec 22, 2008 at 10:13 PM, Thomas Gleixner <[email protected]> wrote:
>> >
>> > Can please you revert the last patch and apply the following ? Does
>> > the WARN_ON trigger ?
>> >
>
>> Yes, it does:
>
>> [ 159.768005] [<c01389a2>] getnstimeofday+0x21/0xcd
>> [ 159.768005] [<c0135e26>] ktime_get_ts+0x1d/0x3f
>> [ 159.768005] [<c0135e57>] ktime_get+0xf/0x2b
>> [ 159.768005] [<c013736b>] sched_clock_tick+0x46/0x83
>> [ 159.768005] [<c01373ad>] sched_clock_idle_wakeup_event+0x5/0xa
>> [ 159.768005] [<c0108670>] set_cyc2ns_scale+0x3f/0x5e
>> [ 159.768005] [<c01088af>] time_cpufreq_notifier+0xf9/0x103
>> [ 159.768005] [<c0136782>] notifier_call_chain+0x2a/0x52
>> [ 159.768005] [<c0136866>] __srcu_notifier_call_chain+0x35/0x4a
>> [ 159.768005] [<c0136884>] srcu_notifier_call_chain+0x9/0xc
>> [ 159.768005] [<c031cbca>] cpufreq_resume+0xf3/0x112
>> [ 159.768005] [<c028c3f3>] __sysdev_resume+0x24/0x34
>> [ 159.768005] [<c028c421>] sysdev_resume+0x1e/0x50
>
> Thanks for testing. It's exaclty the code path I described :)
>
> So my code analysis holds and your test confirms my suspicion that
> Shaggy's patch just unearthed some other weirdness in the
> suspend/resume code.
>
> Can you please apply the following hack^Wpatch and retest ? It
> restores Shaggys patch, but prevents the sched_clock_tick() call when
> timekeeping is not resumed. The WARN_ON should not longer trigger
> except there is some other code path which fiddles with that as well.
>
> If I'm not completely nuts then this should solve your suspend/resume
> problem really instead of papering over the root cause.
>
> The patch should work on top of 2.6.27.10 as well, so whatever is
> easier to verify for you is fine.
>
> Thanks,
>
> tglx
> ---
> diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
> index e8ab096..dd97801 100644
> --- a/kernel/sched_clock.c
> +++ b/kernel/sched_clock.c
> @@ -124,7 +124,7 @@ static u64 __update_sched_clock(struct sched_clock_data *scd, u64 now)
>
> clock = scd->tick_gtod + delta;
> min_clock = wrap_max(scd->tick_gtod, scd->clock);
> - max_clock = scd->tick_gtod + TICK_NSEC;
> + max_clock = wrap_max(scd->clock, scd->tick_gtod + TICK_NSEC);
>
> clock = wrap_max(clock, min_clock);
> clock = wrap_min(clock, max_clock);
> @@ -222,11 +222,16 @@ void sched_clock_idle_sleep_event(void)
> }
> EXPORT_SYMBOL_GPL(sched_clock_idle_sleep_event);
>
> +extern int timekeeping_suspended;
> +
> /*
> * We just idled delta nanoseconds (called with irqs disabled):
> */
> void sched_clock_idle_wakeup_event(u64 delta_ns)
> {
> + if (timekeeping_suspended)
> + return;
> +
> sched_clock_tick();
> touch_softlockup_watchdog();
> }
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index fa05e88..50ba3d0 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -46,6 +46,9 @@ struct timespec xtime __attribute__ ((aligned (16)));
> struct timespec wall_to_monotonic __attribute__ ((aligned (16)));
> static unsigned long total_sleep_time; /* seconds */
>
> +/* flag for if timekeeping is suspended */
> +int timekeeping_suspended;
> +
> static struct timespec xtime_cache __attribute__ ((aligned (16)));
> void update_xtime_cache(u64 nsec)
> {
> @@ -92,6 +95,8 @@ void getnstimeofday(struct timespec *ts)
> unsigned long seq;
> s64 nsecs;
>
> + WARN_ON(timekeeping_suspended);
> +
> do {
> seq = read_seqbegin(&xtime_lock);
>
> @@ -299,8 +304,6 @@ void __init timekeeping_init(void)
> write_sequnlock_irqrestore(&xtime_lock, flags);
> }
>
> -/* flag for if timekeeping is suspended */
> -static int timekeeping_suspended;
> /* time in seconds when suspend began */
> static unsigned long timekeeping_suspend_time;
>
>

2008-12-22 22:17:27

by Thomas Gleixner

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

On Mon, 22 Dec 2008, Fabio Comolli wrote:

> Won't apply:
>
> fcomolli@hawking:~/software/GIT-TREES/linux-2.6> patch -p1 --dry-run <
> /home/fcomolli/0009c-tglx-debug.patch
> patching file kernel/sched_clock.c
> patching file kernel/time/timekeeping.c
> Hunk #1 succeeded at 49 with fuzz 2 (offset 3 lines).
> Hunk #2 succeeded at 100 with fuzz 2 (offset 5 lines).
> Hunk #3 FAILED at 309.
> 1 out of 3 hunks FAILED -- saving rejects to file kernel/time/timekeeping.c.rej
>
> Should I revert you previous one?

Oops, yes.

Thanks,

tglx

2008-12-22 22:30:01

by Fabio Comolli

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

OK, it seems that you got it.

The system survived two hibernation / resume cycles without triggering
the warning and, probably the most important part, the dmesg does not
have any "jumps" before the "Force enable HPET on resume" line as
before:

[ 275.647704] PM: Creating hibernation image:
[ 275.648005] PM: Need to copy 177844 pages
[ 275.648005] Intel machine check architecture supported.
[ 275.648005] Intel machine check reporting enabled on CPU#0.
[ 275.648005] Force enabled HPET at resume
[ 275.648005] ACPI: Waking up from system sleep state S4
[ 275.669038] ACPI: EC: non-query interrupt received, switching to
interrupt mode
[ 275.740697] pcieport-driver 0000:00:01.0: setting latency timer to 64
[ 275.740781] pcieport-driver 0000:00:1c.0: setting latency timer to 64
[ 275.740786] pciehp 0000:00:1c.0:pcie02: pciehp_resume ENTRY

...and so on.

As i mentioned in the bug report, to be sure that the bug doesn't
trigger I have to wait at least on day and > 10 hibernation / resumee
cycles.

But with your first patch the bug triggered at the second cycle, so
I'm optimist :-)

I'll keep you informed. Many thanks in advance,
Fabio

On Mon, Dec 22, 2008 at 11:16 PM, Thomas Gleixner <[email protected]> wrote:
>
> On Mon, 22 Dec 2008, Fabio Comolli wrote:
>
>> Won't apply:
>>
>> fcomolli@hawking:~/software/GIT-TREES/linux-2.6> patch -p1 --dry-run <
>> /home/fcomolli/0009c-tglx-debug.patch
>> patching file kernel/sched_clock.c
>> patching file kernel/time/timekeeping.c
>> Hunk #1 succeeded at 49 with fuzz 2 (offset 3 lines).
>> Hunk #2 succeeded at 100 with fuzz 2 (offset 5 lines).
>> Hunk #3 FAILED at 309.
>> 1 out of 3 hunks FAILED -- saving rejects to file kernel/time/timekeeping.c.rej
>>
>> Should I revert you previous one?
>
> Oops, yes.
>
> Thanks,
>
> tglx
>

2008-12-22 22:34:16

by Thomas Gleixner

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

Fabio,

On Mon, 22 Dec 2008, Fabio Comolli wrote:

> OK, it seems that you got it.
>
> The system survived two hibernation / resume cycles without triggering
> the warning and, probably the most important part, the dmesg does not
> have any "jumps" before the "Force enable HPET on resume" line as
> before:
>
> [ 275.647704] PM: Creating hibernation image:
> [ 275.648005] PM: Need to copy 177844 pages
> [ 275.648005] Intel machine check architecture supported.
> [ 275.648005] Intel machine check reporting enabled on CPU#0.
> [ 275.648005] Force enabled HPET at resume
> [ 275.648005] ACPI: Waking up from system sleep state S4
> [ 275.669038] ACPI: EC: non-query interrupt received, switching to
> interrupt mode
> [ 275.740697] pcieport-driver 0000:00:01.0: setting latency timer to 64
> [ 275.740781] pcieport-driver 0000:00:1c.0: setting latency timer to 64
> [ 275.740786] pciehp 0000:00:1c.0:pcie02: pciehp_resume ENTRY
>
> ...and so on.
>
> As i mentioned in the bug report, to be sure that the bug doesn't
> trigger I have to wait at least on day and > 10 hibernation / resumee
> cycles.

Sure.

> But with your first patch the bug triggered at the second cycle, so
> I'm optimist :-)
>
> I'll keep you informed. Many thanks in advance,

Many thanks to you for following up on this.

tglx

2008-12-23 00:28:57

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

On Monday, 22 of December 2008, Thomas Gleixner wrote:
> Fabio,
>
> On Mon, 22 Dec 2008, Fabio Comolli wrote:
>
> > OK, it seems that you got it.
> >
> > The system survived two hibernation / resume cycles without triggering
> > the warning and, probably the most important part, the dmesg does not
> > have any "jumps" before the "Force enable HPET on resume" line as
> > before:
> >
> > [ 275.647704] PM: Creating hibernation image:
> > [ 275.648005] PM: Need to copy 177844 pages
> > [ 275.648005] Intel machine check architecture supported.
> > [ 275.648005] Intel machine check reporting enabled on CPU#0.
> > [ 275.648005] Force enabled HPET at resume
> > [ 275.648005] ACPI: Waking up from system sleep state S4
> > [ 275.669038] ACPI: EC: non-query interrupt received, switching to
> > interrupt mode
> > [ 275.740697] pcieport-driver 0000:00:01.0: setting latency timer to 64
> > [ 275.740781] pcieport-driver 0000:00:1c.0: setting latency timer to 64
> > [ 275.740786] pciehp 0000:00:1c.0:pcie02: pciehp_resume ENTRY
> >
> > ...and so on.
> >
> > As i mentioned in the bug report, to be sure that the bug doesn't
> > trigger I have to wait at least on day and > 10 hibernation / resumee
> > cycles.
>
> Sure.
>
> > But with your first patch the bug triggered at the second cycle, so
> > I'm optimist :-)
> >
> > I'll keep you informed. Many thanks in advance,
>
> Many thanks to you for following up on this.

Well, thanks a lot to you both for debugging this and finding a fix.

Rafael

2008-12-23 00:47:57

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

On Sunday, 21 of December 2008, Thomas Gleixner wrote:
> On Thu, 18 Dec 2008, Rafael J. Wysocki wrote:
>
> > On Thursday, 18 of December 2008, Ingo Molnar wrote:
> > >
> > > * Peter Zijlstra <[email protected]> wrote:
> > >
> > > > Rafael, would something like this explain why we had to revert Shaggy's
> > > > patch?
> >
> > Well, I have yet to understand what the suspend-resume of the timekeeping code
> > actually does.
>
> Thats rather simple:
>
> suspend() saves the current time of the persistent clock (if
> available), forwards the timekeeping variables so they can be reused
> on resume, disables timekeeping activities and shuts down the clock
> events layer.
>
> resume() estimates the suspend time via persistent clock (if
> available) and update xtime with the sleep length. After that it
> reactivates timekeeping and resumes clock events and high resolution
> timers.
>
> So the sole purpose is:
> - dis/enable timekeeping and clock event devices.
> - keep track of the suspend time (if a persistent clock is available)
>
> We reactivate clock event devices and hrtimers from timekeeping_resume
> because clock events depend on functional timekeeping.

Thanks for the explanation. In fact, the reactivation of clock event devices
and hrtimers is the part I'm not familiar with.

> > The original description sounds worrisome to me, it looks like we've overlooked
> > something at least.
>
> Care to explain ?

Well, the fact that in the resume code path the clocksource is only updated as
a result of executing pci_set_power_state() is worrisome. I would be more
reliable to update it directly at one point.

Thanks,
Rafael

2008-12-23 07:03:52

by Thomas Gleixner

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

On Tue, 23 Dec 2008, Rafael J. Wysocki wrote:
> > Care to explain ?
>
> Well, the fact that in the resume code path the clocksource is only updated as
> a result of executing pci_set_power_state() is worrisome. I would be more
> reliable to update it directly at one point.

Hmm. We reactivate the clock sources from timekeeping resume():

clocksource_resume();

How is this related to pci_set_power_state() ?

Thanks,

tglx

2008-12-23 14:15:38

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

On Tuesday, 23 of December 2008, Thomas Gleixner wrote:
> On Tue, 23 Dec 2008, Rafael J. Wysocki wrote:
> > > Care to explain ?
> >
> > Well, the fact that in the resume code path the clocksource is only updated as
> > a result of executing pci_set_power_state() is worrisome. I would be more
> > reliable to update it directly at one point.
>
> Hmm. We reactivate the clock sources from timekeeping resume():
>
> clocksource_resume();
>
> How is this related to pci_set_power_state() ?

>From the original report:

"I am using ftrace to get a detailed timing analysis of the resume process
on the OLPC XO laptop and am seeing that when we start running again, the
system timestamep is not being updated for several (hundreds of) thousands
of cycles (~2000 ftrace entries). From following the ftrace, what is
happening is that the clocksource is not updated until we run the cpu_idle()
thread due to an explicit scheduling operation in the resume path that occurs
via call to msleep from pci_set_power_state()."

That's what I was referring to.

Thanks,
Rafael

2008-12-23 19:42:54

by Fabio Comolli

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

Hi Thomas.
Everything seems to work fine after further testing.
Regards,
Fabio

On Mon, Dec 22, 2008 at 11:33 PM, Thomas Gleixner <[email protected]> wrote:
> Fabio,
>
> On Mon, 22 Dec 2008, Fabio Comolli wrote:
>
>> OK, it seems that you got it.
>>
>> The system survived two hibernation / resume cycles without triggering
>> the warning and, probably the most important part, the dmesg does not
>> have any "jumps" before the "Force enable HPET on resume" line as
>> before:
>>
>> [ 275.647704] PM: Creating hibernation image:
>> [ 275.648005] PM: Need to copy 177844 pages
>> [ 275.648005] Intel machine check architecture supported.
>> [ 275.648005] Intel machine check reporting enabled on CPU#0.
>> [ 275.648005] Force enabled HPET at resume
>> [ 275.648005] ACPI: Waking up from system sleep state S4
>> [ 275.669038] ACPI: EC: non-query interrupt received, switching to
>> interrupt mode
>> [ 275.740697] pcieport-driver 0000:00:01.0: setting latency timer to 64
>> [ 275.740781] pcieport-driver 0000:00:1c.0: setting latency timer to 64
>> [ 275.740786] pciehp 0000:00:1c.0:pcie02: pciehp_resume ENTRY
>>
>> ...and so on.
>>
>> As i mentioned in the bug report, to be sure that the bug doesn't
>> trigger I have to wait at least on day and > 10 hibernation / resumee
>> cycles.
>
> Sure.
>
>> But with your first patch the bug triggered at the second cycle, so
>> I'm optimist :-)
>>
>> I'll keep you informed. Many thanks in advance,
>
> Many thanks to you for following up on this.
>
> tglx
>

2008-12-23 22:04:12

by Deepak Saxena

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

On Sun, Dec 21, 2008 at 10:19:12PM +0100, Thomas Gleixner wrote:
> > of cycles (~2000 ftrace entries). From following the ftrace, what is
> > happening is that the clocksource is not updated until we run the cpu_idle()
> > thread due to an explicit scheduling operation in the resume path that occurs
> > via call to msleep from pci_set_power_state(). As I'm still fully groking the
> > timekeeping code, the question(s) I have is whether this is expected behaviour
> > and I should not assume valid timestamp data in the initial bits of the suspend path,
> > whether this is an OLPC-specific bug, or whether I've uncovered a generic bug in
> > the timekeeping implementation. This is on 2.6.27.7 as I've not gotten 28-rc
> > up and running on the XO but can move that up in priority if this behaviour
> > is different in newer kernels.
>
> Can you please upload the full trace somewhere ?
>
> Thanks,

Thomas,

Sorry for the delayed response...travelling and have very limited
connectivity. I will read the thread in detail when back online in a
few days.

http://dev.laptop.org/~dsaxena/suspend-trace-1267

~Deepak

2008-12-25 18:32:15

by Pavel Machek

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

Hi!
> > >
> > > Can please you revert the last patch and apply the following ? Does
> > > the WARN_ON trigger ?
> > >
>
> > Yes, it does:
>
> > [ 159.768005] [<c01389a2>] getnstimeofday+0x21/0xcd
> > [ 159.768005] [<c0135e26>] ktime_get_ts+0x1d/0x3f
> > [ 159.768005] [<c0135e57>] ktime_get+0xf/0x2b
> > [ 159.768005] [<c013736b>] sched_clock_tick+0x46/0x83
> > [ 159.768005] [<c01373ad>] sched_clock_idle_wakeup_event+0x5/0xa
> > [ 159.768005] [<c0108670>] set_cyc2ns_scale+0x3f/0x5e
> > [ 159.768005] [<c01088af>] time_cpufreq_notifier+0xf9/0x103
> > [ 159.768005] [<c0136782>] notifier_call_chain+0x2a/0x52
> > [ 159.768005] [<c0136866>] __srcu_notifier_call_chain+0x35/0x4a
> > [ 159.768005] [<c0136884>] srcu_notifier_call_chain+0x9/0xc
> > [ 159.768005] [<c031cbca>] cpufreq_resume+0xf3/0x112
> > [ 159.768005] [<c028c3f3>] __sysdev_resume+0x24/0x34
> > [ 159.768005] [<c028c421>] sysdev_resume+0x1e/0x50
>
> Thanks for testing. It's exaclty the code path I described :)
>
> So my code analysis holds and your test confirms my suspicion that
> Shaggy's patch just unearthed some other weirdness in the
> suspend/resume code.
>
> Can you please apply the following hack^Wpatch and retest ? It
> restores Shaggys patch, but prevents the sched_clock_tick() call when
> timekeeping is not resumed. The WARN_ON should not longer trigger
> except there is some other code path which fiddles with that as well.
>
> If I'm not completely nuts then this should solve your suspend/resume
> problem really instead of papering over the root cause.

Should we move timekeeping resume before cpufreq resume, instead of this?

> The patch should work on top of 2.6.27.10 as well, so whatever is
> easier to verify for you is fine.
>
> Thanks,
>
> tglx
> ---
> diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
> index e8ab096..dd97801 100644
> --- a/kernel/sched_clock.c
> +++ b/kernel/sched_clock.c
> @@ -124,7 +124,7 @@ static u64 __update_sched_clock(struct sched_clock_data *scd, u64 now)
>
> clock = scd->tick_gtod + delta;
> min_clock = wrap_max(scd->tick_gtod, scd->clock);
> - max_clock = scd->tick_gtod + TICK_NSEC;
> + max_clock = wrap_max(scd->clock, scd->tick_gtod + TICK_NSEC);
>
> clock = wrap_max(clock, min_clock);
> clock = wrap_min(clock, max_clock);
> @@ -222,11 +222,16 @@ void sched_clock_idle_sleep_event(void)
> }
> EXPORT_SYMBOL_GPL(sched_clock_idle_sleep_event);
>
> +extern int timekeeping_suspended;
> +
> /*
> * We just idled delta nanoseconds (called with irqs disabled):
> */
> void sched_clock_idle_wakeup_event(u64 delta_ns)
> {
> + if (timekeeping_suspended)
> + return;
> +
> sched_clock_tick();
> touch_softlockup_watchdog();
> }
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index fa05e88..50ba3d0 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -46,6 +46,9 @@ struct timespec xtime __attribute__ ((aligned (16)));
> struct timespec wall_to_monotonic __attribute__ ((aligned (16)));
> static unsigned long total_sleep_time; /* seconds */
>
> +/* flag for if timekeeping is suspended */
> +int timekeeping_suspended;
> +
> static struct timespec xtime_cache __attribute__ ((aligned (16)));
> void update_xtime_cache(u64 nsec)
> {
> @@ -92,6 +95,8 @@ void getnstimeofday(struct timespec *ts)
> unsigned long seq;
> s64 nsecs;
>
> + WARN_ON(timekeeping_suspended);
> +
> do {
> seq = read_seqbegin(&xtime_lock);
>
> @@ -299,8 +304,6 @@ void __init timekeeping_init(void)
> write_sequnlock_irqrestore(&xtime_lock, flags);
> }
>
> -/* flag for if timekeeping is suspended */
> -static int timekeeping_suspended;
> /* time in seconds when suspend began */
> static unsigned long timekeeping_suspend_time;

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-12-26 08:47:54

by Peter Zijlstra

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

On Wed, 2008-12-24 at 15:09 +0100, Pavel Machek wrote:
> Hi!
> > > >
> > > > Can please you revert the last patch and apply the following ? Does
> > > > the WARN_ON trigger ?
> > > >
> >
> > > Yes, it does:
> >
> > > [ 159.768005] [<c01389a2>] getnstimeofday+0x21/0xcd
> > > [ 159.768005] [<c0135e26>] ktime_get_ts+0x1d/0x3f
> > > [ 159.768005] [<c0135e57>] ktime_get+0xf/0x2b
> > > [ 159.768005] [<c013736b>] sched_clock_tick+0x46/0x83
> > > [ 159.768005] [<c01373ad>] sched_clock_idle_wakeup_event+0x5/0xa
> > > [ 159.768005] [<c0108670>] set_cyc2ns_scale+0x3f/0x5e
> > > [ 159.768005] [<c01088af>] time_cpufreq_notifier+0xf9/0x103
> > > [ 159.768005] [<c0136782>] notifier_call_chain+0x2a/0x52
> > > [ 159.768005] [<c0136866>] __srcu_notifier_call_chain+0x35/0x4a
> > > [ 159.768005] [<c0136884>] srcu_notifier_call_chain+0x9/0xc
> > > [ 159.768005] [<c031cbca>] cpufreq_resume+0xf3/0x112
> > > [ 159.768005] [<c028c3f3>] __sysdev_resume+0x24/0x34
> > > [ 159.768005] [<c028c421>] sysdev_resume+0x1e/0x50
> >
> > Thanks for testing. It's exaclty the code path I described :)
> >
> > So my code analysis holds and your test confirms my suspicion that
> > Shaggy's patch just unearthed some other weirdness in the
> > suspend/resume code.
> >
> > Can you please apply the following hack^Wpatch and retest ? It
> > restores Shaggys patch, but prevents the sched_clock_tick() call when
> > timekeeping is not resumed. The WARN_ON should not longer trigger
> > except there is some other code path which fiddles with that as well.
> >
> > If I'm not completely nuts then this should solve your suspend/resume
> > problem really instead of papering over the root cause.
>
> Should we move timekeeping resume before cpufreq resume, instead of this?

That seems a sensible suggestion (but I've got no clue on how practical
that is). A quick look at the boot code suggests that regular bootups
have that sequence as well, so mirroring that in the resume code seems
like the best all-round solution.

2008-12-17 17:50:52

by Steven Rostedt

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

I added Peter Zijlstra too, since he works on the cpu clock as well.

-- Steve

On Wed, 2008-12-17 at 09:27 -0800, Deepak Saxena wrote:
> Hi,
>
> I am using ftrace to get a detailed timing analysis of the resume process
> on the OLPC XO laptop and am seeing that when we start running again, the
> system timestamep is not being updated for several (hundreds of) thousands
> of cycles (~2000 ftrace entries). From following the ftrace, what is
> happening is that the clocksource is not updated until we run the cpu_idle()
> thread due to an explicit scheduling operation in the resume path that occurs
> via call to msleep from pci_set_power_state(). As I'm still fully groking the
> timekeeping code, the question(s) I have is whether this is expected behaviour
> and I should not assume valid timestamp data in the initial bits of the suspend path,
> whether this is an OLPC-specific bug, or whether I've uncovered a generic bug in
> the timekeeping implementation. This is on 2.6.27.7 as I've not gotten 28-rc
> up and running on the XO but can move that up in priority if this behaviour
> is different in newer kernels.
>
> Thanks,
> ~Deepak
>
> # dsaxena: Return from firmware
> do_suspend_trac-1267 [00] 154.925216: restore_processor_state <-ret_point
> do_suspend_trac-1267 [00] 154.925216: enable_sep_cpu <-restore_processor_state
> do_suspend_trac-1267 [00] 154.925216: olpc_fixup_wakeup <-olpc_pm_enter
> do_suspend_trac-1267 [00] 154.925216: geode_get_dev_base <-olpc_fixup_wakeup
> do_suspend_trac-1267 [00] 154.925216: olpc_ec_cmd <-olpc_fixup_wakeup
> do_suspend_trac-1267 [00] 154.925216: _spin_lock_irqsave <-olpc_ec_cmd
> do_suspend_trac-1267 [00] 154.925216: __wait_on_ibf <-olpc_ec_cmd
> do_suspend_trac-1267 [00] 154.925216: printk <-olpc_ec_cmd
> do_suspend_trac-1267 [00] 154.925216: vprintk <-printk
> do_suspend_trac-1267 [00] 154.925216: _spin_lock <-vprintk
> do_suspend_trac-1267 [00] 154.925216: emit_log_char <-vprintk
> do_suspend_trac-1267 [00] 154.925216: emit_log_char <-vprintk
> ....
> # pci_set_power_state() calls msleep(10) as per PCI spec
> do_suspend_trac-1267 [00] 154.925216: msleep <-pci_set_power_state
> do_suspend_trac-1267 [00] 154.925216: msecs_to_jiffies <-msleep
> do_suspend_trac-1267 [00] 154.925216: schedule_timeout_uninterruptible <-msleep
> do_suspend_trac-1267 [00] 154.925216: schedule_timeout <-schedule_timeout_uninterruptible
> do_suspend_trac-1267 [00] 154.925216: init_timer_on_stack <-schedule_timeout
> do_suspend_trac-1267 [00] 154.925216: __init_timer <-init_timer_on_stack
> do_suspend_trac-1267 [00] 154.925216: __mod_timer <-schedule_timeout
> do_suspend_trac-1267 [00] 154.925216: __timer_stats_timer_set_start_info <-__mod_timer
> do_suspend_trac-1267 [00] 154.925216: lock_timer_base <-__mod_timer
> do_suspend_trac-1267 [00] 154.925216: _spin_lock_irqsave <-lock_timer_base
> do_suspend_trac-1267 [00] 154.925216: internal_add_timer <-__mod_timer
> do_suspend_trac-1267 [00] 154.925216: _spin_unlock_irqrestore <-__mod_timer
> do_suspend_trac-1267 [00] 154.925216: _spin_lock <-schedule
> do_suspend_trac-1267 [00] 154.925216: marker_probe_cb <-schedule
> do_suspend_trac-1267 [00] 154.925216: _spin_unlock <-tracing_record_cmdline
> do_suspend_trac-1267 [00] 154.925216: __switch_to <-schedule
> <idle>-0 [00] 154.925216: _spin_unlock_irq <-finish_task_switch
> <idle>-0 [00] 154.925216: tick_nohz_stop_sched_tick <-cpu_idle
> <idle>-0 [00] 154.925216: ktime_get <-tick_nohz_stop_sched_tick
> <idle>-0 [00] 154.925216: ktime_get_ts <-ktime_get
> <idle>-0 [00] 154.925216: getnstimeofday <-ktime_get_ts
> <idle>-0 [00] 154.925216: set_normalized_timespec <-ktime_get_ts
> <idle>-0 [00] 154.925216: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
> <idle>-0 [00] 154.925216: _spin_lock <-get_next_timer_interrupt
> <idle>-0 [00] 154.925216: _spin_unlock <-get_next_timer_interrupt
> <idle>-0 [00] 154.925216: hrtimer_get_next_event <-get_next_timer_interrupt
> <idle>-0 [00] 154.925216: _spin_lock_irqsave <-hrtimer_get_next_event
> <idle>-0 [00] 154.925216: _spin_unlock_irqrestore <-hrtimer_get_next_event
> <idle>-0 [00] 154.925216: rcu_needs_cpu <-tick_nohz_stop_sched_tick
> <idle>-0 [00] 154.925216: rcu_pending <-rcu_needs_cpu
> <idle>-0 [00] 154.925216: __rcu_pending <-rcu_pending
> <idle>-0 [00] 154.925216: __rcu_pending <-rcu_pending
> <idle>-0 [00] 154.925216: rcu_pending <-cpu_idle
> <idle>-0 [00] 154.925216: __rcu_pending <-rcu_pending
> <idle>-0 [00] 154.925216: __rcu_pending <-rcu_pending
> <idle>-0 [00] 154.925216: default_idle <-cpu_idle
> <idle>-0 [00] 154.925216: do_IRQ <-common_interrupt
> <idle>-0 [00] 154.925216: irq_enter <-do_IRQ
> <idle>-0 [00] 154.925216: tick_nohz_stop_idle <-irq_enter
> <idle>-0 [00] 154.925216: ktime_get <-tick_nohz_stop_idle
> <idle>-0 [00] 154.925216: ktime_get_ts <-ktime_get
> <idle>-0 [00] 154.925216: getnstimeofday <-ktime_get_ts
> <idle>-0 [00] 154.925216: set_normalized_timespec <-ktime_get_ts
> <idle>-0 [00] 154.925216: ktime_get <-sched_clock_tick
> <idle>-0 [00] 154.925216: ktime_get_ts <-ktime_get
> <idle>-0 [00] 154.925216: getnstimeofday <-ktime_get_ts
> <idle>-0 [00] 154.925216: set_normalized_timespec <-ktime_get_ts
> # ktime_get_ts() calls getnstimeofday() which ends up calling clocksource_read()
> # thus updating the kernel timestamp. note that we call getnstimeofday()
> # twice above but for some reason they do not impact the time stamp...
> # There is an IRQ in here, possibly an HRT-driven IRQ and I'm wondering
> # if that is kicking the TSC into ticking?
> <idle>-0 [00] 154.929837: touch_softlockup_watchdog <-sched_clock_idle_wakeup_event
> <idle>-0 [00] 154.929839: tick_nohz_update_jiffies <-irq_enter
>
>
> ~Deepak
>

2008-12-18 09:25:31

by Peter Zijlstra

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

On Wed, 2008-12-17 at 12:50 -0500, Steven Rostedt wrote:
> I added Peter Zijlstra too, since he works on the cpu clock as well.

Thanks Steve..

> On Wed, 2008-12-17 at 09:27 -0800, Deepak Saxena wrote:
> > Hi,
> >
> > I am using ftrace to get a detailed timing analysis of the resume process
> > on the OLPC XO laptop and am seeing that when we start running again, the
> > system timestamep is not being updated for several (hundreds of) thousands
> > of cycles (~2000 ftrace entries). From following the ftrace, what is
> > happening is that the clocksource is not updated until we run the cpu_idle()
> > thread due to an explicit scheduling operation in the resume path that occurs
> > via call to msleep from pci_set_power_state(). As I'm still fully groking the
> > timekeeping code, the question(s) I have is whether this is expected behaviour
> > and I should not assume valid timestamp data in the initial bits of the suspend path,
> > whether this is an OLPC-specific bug, or whether I've uncovered a generic bug in
> > the timekeeping implementation. This is on 2.6.27.7 as I've not gotten 28-rc
> > up and running on the XO but can move that up in priority if this behaviour
> > is different in newer kernels.

Right, so cpu_clock() relies on gtod and tsc to generate a timestamp. We
use a tick_raw (tsc) and tick_gtod, we then use tick_raw to compute a
delta in tsc and add that to tick_gtod and clip the stuff to a jiffy
window and filter backward motion.

On resume I imagine the TSC has a totally unrelated value to the TSC
before suspend, this would mean the clock would quickly end up either at
whatever place the backward motion filter saw last (new TSC < old TSC)
or at the jiffy window clip (new TSC > old TSC).

Time will only start running correctly again once the jiffy tick starts.

Then there is the acpi idle code and NOHZ stuff, that call the
sched_clock_tick() by hand in order to re-sync time (because an unknown
quantity of time passed and gtod has all the machinery of figuring that
out).

So yes, this is expected behaviour,.. OTOH I really didn't think much
about code relying on this timesource _that_ early and such.

Rafael, would something like this explain why we had to revert Shaggy's
patch? His patch fixes the backward motion filter and I'm at an utter
loss why that would break suspend.

> >
> > # dsaxena: Return from firmware
> > do_suspend_trac-1267 [00] 154.925216: restore_processor_state <-ret_point
> > do_suspend_trac-1267 [00] 154.925216: enable_sep_cpu <-restore_processor_state
> > do_suspend_trac-1267 [00] 154.925216: olpc_fixup_wakeup <-olpc_pm_enter
> > do_suspend_trac-1267 [00] 154.925216: geode_get_dev_base <-olpc_fixup_wakeup
> > do_suspend_trac-1267 [00] 154.925216: olpc_ec_cmd <-olpc_fixup_wakeup
> > do_suspend_trac-1267 [00] 154.925216: _spin_lock_irqsave <-olpc_ec_cmd
> > do_suspend_trac-1267 [00] 154.925216: __wait_on_ibf <-olpc_ec_cmd
> > do_suspend_trac-1267 [00] 154.925216: printk <-olpc_ec_cmd
> > do_suspend_trac-1267 [00] 154.925216: vprintk <-printk
> > do_suspend_trac-1267 [00] 154.925216: _spin_lock <-vprintk
> > do_suspend_trac-1267 [00] 154.925216: emit_log_char <-vprintk
> > do_suspend_trac-1267 [00] 154.925216: emit_log_char <-vprintk
> > ....
> > # pci_set_power_state() calls msleep(10) as per PCI spec
> > do_suspend_trac-1267 [00] 154.925216: msleep <-pci_set_power_state
> > do_suspend_trac-1267 [00] 154.925216: msecs_to_jiffies <-msleep
> > do_suspend_trac-1267 [00] 154.925216: schedule_timeout_uninterruptible <-msleep
> > do_suspend_trac-1267 [00] 154.925216: schedule_timeout <-schedule_timeout_uninterruptible
> > do_suspend_trac-1267 [00] 154.925216: init_timer_on_stack <-schedule_timeout
> > do_suspend_trac-1267 [00] 154.925216: __init_timer <-init_timer_on_stack
> > do_suspend_trac-1267 [00] 154.925216: __mod_timer <-schedule_timeout
> > do_suspend_trac-1267 [00] 154.925216: __timer_stats_timer_set_start_info <-__mod_timer
> > do_suspend_trac-1267 [00] 154.925216: lock_timer_base <-__mod_timer
> > do_suspend_trac-1267 [00] 154.925216: _spin_lock_irqsave <-lock_timer_base
> > do_suspend_trac-1267 [00] 154.925216: internal_add_timer <-__mod_timer
> > do_suspend_trac-1267 [00] 154.925216: _spin_unlock_irqrestore <-__mod_timer
> > do_suspend_trac-1267 [00] 154.925216: _spin_lock <-schedule
> > do_suspend_trac-1267 [00] 154.925216: marker_probe_cb <-schedule
> > do_suspend_trac-1267 [00] 154.925216: _spin_unlock <-tracing_record_cmdline
> > do_suspend_trac-1267 [00] 154.925216: __switch_to <-schedule
> > <idle>-0 [00] 154.925216: _spin_unlock_irq <-finish_task_switch
> > <idle>-0 [00] 154.925216: tick_nohz_stop_sched_tick <-cpu_idle
> > <idle>-0 [00] 154.925216: ktime_get <-tick_nohz_stop_sched_tick
> > <idle>-0 [00] 154.925216: ktime_get_ts <-ktime_get
> > <idle>-0 [00] 154.925216: getnstimeofday <-ktime_get_ts
> > <idle>-0 [00] 154.925216: set_normalized_timespec <-ktime_get_ts
> > <idle>-0 [00] 154.925216: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
> > <idle>-0 [00] 154.925216: _spin_lock <-get_next_timer_interrupt
> > <idle>-0 [00] 154.925216: _spin_unlock <-get_next_timer_interrupt
> > <idle>-0 [00] 154.925216: hrtimer_get_next_event <-get_next_timer_interrupt
> > <idle>-0 [00] 154.925216: _spin_lock_irqsave <-hrtimer_get_next_event
> > <idle>-0 [00] 154.925216: _spin_unlock_irqrestore <-hrtimer_get_next_event
> > <idle>-0 [00] 154.925216: rcu_needs_cpu <-tick_nohz_stop_sched_tick
> > <idle>-0 [00] 154.925216: rcu_pending <-rcu_needs_cpu
> > <idle>-0 [00] 154.925216: __rcu_pending <-rcu_pending
> > <idle>-0 [00] 154.925216: __rcu_pending <-rcu_pending
> > <idle>-0 [00] 154.925216: rcu_pending <-cpu_idle
> > <idle>-0 [00] 154.925216: __rcu_pending <-rcu_pending
> > <idle>-0 [00] 154.925216: __rcu_pending <-rcu_pending
> > <idle>-0 [00] 154.925216: default_idle <-cpu_idle
> > <idle>-0 [00] 154.925216: do_IRQ <-common_interrupt
> > <idle>-0 [00] 154.925216: irq_enter <-do_IRQ
> > <idle>-0 [00] 154.925216: tick_nohz_stop_idle <-irq_enter
> > <idle>-0 [00] 154.925216: ktime_get <-tick_nohz_stop_idle
> > <idle>-0 [00] 154.925216: ktime_get_ts <-ktime_get
> > <idle>-0 [00] 154.925216: getnstimeofday <-ktime_get_ts
> > <idle>-0 [00] 154.925216: set_normalized_timespec <-ktime_get_ts
> > <idle>-0 [00] 154.925216: ktime_get <-sched_clock_tick
> > <idle>-0 [00] 154.925216: ktime_get_ts <-ktime_get
> > <idle>-0 [00] 154.925216: getnstimeofday <-ktime_get_ts
> > <idle>-0 [00] 154.925216: set_normalized_timespec <-ktime_get_ts
> > # ktime_get_ts() calls getnstimeofday() which ends up calling clocksource_read()
> > # thus updating the kernel timestamp. note that we call getnstimeofday()
> > # twice above but for some reason they do not impact the time stamp...
> > # There is an IRQ in here, possibly an HRT-driven IRQ and I'm wondering
> > # if that is kicking the TSC into ticking?
> > <idle>-0 [00] 154.929837: touch_softlockup_watchdog <-sched_clock_idle_wakeup_event
> > <idle>-0 [00] 154.929839: tick_nohz_update_jiffies <-irq_enter

2008-12-18 12:40:26

[permalink] [raw]

Subject: Re: TSC not updating after resume: Bug or Feature?

* Peter Zijlstra <[email protected]> wrote:

> Rafael, would something like this explain why we had to revert Shaggy's
> patch? His patch fixes the backward motion filter and I'm at an utter
> loss why that would break suspend.

yes, i'd love to have this commit reinstated:

5b7dba4: sched_clock: prevent scd->clock from moving backwards

and the bug triggered by hibernation fixed instead.

Ingo