TSCs get reset after suspend/resume, and this leads to a scenario where
rq->clock (sched_clock_cpu()) is less than rq->age_stamp. That results in a
big value returned by scale_rt_power(), and the resulting big group power
set by update_group_power() causes improper load balancing between busy and
idle CPUs after suspend/resume.
This resulted in multi-threaded workloads (like kernel compilation) running
slower after a suspend/resume cycle on Core i5 laptops.
Fix this by realigning rq->age_stamp to rq->clock and resetting rq->rt_avg
in rq_online_rt().
Addresses the primary issue reported in
https://bugzilla.kernel.org/show_bug.cgi?id=15559
(Bad performance after suspend on Intel i7 and i5)
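To illustrate the failure mode (a standalone sketch with made-up numbers,
not the kernel's actual scale_rt_power() arithmetic): rq->clock and
rq->age_stamp are unsigned 64-bit nanosecond counts, so a clock that falls
behind the stamp wraps the delta to an enormous value:

	#include <stdio.h>

	int main(void)
	{
		/* Hypothetical timestamps, in nanoseconds. */
		unsigned long long age_stamp = 500000000000ULL; /* stamped before suspend */
		unsigned long long clock     =   1000000000ULL; /* rebuilt from the reset TSC */

		/* The unsigned subtraction wraps to nearly 2^64. */
		printf("delta = %llu\n", clock - age_stamp);
		return 0;
	}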
Reported-by: Florian Pritz <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
Cc: [email protected] [2.6.32+]
---
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index d10c80e..a00af73 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1550,6 +1550,12 @@ static void rq_online_rt(struct rq *rq)
__enable_runtime(rq);
cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio.curr);
+
+ /*
+ * Reset RT average stat and its time stamp.
+ */
+ rq->age_stamp = rq->clock;
+ rq->rt_avg = 0;
}
/* Assumes rq->lock is held */
On Fri, 2010-08-13 at 12:45 -0700, Suresh Siddha wrote:
> plain text document attachment (sched_reset_rt_avg_stat_online.patch)
> TSC's get reset after suspend/resume and this leads to a scenario of
> rq->clock (sched_clock_cpu()) less than rq->age_stamp. This leads
> to a big value returned by scale_rt_power() and the resulting big group
> power set by the update_group_power() is causing improper load balancing
> between busy and idle cpu's after suspend/resume.
ARGH, so i[357] westmere mobile stops TSC on some power state?
Why don't we sync it back to the other CPUs instead?
Or does it simply mark TSCs unstable and leave it at that?
In any case, this needs to be fixed at the clock level, not like this.
On Mon, 2010-08-16 at 00:47 -0700, Peter Zijlstra wrote:
> On Fri, 2010-08-13 at 12:45 -0700, Suresh Siddha wrote:
> > plain text document attachment (sched_reset_rt_avg_stat_online.patch)
> > TSC's get reset after suspend/resume and this leads to a scenario of
> > rq->clock (sched_clock_cpu()) less than rq->age_stamp. This leads
> > to a big value returned by scale_rt_power() and the resulting big group
> > power set by the update_group_power() is causing improper load balancing
> > between busy and idle cpu's after suspend/resume.
>
> ARGH, so i[357] westmere mobile stops TSC on some power state?
WSM has a working TSC with a constant rate across P-, C- and T-states. This
issue is about suspend/resume (S-states).
> Why don't we sync it back to the other CPUs instead?
All the CPUs entered the suspend state, and during resume the TSC gets reset
on all of them.
>
> Or does it simply mark TSCs unstable and leaves it at that?
TSCs are stable and in sync after resume as well. If we want to do a SW
sync, we need to keep track of the time spent in the suspend state and do a
SW re-sync during resume, which can potentially disturb the HW sync.
>
> In any case, this needs to be fixed at the clock level, not like this.
If we have more such dependencies on the TSC, then we may need to address
the issue at the clock level as well. Nevertheless, across CPU
online/offline the current scheduler code expects the TSC (sched_clock) to
keep going forward, and I am not sure why we need to carry the rt_avg
history across online/offline.
thanks,
suresh
On Mon, 2010-08-16 at 10:36 -0700, Suresh Siddha wrote:
> On Mon, 2010-08-16 at 00:47 -0700, Peter Zijlstra wrote:
> > On Fri, 2010-08-13 at 12:45 -0700, Suresh Siddha wrote:
> > > plain text document attachment (sched_reset_rt_avg_stat_online.patch)
> > > TSC's get reset after suspend/resume and this leads to a scenario of
> > > rq->clock (sched_clock_cpu()) less than rq->age_stamp. This leads
> > > to a big value returned by scale_rt_power() and the resulting big group
> > > power set by the update_group_power() is causing improper load balancing
> > > between busy and idle cpu's after suspend/resume.
> >
> > ARGH, so i[357] westmere mobile stops TSC on some power state?
>
> WSM has working TSC with constant rate across P/C/T-states. This issue
> is about suspend/resume (S-states).
Hurm..
> > Why don't we sync it back to the other CPUs instead?
>
> All the cpu's entered suspend state and during resume it gets reset for
> all the CPU's.
Bloody lovely..
> > Or does it simply mark TSCs unstable and leaves it at that?
>
> TSCs are stable and in sync after resume aswell. If we want to do SW
> sync, we need to keep track of the time we spent in the suspend state
> and do a SW sync (during resume) that can potentially disturb the HW
> sync.
Nah, no need to track the time spent in S-states; simply not going
backwards would be enough: save before entering S, restore after coming
out.
You can use something like:
suspend:
	__get_cpu_var(cyc2ns_suspend) = sched_clock();
resume:
	for_each_possible_cpu(i)
		per_cpu(cyc2ns_offset, i) += per_cpu(cyc2ns_suspend, i);
or something like that to keep sched_clock() stable, which is exactly
what most (all?) of its users expect when we report the TSC as usable.
Not sure how to arrange the suspend bit to run on all CPUs though, as I
think we offline them all first or something.
> > In any case, this needs to be fixed at the clock level, not like this.
>
> If we have more such dependencies on TSC then we may need to address the
> issue at clock level aswell. Nevertheless, across cpu online/offline,
> current scheduler code is expecting TSC (sched_clock) to be going
> forward and not sure why we need to carry the rt_avg history across
> online/offline.
We assume sched_clock_cpu() _never_ goes backwards; when
sched_clock_stable, sched_clock_cpu() == sched_clock() (we could, and
probably should, do better on clock continuity when we flip
sched_clock_stable).
We carry rt_avg over suspend much like we carry pretty much all state
over suspend, including load_avg etc.; no reason to special-case it at
all.
On Mon, 2010-08-16 at 21:25 +0200, Peter Zijlstra wrote:
> You can use something like:
>
> suspend:
> __get_cpu_var(cyc2ns_suspend) = sched_clock();
>
> resume:
> for_each_possible_cpu(i)
> per_cpu(cyc2ns_offset, i) += per_cpu(cyc2ns_suspend);
>
> or something like that to keep sched_clock() stable, which is exactly
> what most (all?) its users expect when we report the TSC is usable.
That's actually broken: you only want a single offset, otherwise we
de-sync the TSC, which is bad.
So simply store the sched_clock() value at suspend time on the single
CPU that is still running, then on resume make sure sched_clock()
continues from there by adding that stamp to all CPU offsets.
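In code, that single-offset scheme would look roughly like this (a sketch
only, with hypothetical helper names; cyc2ns_offset is the existing x86
per-cpu variable, and the real implementation is in the patch further down
the thread):

	/* Sketch: one saved stamp, one common delta applied to every CPU. */
	static u64 sched_clock_suspend_stamp;

	/* Runs on the last CPU still online, right before entering the sleep state. */
	static void sched_clock_save_sketch(void)
	{
		sched_clock_suspend_stamp = sched_clock();
	}

	/* Runs on the resuming CPU, before the other CPUs are brought back up. */
	static void sched_clock_restore_sketch(void)
	{
		u64 delta = sched_clock_suspend_stamp - sched_clock();
		int cpu;

		/* The same delta everywhere keeps the per-CPU clocks in sync. */
		for_each_possible_cpu(cpu)
			per_cpu(cyc2ns_offset, cpu) += delta;
	}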
On Tue, 2010-08-17 at 01:51 -0700, Peter Zijlstra wrote:
> On Mon, 2010-08-16 at 21:25 +0200, Peter Zijlstra wrote:
> > You can use something like:
> >
> > suspend:
> > __get_cpu_var(cyc2ns_suspend) = sched_clock();
> >
> > resume:
> > for_each_possible_cpu(i)
> > per_cpu(cyc2ns_offset, i) += per_cpu(cyc2ns_suspend);
> >
> > or something like that to keep sched_clock() stable, which is exactly
> > what most (all?) its users expect when we report the TSC is usable.
>
> That's actually broken, you only want a single offset, otherwise we
> de-sync the TSC, which is bad.
>
> So simply store the sched_clock() value at suspend time on the single
> CPU that is still running, then on resume make sure sched_clock()
> continues there by adding that stamp to all CPU offsets.
Peter, that might not be enough. I should add that on my Lenovo T410
(which has a 2-core WSM CPU), the TSCs are somehow set to a strange big
value (for example 0xfffffffebc22f02e) after resume from S3. It looks like
the BIOS might be writing the TSC during resume. I am not sure if this is
the case for other OEM laptops as well; I am checking.
Such large TSC values (leading to a very big difference between
rq->clock and rq->age_stamp) won't be handled correctly by
scale_rt_power() either.
thanks,
suresh
On Wed, 2010-08-18 at 17:20 -0700, Suresh Siddha wrote:
> On Tue, 2010-08-17 at 01:51 -0700, Peter Zijlstra wrote:
> > On Mon, 2010-08-16 at 21:25 +0200, Peter Zijlstra wrote:
> > > You can use something like:
> > >
> > > suspend:
> > > __get_cpu_var(cyc2ns_suspend) = sched_clock();
> > >
> > > resume:
> > > for_each_possible_cpu(i)
> > > per_cpu(cyc2ns_offset, i) += per_cpu(cyc2ns_suspend);
> > >
> > > or something like that to keep sched_clock() stable, which is exactly
> > > what most (all?) its users expect when we report the TSC is usable.
> >
> > That's actually broken, you only want a single offset, otherwise we
> > de-sync the TSC, which is bad.
> >
> > So simply store the sched_clock() value at suspend time on the single
> > CPU that is still running, then on resume make sure sched_clock()
> > continues there by adding that stamp to all CPU offsets.
>
>
> Peter, That might not be enough. I should add that in my Lenovo T410
> (having 2 core wsm cpu), TSC's are somehow set to a strange big value
> (for example 0xfffffffebc22f02e) after resume from S3. It looks like
> bios might be writing TSC during resume. I am not sure if this is the
> case for other OEM laptops aswell. I am checking.
ARGH, please kill all SMM support for future CPUs ;-)
Are the TSCs still sync'ed though? If so, we can still compute an offset
and continue with things, although it requires something like:
	local_irq_save(flags);
	__get_cpu_var(cyc2ns_offset) = 0;
	offset = cyc2ns_suspend - sched_clock();
	local_irq_restore(flags);
	for_each_possible_cpu(i)
		per_cpu(cyc2ns_offset, i) = offset;
Which would take the funny offset into account and make it resume where
we left off.
If they got out of sync, we need to flip sched_clock_stable and work on
getting the sched_clock.c code to be monotonic over such a flip.
> So such large values of TSC (leading to a very big difference between
> rq->clock and rq->age_stamp) wont be correctly handled by
> scale_rt_power() either.
Still, we need to fix the clock, not fudge the users.
On Thu, 2010-08-19 at 01:53 -0700, Peter Zijlstra wrote:
> ARGH, please kill all SMM support for future CPUs ;-)
>
> Are the TSCs still sync'ed though?
Yes.
> If so, we can still compute a offset
> and continue with things, albeit it requires something like:
>
> local_irq_save(flags);
> __get_cpu_var(cyc2ns_offset) = 0;
> offset = cyc2ns_suspend - sched_clock();
> local_irq_restore(flags);
>
> for_each_possible_cpu(i)
> per_cpu(cyc2ns_offset, i) = offset;
>
> Which would take the funny offset into account and make it resume at
> where we left off.
>
> If they got out of sync, we need to flip sched_clock_stable and work on
> getting the sched_clock.c code to be monotonic over such a flip.
>
> > So such large values of TSC (leading to a very big difference between
> > rq->clock and rq->age_stamp) wont be correctly handled by
> > scale_rt_power() either.
>
> Still, we need to fix the clock, not fudge the users.
Ok. I have appended a patch doing this. It seems to fix the scheduler
performance issue triggered by suspend/resume. Can you please Ack it?
Thomas/Peter/Ingo: can you please pick this up if you have no other
objections? Thanks.
---
From: Suresh Siddha <[email protected]>
Subject: x86, tsc: recompute cyc2ns_offset's during resume from sleep states
TSCs get reset after suspend/resume (even on CPUs with an invariant TSC, which
runs at a constant rate across ACPI P-, C- and T-states). And on some systems
the BIOS seems to reinit the TSC to an arbitrarily large value (still sync'd
across CPUs) during resume.
This leads to a scenario where the scheduler rq->clock (sched_clock_cpu()) is
less than rq->age_stamp (introduced in 2.6.32). That results in a big value
returned by scale_rt_power(), and the resulting big group power set by
update_group_power() causes improper load balancing between busy and idle
CPUs after suspend/resume.
This resulted in multi-threaded workloads (like kernel compilation) running
slower after a suspend/resume cycle on Core i5 laptops.
Fix this by recomputing the cyc2ns_offset's during resume, so that
sched_clock() continues from the point where it left off during suspend.
Reported-by: Florian Pritz <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
Cc: [email protected] [2.6.32+]
---
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index c042729..1ca132f 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -59,5 +59,7 @@ extern void check_tsc_sync_source(int cpu);
extern void check_tsc_sync_target(void);
extern int notsc_setup(char *);
+extern void save_sched_clock_state(void);
+extern void restore_sched_clock_state(void);
#endif /* _ASM_X86_TSC_H */
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index ce8e502..d632934 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -626,6 +626,44 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
local_irq_restore(flags);
}
+static unsigned long long cyc2ns_suspend;
+
+void save_sched_clock_state(void)
+{
+	if (!sched_clock_stable)
+		return;
+
+	cyc2ns_suspend = sched_clock();
+}
+
+/*
+ * Even on processors with invariant TSC, TSC gets reset in some of the
+ * ACPI system sleep states. And in some systems BIOS seem to reinit TSC to
+ * arbitrary value (still sync'd across cpu's) during resume from such sleep
+ * states. To cope with this, recompute the cyc2ns_offset for each cpu so
+ * that sched_clock() continues from the point where it was left off during
+ * suspend.
+ */
+void restore_sched_clock_state(void)
+{
+	unsigned long long offset;
+	unsigned long flags;
+	int cpu;
+
+	if (!sched_clock_stable)
+		return;
+
+	local_irq_save(flags);
+
+	get_cpu_var(cyc2ns_offset) = 0;
+	offset = cyc2ns_suspend - sched_clock();
+
+	for_each_possible_cpu(cpu)
+		per_cpu(cyc2ns_offset, cpu) = offset;
+
+	local_irq_restore(flags);
+}
+
#ifdef CONFIG_CPU_FREQ
/* Frequency scaling support. Adjust the TSC based timer when the cpu frequency
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index e7e8c5f..87bb35e 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -113,6 +113,7 @@ static void __save_processor_state(struct saved_context *ctxt)
void save_processor_state(void)
{
__save_processor_state(&saved_context);
+ save_sched_clock_state();
}
#ifdef CONFIG_X86_32
EXPORT_SYMBOL(save_processor_state);
@@ -229,6 +230,7 @@ static void __restore_processor_state(struct saved_context *ctxt)
void restore_processor_state(void)
{
__restore_processor_state(&saved_context);
+ restore_sched_clock_state();
}
#ifdef CONFIG_X86_32
EXPORT_SYMBOL(restore_processor_state);
On Thu, 2010-08-19 at 17:03 -0700, Suresh Siddha wrote:
> Thomas/Peter/Ingo: can you please pick this up if you have no other
> objections. Thanks.
Looks good, will queue. Thanks Suresh!
Commit-ID: cd7240c0b900eb6d690ccee088a6c9b46dae815a
Gitweb: http://git.kernel.org/tip/cd7240c0b900eb6d690ccee088a6c9b46dae815a
Author: Suresh Siddha <[email protected]>
AuthorDate: Thu, 19 Aug 2010 17:03:38 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 20 Aug 2010 14:59:02 +0200
x86, tsc, sched: Recompute cyc2ns_offset's during resume from sleep states
TSCs get reset after suspend/resume (even on CPUs with an invariant TSC,
which runs at a constant rate across ACPI P-, C- and T-states). And on
some systems the BIOS seems to reinit the TSC to an arbitrarily large value
(still sync'd across CPUs) during resume.
This leads to a scenario where the scheduler rq->clock (sched_clock_cpu())
is less than rq->age_stamp (introduced in 2.6.32). That results in a big
value returned by scale_rt_power(), and the resulting big group power set
by update_group_power() causes improper load balancing between busy and
idle CPUs after suspend/resume.
This resulted in multi-threaded workloads (like kernel compilation) running
slower after a suspend/resume cycle on Core i5 laptops.
Fix this by recomputing the cyc2ns_offset's during resume, so that
sched_clock() continues from the point where it left off during
suspend.
Reported-by: Florian Pritz <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
Cc: <[email protected]> # [v2.6.32+]
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/tsc.h | 2 ++
arch/x86/kernel/tsc.c | 38 ++++++++++++++++++++++++++++++++++++++
arch/x86/power/cpu.c | 2 ++
3 files changed, 42 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index c042729..1ca132f 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -59,5 +59,7 @@ extern void check_tsc_sync_source(int cpu);
extern void check_tsc_sync_target(void);
extern int notsc_setup(char *);
+extern void save_sched_clock_state(void);
+extern void restore_sched_clock_state(void);
#endif /* _ASM_X86_TSC_H */
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index ce8e502..d632934 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -626,6 +626,44 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
local_irq_restore(flags);
}
+static unsigned long long cyc2ns_suspend;
+
+void save_sched_clock_state(void)
+{
+	if (!sched_clock_stable)
+		return;
+
+	cyc2ns_suspend = sched_clock();
+}
+
+/*
+ * Even on processors with invariant TSC, TSC gets reset in some of the
+ * ACPI system sleep states. And in some systems BIOS seem to reinit TSC to
+ * arbitrary value (still sync'd across cpu's) during resume from such sleep
+ * states. To cope with this, recompute the cyc2ns_offset for each cpu so
+ * that sched_clock() continues from the point where it was left off during
+ * suspend.
+ */
+void restore_sched_clock_state(void)
+{
+	unsigned long long offset;
+	unsigned long flags;
+	int cpu;
+
+	if (!sched_clock_stable)
+		return;
+
+	local_irq_save(flags);
+
+	get_cpu_var(cyc2ns_offset) = 0;
+	offset = cyc2ns_suspend - sched_clock();
+
+	for_each_possible_cpu(cpu)
+		per_cpu(cyc2ns_offset, cpu) = offset;
+
+	local_irq_restore(flags);
+}
+
#ifdef CONFIG_CPU_FREQ
/* Frequency scaling support. Adjust the TSC based timer when the cpu frequency
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index e7e8c5f..87bb35e 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -113,6 +113,7 @@ static void __save_processor_state(struct saved_context *ctxt)
void save_processor_state(void)
{
__save_processor_state(&saved_context);
+ save_sched_clock_state();
}
#ifdef CONFIG_X86_32
EXPORT_SYMBOL(save_processor_state);
@@ -229,6 +230,7 @@ static void __restore_processor_state(struct saved_context *ctxt)
void restore_processor_state(void)
{
__restore_processor_state(&saved_context);
+ restore_sched_clock_state();
}
#ifdef CONFIG_X86_32
EXPORT_SYMBOL(restore_processor_state);